A key difference is that AI models tend to contain actual pieces of the training data, and on occasion regurgitate it. Kind of like randomly reproducing parts of the book during the course of your career as a carpenter. That’s the kind of thing that actually results in copyright lawsuits and damages when real people do it. AI shouldn’t be getting a pass here.
Oh sure, if a copyright holder can demonstrate that a specific work is reproduced. Not just “I think your AI read my book and that’s why it’s so good at carpentry.”
The thing is that they’re all reproduced, at least in part. That’s how these models work.
Reproducing a work is a specific thing. Using an idea from that work, or a transformation of that idea, is not reproducing that work.
Again: If a copyright holder can show that an AI system has reproduced the text (or images, etc.) of a specific work, they should absolutely have a copyright claim.
But “you read my book, therefore everything you do is a derivative work of my book” is an incorrect legal argument. And when it escalates to “… and therefore I should get to shut you down,” it’s a threat of censorship.
The problem is that LLMs (and image AIs) effectively store pieces of works as correlations inside them, and occasionally spit some of them back out. It’s not just “it saw it”; it’s more like “it’s a scrapbook with fragments of all these different works.”
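A toy illustration of “storing works as correlations” (purely hypothetical; real LLMs and diffusion models are far more complex than this): a character-level Markov model trained on a tiny corpus memorizes it outright, because with so little data almost every stored correlation has exactly one continuation — sampling from it regurgitates verbatim spans of the training text.

```python
import random
from collections import defaultdict

# Tiny corpus + long context window = pure memorization (overfitting).
text = "It was the best of times, it was the worst of times."
ORDER = 8  # context length in characters

# The "model" is just a table of correlations: context -> next characters seen.
model = defaultdict(list)
for i in range(len(text) - ORDER):
    model[text[i:i + ORDER]].append(text[i + ORDER])

def sample(seed, length=40):
    """Generate text by repeatedly sampling a continuation for the last context."""
    out = seed
    for _ in range(length):
        choices = model.get(out[-ORDER:])
        if not choices:
            break  # context never seen in training; nothing to continue with
        out += random.choice(choices)
    return out

# Every generated span is, by construction, a verbatim fragment of the corpus.
print(sample("It was t"))
```

With a huge, deduplicated corpus the same mechanism generalizes instead of memorizing; the dispute in this thread is essentially about how often real models land on the memorization end of that spectrum.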
I’ve memorized some copyrighted works too.
If I perform them publicly, the copyright holder would have a case against me.
But the mere fact that I could recite those works doesn’t make everything that I say into a copyright violation.
The copyright holder has to show that I’ve actually reproduced their work, not just that I’ve memorized it inside my brain.
The difference is that your brain isn’t a piece of media that gets copied. The AI model is. So when it memorizes, it commits a copyright violation.
If that reasoning held, then every web browser, search engine bot, etc. would be violating copyright every time it accessed a web page, because doing so involves making a copy in memory.
Making an internal copy isn’t the same as publishing, performing, etc. a work.
There’s an implied license to copy web content for the purpose of displaying it. Copies for other purposes… not so much. There has been a whole series of lawsuits over the years over just how much you can copy, and for what purpose.
No, it doesn’t. Learning from copyrighted material is black-and-white fair use.
The fact that the AI isn’t intelligent doesn’t matter. It’s protected.
A person reading and internalizing concepts is considerably different from an algo slurping in every recorded work of fiction and occasionally shitting out a bit of mostly Shakespeare. One of these has agency and personhood; the other is a tool.
No, that’s not how these models work. You’re repeating the old saw about these being “collage machines”, which is a gross mischaracterization.
That article doesn’t show what you think it shows. There was a lot of discussion of it when it first came out and the examples of overfitting they managed to dig up were extreme edge cases of edge cases that took them a huge amount of effort to find. So that people don’t have to follow a Reddit link, from the top comment:

They identified images that were likely to be overtrained, then generated 175 million images to find cases where overtraining ended up duplicating an image.

We find 94 images are extracted. […] [We] find that a further 13 (for a total of 109 images) are near-copies of training examples

They’re purposefully trying to generate copies of training images using sophisticated techniques to do so, and even then fewer than one in a million of their generated images is a near copy.

And that’s on an older version of Stable Diffusion trained on only 160 million images. They actually generated more images than were used to train the model.
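The “fewer than one in a million” claim checks out from the paper’s own figures (109 near-copies found among 175 million generated images, versus roughly 160 million training images):

```python
# Figures as quoted from the extraction paper discussed above.
total_extracted = 109          # reported total of (near-)copies found
images_generated = 175_000_000
training_images = 160_000_000  # more images were generated than trained on

rate = total_extracted / images_generated
print(f"{rate:.2e} near-copies per generated image")   # ~6.23e-07
print(f"about 1 in {round(1 / rate):,} generations")   # about 1 in 1,605,505
print("fewer than one in a million:", rate < 1e-6)     # True
print("generated more than trained on:", images_generated > training_images)  # True
```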
Overfitting is an error state. Nobody wants the model to overfit on any of the input data, so training sets are deduplicated as much as possible to prevent it. They had to do this research on an early Stable Diffusion model that was already obsolete when they did the work, because modern Stable Diffusion models have been refined enough to avoid the problem.
If I were to read a carpentry book and then publish my own, “regurgitating” most of the original text, then I plagiarized and should be sued. Furthermore, if I were to write a song using the same melody as another copyrighted song, I’d get sued and lose, even if I could somehow prove that I had never heard the original.
I think the same rules should apply to AI-generated content. One rule I would like to see, and I don’t know if this has precedent, is that AI-generated content cannot be copyrighted. Otherwise AI could truly replace humans from a creative perspective, and it would become a race to generate as much content as possible.