Although I’m a firm believer that most AI models should be public domain or open source by default, the premise of “illegally trained LLMs” is flawed, because there really is no assurance that the LLMs currently in use were illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a deeply rooted freedom that provides a lot of positives to the world.
The idea of… well, ideas being copyrightable should leave anyone in this discussion shaking in their boots. Especially since, when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: copyright and Disney.
The underlying technology simply has more than enough good uses that banning it would just cause it to flourish in places that don’t ban it, which means, as usual, that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn’t tip further in their favor, to the point where AI technology exists only to benefit them.
If the model is built on the corpus of humanity, then humanity should benefit.
As per torrentfreak:

OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but in an older paper two databases are referenced: “Books1” and “Books2”. The first one contains roughly 63,000 titles and the latter around 294,000 titles.

These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.
Should be easy to defend against, outright trivial: OpenAI, just tell us what those Books1 and Books2 databases are. Where you got them from, the licensing contracts with publishers that you signed to give you access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.
…crickets. They pirated the lot of it; otherwise they would have already gotten that case thrown out. It’s US startup culture, plain and simple: “move fast and break laws”, get lots of money, have lots of money enabling you to pay the best lawyers to abuse the shit out of the US court system.
the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a deeply rooted freedom that provides a lot of positives to the world
They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.
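To make the “regurgitating mechanism” point concrete, here is a minimal toy sketch in Python. It is nothing like a production LLM and not anyone’s actual pipeline, but it shows the mechanical claim: “training” on text means fitting a next-word predictor to that text, and with sparse data, generation replays the training material verbatim rather than “analyzing” it in any human sense.

```python
from collections import Counter, defaultdict

# Toy word-bigram "model": records which word follows which.
# A deliberately crude stand-in for next-token prediction,
# not an actual LLM architecture.

def train(corpus: str) -> dict:
    """Count, for every word, how often each other word follows it."""
    follows = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def generate(follows: dict, start: str, max_words: int = 20) -> str:
    """Greedily emit the most frequent continuation at each step."""
    out = [start]
    while len(out) < max_words:
        options = follows.get(out[-1])
        if not options:
            break
        out.append(options.most_common(1)[0][0])
    return " ".join(out)

# With a tiny "dataset", the model can only replay what it was fed:
corpus = "the boy who lived had a scar on his forehead"
model = train(corpus)
print(generate(model, "the"))  # reproduces the training sentence verbatim
```

Real LLMs generalize far better than this, of course, but the training objective is the same in kind: reproduce the next token of the source text.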
I agree that we shouldn’t strive for more strict copyright. We should fight for a much more liberal system. But as long as everyone else has to live by the current copyright laws, we should not let AI companies get away with what they’re doing.
Not to mention patent laws are bullshit. There are law offices that exist specifically to fuck with people over patent and copyright law.
There are also cases where people use copyright and patent law to hold us back. I can’t find the article, but some religious jerk patented connecting a sex toy to a computer via USB. Thankfully someone got around this patent with Bluetooth and cell phones; otherwise I imagine the camgirl and LDR market for toys would’ve gotten these products 10 years sooner.
I’ve never really delved into the AI copyright debate before, so forgive my ignorance on the matter.
I don’t understand how an AI reading a bunch of books and rearranging some of those words into a new story is different from a human author reading a bunch of books and rearranging those words into a new story.
Most AI art I’ve seen has been… unique, to say the least. To me, they tend to be different enough from the art they were trained on not to be direct ripoffs, so personally I don’t see the issue.
The for-profit large-scale media blender is the problem. When it’s a human writing Harry Potter fan fiction, it’s fine. When a company sells a tool for you to write thousands of trash “books” for profit, it’s a problem.
ML algorithms aren’t capable of producing anything new; they can only ever produce a mishmash of copies of existing works.
If you feed a generative model a bunch of physics research papers, it won’t create a new valid physics research paper, just a mishmash of jargon from existing papers.
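Here is a hedged toy illustration of that “mishmash” claim: a word-level Markov chain, far simpler than any real generative model, and the two “abstracts” below are invented for the example. Where the sources share a phrase, the sampler hops between them, splicing jargon from both papers into one fluent-looking but unvalidated statement.

```python
import random
from collections import defaultdict

# Two invented "paper abstracts" that share the phrase "the scalar potential".
papers = [
    "the boson field couples to the scalar potential in curved spacetime",
    "the scalar potential decays exponentially in the early universe",
]

# Build a word-level Markov chain over both texts.
chain = defaultdict(list)
for text in papers:
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        chain[prev].append(nxt)

# Sample a "new" sentence: at shared words the walk can switch sources,
# stitching fragments of both papers together. Output varies per run.
word, output = "the", ["the"]
for _ in range(13):
    options = chain.get(word)
    if not options:
        break
    word = random.choice(options)
    output.append(word)
print(" ".join(output))
# One possible run: "the boson field couples to the scalar potential
# decays exponentially in the early universe" - jargon from both
# sources, spliced at the shared phrase, asserting nothing anyone checked.
```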
Banning AI is out of the question. Even the EU accepts that, and they tend to be pretty ban-heavy, unlike the US.
But it’s important that we have these discussions about how copyright applies to AI so that we can actually get an answer and move on. Right now it’s a legal quagmire that no one really wants to get involved in except the big companies. If a small group of university students wants to build an AI right now, they can’t, because acquiring training data is a legal nightmare, a Twilight Zone of law.
AI is outright unregulated in the EU unless and until you actually use it for something where it becomes relevant. Then, at the lower end, you’ve got labelling requirements (if your customer service is an AI chat, say that it’s an AI chat), up to heavy, heavy requirements when you use it for stuff like sifting through job applications; the burden of proof that the AI isn’t e.g. racist is on you. Or, for that matter, using it to reject health insurance claims; I think we saw some news lately out of the US about what can happen when you do that.
OpenAI’s copyright case isn’t really going to make the legal situation any clearer: we already know that using pirated content to train on isn’t legal, because you’re not even looking at it legitimately. The case isn’t about the “are computers allowed to learn from public sources just as humans are” question.