- cross-posted to:
- technology@lemmit.online
The New York Times sues OpenAI and Microsoft for copyright infringement::The New York Times has sued OpenAI and Microsoft for copyright infringement, alleging that the companies’ artificial intelligence technology illegally copied millions of Times articles to train ChatGPT and other services to provide people with information – technology that now competes with the Times.
There is something wrong when search and AI companies extract all of the value produced by journalism for themselves. Sites like Reddit and Lemmy also have this issue. I’m not sure what the solution is. I don’t like the idea of a web full of paywalls, but I also don’t like the idea of all the profit going to the ones who didn’t create the product.
What’s the value of old journalism?
It’s a product where the value curve is heavily weighted towards recency.
In theory, the greatest value theft is when the AP writes a piece and two dozen other ‘journalists’ copy the thing, changing the text just enough not to get sued. That is completely legal, but it is what effectively killed investigative journalism.
An LLM taking years-old articles and predicting them until it can effectively learn relationships between language itself and events described in those articles isn’t some inherent value theft.
It’s not the training that’s the problem, it’s the application of the models that needs policing.
Like if someone took an LLM, fed it recently published news stories in the prompts with RAG, and had it rewrite them just differently enough that no one needed to visit the original publisher.
Even if we keep it legal for humans to do that (which we might really want to revisit, or at least restrict with an industry-specific rule), maybe we should have different rules for the models.
But claiming that an LLM that’s allowing coma patients to communicate, problem-solving self-driving algorithms, or diagnosing medical issues is stealing the value of old NYT articles in doing so is not really an argument I see much value in.
Except no one is claiming that LLMs are the problem; they’re claiming GPT, or more specifically GPT’s training data, is the problem. Transformer models still have a lot of potential, but the question the NYT is asking is “can you just take anyone else’s work to train them?”
There’s a similar suit against Meta for Llama.
And yes, as the dust settles we will end up seeing in case law whether training an LLM is fair use.
Really gave me a whole new perspective. Thanks for that.
but I also don’t like the idea of all the profit going to the ones who didn’t create the product.
Should… should we tell him?
Tell them instead of mocking them.
Yes, “that’s how the world works”. But doesn’t mean we should stop trying to change it.
The solution is imposing on these companies the responsibility of tracking their profit per media outlet, taxing them, and redistributing that money based on the tracking info. They’re able to track all the pages you visit; it’s complete bullshit when they say they don’t know how much they make from each place their ads are displayed.
AI training is piracy by another name.
Elaborate. Consumption of copyrighted materials is normal use whether by a human or a machine.
Taking someone else’s work and using it without crediting them or compensating them is theft. If OpenAI made a deal with The NY Times to train its product using the paper’s content, which it would turn around and sell to its own customer base, that would be ethical. What OpenAI and other companies like it are doing is stealing ahead of actual law that defines what they’re doing as such.
So listening to Billie Jean without thanking Michael Jackson is theft? That is use.
How about Billie Jean’s bassline, which is borrowed from Hall and Oates’ I Can’t Go For That? Was that theft? Michael felt guilty about it but John felt it was routine for creatives to borrow from each other all the time.
How about money- and lobbyist-inspired extensions of copyright so extreme that both songs (heck, the whole oeuvres of both artists) have been kept out of the public domain? Is that theft too? Or does it only count when companies and rich estates are denied profits?
From your blanket assertion that copyright infringement is theft and your inability or refusal to parse out fair use of copyrighted materials, I infer you don’t actually understand what copyright is or what purpose it is meant to serve for the public. You are just regurgitating the maximalist rhetoric you’ve been spoonfed. It’s really kinda sad.
Feel free to exercise more nuance. Or if you like you can double down and remove all doubt.
Using a tool to copy someone else’s work and then profiting off that work without compensating or even attributing the source is stealing.
Your argument poses an interesting thought. Do machines have a right to fair use?
Humans can consume for the sake of enjoyment. Humans can consume without a specific purpose of compiling and delivering that information. Humans can do all this without having a specific goal of monetary gain. Software created by a for-profit privately held company is inherently created to consume data with the explicit purpose of generating monetary value. If that is the specific intent and design then all contributors should be compensated.
Then again, we need look no further than Google (the search engine, not the company) for an example that’s closely related to the current situation. Google can host excerpts of data from billions of websites and serve that data up upon request without compensating those site owners in any way. I would argue that Google is different though because it literally cites every single source. A search result isn’t useful if we don’t know what site the result came from.
And my final thought - are the works that AI generates truly transformative? I can see arguments that go either way.
Do machines have a right to fair use?
Machines do not have rights or obligations. They cannot be held liable to pay damages or be sentenced for crimes. They cannot commit copyright infringement. But I don’t think we’ll see “the machine did it” as a defense in court.
are the works that AI generates truly transformative?
Usually they are original and not transformative.
Transformative implies that there is some infringement going on. Say, you make a cartoon with the recent Mickey Mouse. But instead of making the same kind of cartoon as Disney would, you use MM to criticize the policies of the Disney corporation (like South Park did). That transforms the work.
Sometimes AI spits out verbatim copies of training data. That is usually transformative. A couple pages of Harry Potter turn into a technical malfunction.
I hope you’ll answer a question in return:
Software created by a for-profit privately held company is inherently created to consume data with the explicit purpose of generating monetary value. If that is the specific intent and design then all contributors should be compensated.
Why? What’s the ethical/moral justification for this?
I know how anarcho-capitalists, so-called libertarians, and other such ideologies see it, but perhaps you have a different take. These groups are also not necessarily on board with the whole intellectual property concept. So that’s what I am curious about. Full disclosure: I am absolutely not on board with that kind of thinking and am unlikely to be convinced. But I am genuinely interested in learning more.
Just getting back around to this.
My main reasoning is simply that authors and artists should be fairly credited and compensated for their work. If I create something and share it on the internet, I don’t necessarily want a company to make money on that thing, especially if they’re making money to my exclusion.
So while I believe that IP as we know it today is probably not the best way to handle things, I still think creators should have some say over how their works are used and should receive some reasonable share when their works are used for profit. Without creators, those works wouldn’t exist in the first place.
Are there other jobs where it would be okay to take a person’s services without paying them? What would motivate people to continue providing those services?
Not the original comment but I think the difference you’re looking for is in the copying and distribution. The OC makes the false assumption that the data set is full copies of every object fed into it rather than sets of common characteristics.
For example, my own mind has a concept tree. Tree is not a copy of every tree I’ve ever known but more like lists of common characteristics that define treeness based on information I’ve gathered about treeness (my data set).
Piracy is piracy not because of how it’s consumed, but rather, how it’s distributed and stored, as full copies of the object. Datasets are not copies, in other words. And thus copyright doesn’t apply.
Reading an article to get an idea about what articleness is, is fair use. Reading an article to reproduce it verbatim is not. And as of now, I don’t believe LLMs are doing the latter.
AI isn’t creating the product. It consumed it.
My question is how an AI reading a bunch of articles is any different from a human doing it. With this logic no one would legally be able to write an article, as they are using bits of other people’s work that they read and learnt to write a good article from.
They are both making money with parts of other people’s work.
It was thought that the LLM wouldn’t keep the actual data internally verbatim. If you can memorize an article, and recite it to everyone free of charge, technically it’s plagiarism. Same if you sing a song to a crowd when you don’t have the rights.
The Google research (and other discovery) proved that you can actually extract verbatim training data from an LLM. Which has a lot of implications for copyright.
The physical limitations are an important difference. A human can only read and remember so much material. With AI, you can scale that exponentially with more compute resources. Frankly, IP law was not written with this possibility in mind and needs to be updated to find a balance.
Let me ask you this: when have you ever seen ChatGPT cite its sources and give appropriate credit to the original author?
If I were to just read the NYT and make money by simply summarizing articles and posting those summaries on my own website without adding anything to it like my own commentary and without giving credit to the author, that would rightfully be considered plagiarism.
This is a really interesting conundrum though. I would argue that AI isn’t capable of original thought the way that humans are and therefore AI creators must provide due compensation to the authors and artists whose data they used.
AI is only giving back some amalgamation of words and concepts that it has been trained on. You might say that humans do the same, but that isn’t exactly true. The human brain is a funny thing. It can forget, it can misremember. It can manipulate. It can exaggerate. It can plan. It can have irrational or emotional responses. AI can’t really do those things on its own. It’s just mimicking human behavior at best.
Most importantly to me though, AI is not capable of spontaneous thought. It is only capable of providing information that it has been trained on and only when prompted.
Let me ask you this: when have you ever seen ChatGPT cite its sources and give appropriate credit to the original author?
Bing chat now does that by default. Normally you have to prompt that manually.
If I were to just read the NYT and make money by simply summarizing articles and posting those summaries on my own website without adding anything to it like my own commentary and without giving credit to the author, that would rightfully be considered plagiarism.
No. It would be considered journalism. If you read the news a bit, you will find that they reference the output of other news corporations quite a bit. If your preferred news source does not do that, then they simply don’t cite their sources.
Prompting for a source wouldn’t satisfy me until I could trust that the AI wasn’t hallucinating. After all, if GPT can make up facts about things like legal precedent or well documented events, why would I trust that its citations are legitimate?
And if the suggestion is that the person asking for the information double check the cited sources, maybe that’s reasonable to request, but it somewhat defeats the original purpose.
Bing might be doing things differently though, so you might be right in your assessment on that front. I haven’t played with their AI yet.
You did ask if ChatGPT had ever cited sources. Bing uses it, and besides, you can ask for that manually.
Whether it defeats the purpose depends on your original purpose.
There is evidence to suggest some LLMs have the ability to produce original outputs, such as DeepMind’s solution to the cap set problem.
https://www.nature.com/articles/s41586-023-06924-6
On the other hand, LLMs have some incredible text compression abilities
https://arxiv.org/abs/2308.07633
I’m pretty sure there is copyright infringement going on by the letter of the law. But I also think the world would be better off if copyright laws were a bit more loose. Not wild-west anything-goes libertarianism, but more open than the current state.
I tend to agree with your last point, especially because of the way the system has been bastardized over the years. What started out as well intentioned legislation to ensure that authors and artists maintain control over their work has become a contentious and litigious minefield that barely protects creators.
An AI does not learn like a human does. Therefore the same laws and principles can’t be applied to computer “learning” as can be to human learning.
They’re fundamentally different uses of the material.
If this lawsuit causes it to be ILLEGAL to read anything you buy because you could plagiarize it, Bradbury is gonna spin in his fucking grave.
I think the important difference in this case is like the difference between a human enjoying a song that they hear being performed vs a company recording a song that someone is performing and then replaying that song on demand for paying customers.
Except it’s not replaying those songs exactly, not even in their entirety. It’s taking a few notes from here and there, arranging them in a way that makes sense, and effectively performing a “new” song - which isn’t all that different from a human artist who is “inspired” by the works of other artists and produces a new work in the same genre.
The main difference is the volume. An example I like is how Google trained its gaming AI to play StarCraft 2. This AI was able to beat high-ranked professional gamers. It was trained by watching a century of games.
ChatGPT didn’t read a few articles, it read years of them, maybe a couple of decades.
Reminds me of Nokia suing Apple (two waves), Blockbuster suing Netflix, and Yahoo suing Facebook. Threatened, declining company suing a disruptor is what we can expect will always happen I guess. Will be nice to see this stuff finally tested in court though.
Except the news still needs to come from somewhere. While GPT can “create” things, it’s not a journalist. It’s just the next step in aggregation skimming money from the actual sources.
Anyone remember when Bing started threatening people? https://time.com/6256529/bing-openai-chatgpt-danger-alignment/
Interesting take on this in a Mastodon thread: https://hachyderm.io/@Impossible_PhD/111654403989681220
This person seems not to know very much about what they are talking about, despite their confidence in saying it.
It looks like they think the reason AI output can’t be copyrighted is that it’s been “ruled a derivative work”, but that’s not the reasoning provided. The reasoning is that copyright can only protect human creativity, and thus machine output without human involvement can’t be copyrighted, with the judge noting that it is unclear what proportion of human contribution is needed.
The other suits trying to claim the models are derivative works are either yet to be settled or in some cases have been thrown out.
Even in one of the larger suits on whether training is infringement regarding LLMs, the derivative claim has been thrown out:
Chhabria, in his ruling, called this argument “nonsensical,” adding, “There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books.”
Additionally, Chhabria threw out the plaintiffs’ argument that every LLaMA output was “an infringing derivative” work and “constitutes an act of vicarious copyright infringement”; that LLaMA was in violation of the Digital Millennium Copyright Act; and that LLaMA “unjustly enriched Meta” and “breached a duty of care ‘to act in a reasonable manner towards others’ by copying the plaintiffs’ books to train LLaMA.”
Social media has really turned into a confirmation bias echo chamber where misinformation can run rampant when people make unsourced broad claims that are successful because they “feel right” even if they aren’t.
Perhaps the reason hallucination is such a problem for LLMs is that in the social media data that’s a large chunk of their training everyone is so full of shit?
Perhaps the reason hallucination is such a problem for LLMs is that in the social media data that’s a large chunk of their training everyone is so full of shit?
Heh. I think it simply shows us that the fundamental principle of artificial neural nets really captures how the brain works.
Social media has really turned into a confirmation bias echo chamber where misinformation can run rampant
Honestly this can be easily overstated in the case of social media relative to anything else humanity does. By and large no one knows anything and is happy talking and speculating as they do. It was true before social media and it will be after.
The fun part is trying to make sense of it all, thus why I said “interesting”.
I personally have thought the copyright dimension one of the more interesting aspects of AI in the short and medium term and have thought so for years. Happy to hear takes and opinions on the issue, especially as I’m not plugged into the space any more.
If you want an interesting take from a source that understands both the tech and legal sides for real and not just pretend, see this from the EFF:
https://www.eff.org/deeplinks/2023/04/how-we-think-about-copyright-and-ai-art-0
It’s about the diffusion art models and not LLMs, but most of its points still apply (even the point about a stronger case regarding outputs by plaintiffs whose works can be reproduced, such as in this case).
Cheers!
My immediate reaction to the piece is that insofar as it’s trying to predict the path that the courts will take, the author may be too close to the tech while I can imagine judges readily opting to eschew what they’d feel would be excessive technical details in their reasoning. I’m curious to see how true that is.
For me the essential point, made at the end, is what do creators really want from copyright apart from more money … because any infringement case against AI easily spells oppressive copyright law.
I’m curious to see if a dynamic factor in this is how the courts conceive of what the AI actually is and does. The one byte per work argument may come off as naive, for instance, and lead a judge to construct their own model of what’s happening.
Otherwise, for the purposes of this thread and the take I posted from Mastodon, I’d say the question of whether AI creates copyrightable works, how the broader industries respond to that, and what’s legally required of them stands as fundamental in the medium term.
Now curious to see what legal scholarship is predicting, which in some cases is probably a better predictor.
Here’s the author’s bio:
Kit is a senior staff attorney at EFF, working on free speech, net neutrality, copyright, coders’ rights, and other issues that relate to freedom of expression and access to knowledge. She has worked for years to support the rights of political protesters, journalists, remix artists, and technologists to agitate for social change and to express themselves through their stories and ideas. Prior to joining EFF, Kit led the civil liberties and patent practice areas at the Cyberlaw Clinic, part of Harvard’s Berkman Center for Internet and Society, and previously Kit worked at the law firm of Wolf, Greenfield & Sacks, litigating patent, trademark, and copyright cases in courts across the country.
Kit holds a J.D. from Harvard Law School and a B.S. in neuroscience from MIT, where she studied brain-computer interfaces and designed cyborgs and artificial bacteria.
The author is well aware of the legal side of things.
Oh I’m sure, and it was a good article to be clear. But “the legal side of things”, especially from a certain perspective, and what the courts (and then the legislature) do with a new-ish issue can be different things.