TL;DR: Once trained, AI will be able to generate new data of comparable quality to its training data, thereby rendering any training data absolutely worthless. The time to sell data at a reasonable price is now, and those locking their data behind huge financial barriers (such as Twitter and Reddit) are stupidly HODLing a rapidly depreciating asset.
There is at least some evidence that LLMs can learn to produce “better” outputs than any of their training examples. Admittedly, the example I’m referring to used a synthetic grammar to test the capabilities of LLMs on a problem of known difficulty, but the fact remains that they trained a model only on examples containing errors and got a model that could produce entirely correct output.
Yeah, if it’s a situation where you’re feeding in a bunch of articles that have a few random misspelled words in each of them, it should mostly be able to figure out the correct spelling as long as it’s not the same words being misspelled the same way each time. However, if it adopts a particular misspelling of a word as “correct”, feeding the LLM’s output back into itself won’t fix that.
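To make that distinction concrete, here is a toy sketch. It is only an illustration: a per-word majority vote stands in for "learning", which is an assumption on my part and nowhere near how an LLM is actually trained, and the sentences and the `consensus` helper are made up for the example.

```python
from collections import Counter

def consensus(sentences):
    """Pick the most common word at each position across equally long sentences."""
    columns = zip(*(s.split() for s in sentences))
    return " ".join(Counter(col).most_common(1)[0][0] for col in columns)

# Every sentence contains a typo, but never the same one twice.
random_typos = [
    "the quick brwon fox jumps",
    "teh quick brown fox jumps",
    "the quick brown fox jmups",
]

# The same typo shows up in most of the sentences.
systematic_typos = [
    "the quick brwon fox jumps",
    "the quick brwon fox jumps",
    "the quick brown fox jumps",
]

# Random errors get outvoted, so the result is cleaner than any single input.
clean = consensus(random_typos)

# The systematic error wins the vote and is adopted as "correct".
stuck = consensus(systematic_typos)

# Feeding the output back in as new "training data" changes nothing.
still_stuck = consensus([stuck] * 3)

print(clean)
print(stuck)
print(still_stuck)
```

In the first case no input sentence is error-free, yet the vote recovers a fully correct one. In the second case the misspelling is the majority, and recycling the output just reinforces it.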