TL;DR: Once it has been trained, AI will be able to generate new data of comparable quality to its training data, thereby rendering that training data worthless. The time to sell data at a reasonable price is now, and those locking their data behind huge financial barriers (such as Twitter and Reddit) are stupidly HODLing a rapidly depreciating asset.
Yeah, if you’re feeding in a bunch of articles that each have a few random misspelled words, the LLM should mostly be able to figure out the correct spelling, as long as it’s not the same words being misspelled the same way each time. However, if it adopts a particular misspelling of a word as “correct”, feeding the LLM’s output back into itself won’t fix that.
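To make that concrete, here's a toy sketch in Python. It is not how an LLM actually learns: it just assumes the "model" picks whichever spelling it saw most often, and the word lists and function names (`learned_spelling`, `retrain_on_own_output`) are made up for illustration.

```python
from collections import Counter


def learned_spelling(observations):
    """Return the variant seen most often (a crude stand-in for what a model picks up)."""
    return Counter(observations).most_common(1)[0][0]


def retrain_on_own_output(observations, rounds=3, samples=10):
    """Hypothetical feedback loop: emit the learned spelling, then train on only that output."""
    data = list(observations)
    for _ in range(rounds):
        data = [learned_spelling(data)] * samples
    return learned_spelling(data)


# Random typos: every error is different, so the correct spelling still dominates.
random_typos = ["definitely"] * 7 + ["definately", "definitly", "defenitely"]

# Systematic typo: the same wrong spelling makes up most of the data.
systematic_typo = ["definitely"] * 3 + ["definately"] * 7

print(learned_spelling(random_typos))
print(learned_spelling(systematic_typo))
print(retrain_on_own_output(systematic_typo))
```

With the random typos, the correct spelling wins the vote. With the systematic one, the wrong spelling wins, and looping the output back in just repeats it, since nothing in the loop ever reintroduces the correct form.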