TL;DR: Once an AI has been trained, it will be able to generate new data comparable in quality to its training data, thereby rendering any training data worthless. The time to sell data at a reasonable price is now, and those locking their data behind huge financial barriers (such as Twitter and Reddit) are stupidly HODLing a rapidly depreciating asset.

  • Somdudewillson@burggit.moe · 1 year ago

    There is at least some evidence that LLMs can learn to produce “better” outputs than any of their training examples. Admittedly, the example I’m referring to used a synthetic grammar to test LLMs on a problem of known difficulty, but the fact remains that a model trained only on examples containing errors ended up able to produce entirely correct output.

    • rinkan 輪姦@burggit.moe · 1 year ago

      Yeah, if you’re feeding in a bunch of articles that each have a few random misspelled words, the model should mostly be able to work out the correct spellings, as long as it isn’t the same words being misspelled the same way every time. However, if it adopts a particular misspelling as “correct”, feeding the LLM’s output back into itself won’t fix that (a toy sketch of both cases follows below).
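
To make that intuition concrete, here is a minimal toy sketch. It involves no actual LLM: the word, the corruption step, and the per-position frequency-counting “model” are all stand-ins chosen purely for illustration. When every training copy contains an independent random misspelling, the correct character is still the majority at each position, so the model’s most-likely output is the correct word; when every copy contains the same misspelling, the model adopts it, and retraining on its own output never corrects it.

```python
# Toy illustration, not an LLM: a "model" that only counts which character
# appears at each position of its training strings. It is meant to show why
# independent random errors wash out while a systematic error survives
# self-training. All names and values here are made up for the example.
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
TRUE_WORD = "language"

def corrupt_randomly(word: str) -> str:
    """Replace one randomly chosen position with a random letter."""
    i = random.randrange(len(word))
    return word[:i] + random.choice(ALPHABET) + word[i + 1:]

def train(samples: list[str]) -> list[Counter]:
    """'Training' = counting which character appears at each position."""
    counts = [Counter() for _ in range(len(samples[0]))]
    for s in samples:
        for i, ch in enumerate(s):
            counts[i][ch] += 1
    return counts

def generate(model: list[Counter]) -> str:
    """Greedy decoding: the most frequent character at each position."""
    return "".join(c.most_common(1)[0][0] for c in model)

random.seed(0)

# Case 1: every training example is corrupted, but the errors are independent,
# so the correct character remains the majority at each position.
noisy = [corrupt_randomly(TRUE_WORD) for _ in range(1000)]
print("random errors ->", generate(train(noisy)))      # prints: language

# Case 2: the same misspelling in every example. The model adopts it, and
# retraining on the model's own output just reproduces the mistake.
systematic = ["langauge"] * 1000
model = train(systematic)
for _ in range(3):  # crude "self-training" loop on the model's own output
    model = train([generate(model)] * 1000)
print("systematic error ->", generate(model))          # prints: langauge
```

An actual LLM is obviously far more complicated than per-position character counts, but the majority-vote effect is the same reason independent noise tends to wash out of the training signal while correlated, systematic errors do not.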