Article: Data Protectionism Is Self-Defeating

SquishyPillow@burggit.moe · 1 year ago

rinkan 輪姦@burggit.moe · 1 year ago

The author’s larger point about the non-viability of data cartels may be correct, but the claim that AIs can be trained on their own output seems wrong. If an AI is giving incorrect output, adding that output to the training data will just reinforce the error, not correct it.

RA2lover@burggit.moe · 1 year ago

You don’t need to use all the output for training if you can separate out the good parts. “OpenAI” reportedly used paid (and is now using free) RLHF (reinforcement learning from human feedback) for this, and Anthropic is trying to develop RLAIF (reinforcement learning from AI feedback) to achieve the same thing.

Somdudewillson@burggit.moe · 1 year ago

There is at least some evidence that LLMs can learn to produce “better” outputs than any of their training examples - admittedly, the example I’m referring to was using a synthetic grammar to test the capabilities of LLMs with a problem of known difficulty, but the fact remains that they trained a model with only examples containing errors and got a model that could produce entirely correct output.

rinkan 輪姦@burggit.moe · 1 year ago

Yeah, if it’s a situation where you’re feeding in a bunch of articles that have a few random misspelled words in each of them, it should mostly be able to figure out the correct spelling as long as it’s not the same words being misspelled the same way each time. However, if it adopts a particular misspelling of a word as “correct”, feeding the LLM’s output back into itself won’t fix that.
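
To make the distinction concrete, here is a toy sketch. It uses a plain majority vote as a stand-in for the statistical averaging a model does over its training data; that is an assumption for illustration, not how LLM training actually works, and the word lists are made up.

```python
from collections import Counter

def consensus_spelling(observations):
    """Return the most frequent spelling among noisy observations."""
    return Counter(observations).most_common(1)[0][0]

# Random, uncorrelated typos: each wrong form shows up once,
# so the correct form still wins the vote.
random_noise = ["receive", "recieve", "receive", "receeve", "receive", "reseive"]

# Systematic error: the same misspelling dominates the data
# (for example, because it came from the model's own earlier output).
systematic = ["recieve", "recieve", "recieve", "receive", "recieve"]

print(consensus_spelling(random_noise))
print(consensus_spelling(systematic))
```

In the first list the errors cancel out; in the second, the vote locks in the misspelling, and adding more copies of that output only widens the margin.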

SquishyPillow@burggit.moe · edited 1 year ago

If AI-generated data is curated, I believe it can be used for further AI training. Curation itself can be covertly crowdsourced by deploying LLM bots on social media and selecting only the generated messages that receive the most likes/upvotes/whatever to use for training.
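
A minimal sketch of that selection step, assuming each generated message has already been posted and its vote count collected. The `GeneratedPost` type, the `curate` function, and the `top_k` / `min_upvotes` thresholds are all hypothetical names chosen for illustration, not part of any real pipeline.

```python
from dataclasses import dataclass

@dataclass
class GeneratedPost:
    text: str      # message produced by the LLM bot
    upvotes: int   # score it received after being posted

def curate(posts, top_k=2, min_upvotes=1):
    """Keep the text of the top_k highest-voted posts that meet a minimum score."""
    eligible = [p for p in posts if p.upvotes >= min_upvotes]
    ranked = sorted(eligible, key=lambda p: p.upvotes, reverse=True)
    return [p.text for p in ranked[:top_k]]

posts = [
    GeneratedPost("a", 12),
    GeneratedPost("b", 0),
    GeneratedPost("c", 40),
    GeneratedPost("d", 3),
]

# Only the best-received messages go back into the training set.
training_texts = curate(posts)
print(training_texts)
```

Votes are a noisy proxy for quality, so a real setup would probably want extra filtering on top of this.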

I should also mention that synthetic data curation has already been proven to be successful to some degree. WizardLM is trained on the evol-instruct dataset, which is a synthetic dataset generated by ChatGPT. You can read more about how the dataset and model were created here. And if you want to evaluate WizardLM itself, the model is available in GGML format in various sizes here.