How much does it bother you that OpenAI is trained on your data? What can we do about it?

duncesplayed · 3 years ago

jonah · 3 years ago

The biggest problem to me is what I just saw you post in another reply, that these models built upon our knowledge exist almost solely within proprietary ecosystems.

“and maybe even our Mastodon or Lemmy posts!”

The Washington Post published a great piece that lets you search which websites were included in the “C4” dataset published in 2019. I searched for my personal blog, jonaharagon.com, and sure enough it was included. The C4 dataset is practically minuscule compared to what is being compiled for larger models like ChatGPT, so if my tiny website was included, Mastodon and Lemmy posts (which are actually very visible and SEO-optimized tbh) are 100% being scraped as well. There’s no maybe about it.

Schedar@beehaw.org · 3 years ago

Thanks for linking to that, I hadn’t seen that article before. It’s interesting to see it broken down like that and to be able to search for a website to see if it was part of the training data.