I think your average geek used to be, like, somewhat academic and erudite and into arcane knowledge, and had some level of good faith in wanting to engage in discussion.

Now it’s all frauds and absolutely braindead Elon stans and crypto dipshits and conservative freaks and people who enjoy and defend watching big tech destroy everything.

  • BaroqueInMind
    10 months ago

    You emphasized the words “well known” but provide no links to back that up because I’ve never known

    • Snot Flickerman@lemmy.blahaj.zone
      10 months ago

      https://huggingface.co/datasets/defunct-datasets/the_pile_books3

      This dataset is Shawn Presser’s work and is part of EleutherAi/The Pile dataset.

      This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as did for bookcorpusopen (a.k.a. books1). seems to be similar to OpenAI’s mysterious “books2” dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it’s “all of libgen”, but it’s purely conjecture.

      https://web.archive.org/web/20220522050247/https://huggingface.co/datasets/the_pile_books3

      I emphasize “well known” because it was literally in the description when it was initially uploaded to the internet. It was always right out in front that this was all the ebooks from the private torrent tracker Bibliotik. Shawn Presser/books3 never lied about where it came from. As you can see with the archive.org link, that description of its sourcing was on the page in May 2022.

      Bibliotik is a well-known private tracker for ebooks and even peddles tools for removing DRM from ebooks. So, arguably, not only are the books pirated, but at some point a DMCA criminal violation occurred when the DRM was stripped from them. So OpenAI’s willingness to use it without question to get their company started should be evidence that they’re not concerned about where the data came from or about getting it in more legal ways.