• NotAPenguin@kbin.social
    1 year ago

    The article doesn’t explain how that’s the case at all.

    Aren’t all the big AI models trained on publicly available data?

    • Hot Saucerman@lemmy.ml
      1 year ago

      Books3 is the definition of “not publicly available” because it’s all pirated material downloaded from the private torrent tracker Bibliotik.

      Books3 is literally why several AI groups are being sued by various authors like Sarah Silverman and George R.R. Martin.

      Books3 was always illicitly obtained material, which calls into question whether an LLM using it could really fall under Fair Use. (It most likely does, but it’s still a legal question that hasn’t been answered yet.)

      Books3 Link: https://huggingface.co/datasets/the_pile_books3

      Books3 Description from Link:

      This dataset is Shawn Presser’s work and is part of EleutherAi/The Pile dataset.

      This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as did for bookcorpusopen (a.k.a. books1). seems to be similar to OpenAI’s mysterious “books2” dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it’s “all of libgen”, but it’s purely conjecture.

    • SpeakinTelnet@programming.dev
      1 year ago

      I see it more like your address being public in the sense that if I knocked on every door and looked through every window, I would eventually see where you live. But I probably wouldn’t be able to quickly search for where you live, because it’s not made to be public knowledge.

      AI takes everything and makes it easily searchable for itself, even if it wasn’t made to be.