41 votes

Database containing nearly 200,000 pirated books being used to train AI - authors were not informed

12 comments

  1. [7]
    Oxalis

    EDIT: Dug through my reddit history and found sources.

    Here's one of the original Hacker News posts about Books3, along with the introduction tweet about its release: https://news.ycombinator.com/item?id=24884789

    From what I've heard and seen, the original corpus of books was acquired from an invite-only ebook torrent tracker (MAM bibliotik) that was scraped in full by a piracy group called the-eye who sought to "free" all information from the hands of private communities... access to the firehose of freed data was behind a donation wall though. c:

    The-eye is a weird, drama-fraught community run by a gentleman known only as "The Archivist". He was brash, bigoted, and had a vast library of VHS rips from which he would make weird video collages. He made multiple mentions of being an employee of The Internet Archive (TIA), which was corroborated by torrent tracker users who saw a new peer coming from a *.archive.org domain leeching everything from them during the "project liberation" download spree. If there are any journalists about, this could be worth looking into.

    Here's the original /r/opendirectories post from The Archivist himself about the book rip - https://www.reddit.com/r/opendirectories/comments/f2teym/project_liberation_bibliotik_terabytes_of_ebooks/

    The corpus was available on GitHub for quite some time before the repository was taken down. Forks and clones pop up here and there, but now you have to ask around in AI communities for a copy, where it is readily available.

    I'm not positive about AI as a whole, and the use of stolen work to train for-profit systems that seek to take as large a piece of the creative-industry pie as possible, all while being shielded by breathless and aggressive online defenders, is just so disheartening.

    20 votes
    1. [6]
      smiles134

      a piracy group called the-eye who sought to "free" all information from the hands of private communities... access to the firehose of freed data was behind a donation wall though. c:

      Did they acknowledge the irony of this in any way?

      5 votes
      1. [2]
        Oxalis

        That's actually what started the big drama that I watched unfold when I used to hang out in their Discord.

        Some of the (unpaid) staff didn't like his pay-for-access approach, so they made a full dump of the private staff chat available. It showed just how awful The Archivist was when he felt he could be glib, which then led to multiple staff leaving and lots of shakeups in the organizational structure, as often happens in edgy technical communities.

        9 votes
        1. flowerdance

          Imagine if The Archivist was actually someone who had the personality and mentality for the job of open knowledge. Shame.

          3 votes
      2. [3]
        tesseractcat

        I'll play devil's advocate here and say I don't know if this is really as ironic as it seems at first blush. Private trackers like Bibliotik are basically impossible for the average person to get into. And huge archives like the-eye cost a lot of money to run. Even if you charge money, it's still a lot more open than a private tracker, and without donations it might not exist.

        9 votes
        1. [2]
          nocut12

          I think it's kind of the same reason private trackers exist. You need some kind of external incentive to keep niche stuff available on peer to peer networks (because there often simply aren't enough interested people). For private trackers, that incentive is through the internal reputation systems and wanting to keep access. For this thing, it's the money.

          I guess there's value in making things more widely available than they are on private trackers, but doing it this way made it a lot more vulnerable to getting taken down. If the way to get this dataset today is to ask around in a specialized community, well... it doesn't sound that different from getting on private trackers.

          I guess I think it's not too surprising that this would flame out quickly, and there's a reason why the status quo with private torrent trackers is the way it is.

          5 votes
          1. tesseractcat

            I agree, it's mostly predictable that it didn't work out so well. I still think there's value in trying different incentive models (even if they don't work out), as private trackers can be quite restrictive.

            3 votes
  2. vord

    So, copyright holders could legally seek damages of $150,000 per offense (because this was willful, intentionally sought out, and used for further profit). Assume we only count each work once, and cut the number of books in half to account for those that might actually be public domain...

    That puts the approximate "value" of one of these LLMs at $15 billion of damages. And you could theoretically seek that for each 'pass' that's done through the dataset.
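
A rough sketch of that arithmetic in Python, assuming the headline's count of roughly 200,000 books and the $150,000-per-offense figure cited above. The names are purely illustrative, and this is a back-of-the-envelope estimate, not a legal calculation.

```python
# Back-of-the-envelope estimate. Every input is an assumption from this thread.
BOOKS_IN_DATASET = 200_000   # "nearly 200,000" per the headline
DAMAGES_PER_WORK = 150_000   # dollars per offense, as cited in the comment
COUNTED_FRACTION = 0.5       # halve the count to allow for public-domain books


def estimate_damages(books: int, per_work: int, counted_fraction: float) -> int:
    """Count each work once, keep only the assumed in-copyright share, multiply."""
    counted_works = int(books * counted_fraction)
    return counted_works * per_work


total = estimate_damages(BOOKS_IN_DATASET, DAMAGES_PER_WORK, COUNTED_FRACTION)
print(f"${total:,}")  # prints $15,000,000,000
```

Changing `COUNTED_FRACTION` shows how sensitive the total is to the public-domain guess.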

    I wager similar things have been done for music, movies, and art. Turns out going after end-user piracy was always small potatoes. Hollywood could fund the next decade of blockbusters off punitive damages against tech companies.

    8 votes
  3. [3]
    balooga

    The link says 20,000 but the article says 200,000. That's a big difference; it would be good to clear up that error.

    6 votes
  4. Stranger

    The problem? No one told the authors.

    In general I'm not inclined to believe creators have a legal leg to stand on vis-a-vis copyright law when their works are used to train AI without their prior approval. That said, it's one thing when bots are scraping publicly available or legally purchased data; it's another thing to use pirated content. It'll be interesting to see what sort (if any) of legal consequences come of this and how it might set precedents.

    1 vote