21 votes

Stack Overflow disables the Creative Commons data dump

5 comments

  1. [2]
    skybrian
    This is just speculation, but it seems plausible that it has to do with AI companies using it for training.

    8 votes
    1. [2]
      Comment deleted by author
      1. skybrian
        Well, I guessed right. Thanks for the link!

        3 votes
  2. [3]
    vord
    I don't know how I feel about this..

    Being able to train LLMs on Wikipedia and Stack Overflow (and GitHub IMO, but the grounds there are shakier), then release tools which remove the need for end users to visit those sites, lining your own pockets while leaving WP and SO in the dust, feels off.

    I think the answer is that a new version of CC/GPL/etc. needs to address training of AI. Namely, an LLM trained on GPL 4.0 material would also be subject to GPL terms, or a CC variant would prohibit adding the work to an AI dataset.

    6 votes
    1. Interesting
      I mean, the real answer is that an LLM trained on a work is likely to infringe that work, and so LLM developers need to have a license to all the material they train on. Stack Overflow content is CC BY-SA (a copyleft license), so only copyleft models (which share their weights) should be able to train with it. That idea would go for tools like GitHub Copilot too; I'm frankly disgusted that they trained that on the entirety of public GitHub regardless of license.

      And on that note, many users contributed to SO because of the copyleft licensing. Developer time answering questions is incredibly valuable and expensive to get in meatspace. Many contributors offered it anyway because they knew their answers would be available for everyone to use, even if someone wanted to spin off a new website. That used to be called their "fallback" in case the site became evil.

      3 votes
    2. [2]
      Comment deleted by author
      1. vord
        I would also contend that a separate license/appendix which is quite explicit about usage in machine learning will avoid some (but not all) of the potential legal loopholes and decades of court time.

        Something akin to a 'robots.txt' that all data models are expected to follow.

        Like trying to sidestep arguments about an LLM being a derivative work.