40 votes

‘Not for machines to harvest’: Data revolts break out against AI

35 comments

  1. [4]
    hobbes64

    I’m sympathetic to the creators featured in the article, but when it mentioned Reddit and Steve Huffman I got a bit annoyed:

    Larger companies are also pushing back against A.I. scrapers. In April, Reddit said it wanted to charge for access to its application programming interface, or A.P.I., the method through which third parties can download and analyze the social network’s vast database of person-to-person conversations.
    Steve Huffman, Reddit’s chief executive, said at the time that his company didn’t “need to give all of that value to some of the largest companies in the world for free.”

    Dude, you didn’t create any of that value; you just paid for the servers and the community created it. It pisses me off that these types think they are owed something. He already got the money he deserves from advertising and Reddit coins and premium. If he lost money with those, it’s his problem that he had a shitty business plan.

    And that’s setting aside the fact that he killed the third-party apps in a separate attempt to monetize the data and pretended it was because of AI scraping.
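
    For context, the A.P.I. at issue is just a programmatic way to read Reddit's data. Here's a minimal sketch of the kind of third-party access being charged for - the public .json endpoint is real, but headers, availability, and rate limits are subject to Reddit's current policy:

    ```python
    # Sketch of the kind of read access third parties relied on. The public
    # .json endpoint shape is real; limits and terms follow Reddit's policy.
    import json
    import urllib.request

    req = urllib.request.Request(
        "https://www.reddit.com/r/programming/top.json?limit=3",
        headers={"User-Agent": "demo-script/0.1"},  # Reddit rejects blank UAs
    )
    with urllib.request.urlopen(req) as resp:
        listing = json.load(resp)

    for child in listing["data"]["children"]:
        print(child["data"]["title"])
    ```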

    63 votes
    1. [2]
      Trauma

      Well, he's right. Reddit does own the data; that's what they write in their TOS. I miss Reddit, too, but whatever our feelings are about the API crackdown, that's always something to keep in mind when interacting with a website that's fueled by user content, and it's a reason I'm here now.

      18 votes
      1. Wes

        Reddit does own the data; that's what they write in their TOS.

        It depends what you mean by data, but they certainly do not own any comments. Their authors retain ownership, and they license that content to Reddit to use and display.

        You retain any ownership rights you have in Your Content, but you grant Reddit the following license to use that Content

        https://www.redditinc.com/policies/user-agreement-april-18-2023

        Tildes works the same way.

        You retain copyright and ownership of any of your own content that you submit to Tildes (except for contributions to Tildes wiki pages, see below). However, you grant us a non-exclusive license to store, display and distribute that content in the context of operating the site, subject to our Privacy Policy.

        https://docs.tildes.net/policies/terms-of-use#content-you-post-to-tildes

        24 votes
    2. qob

      To me, this comes down to "the whole is more than the sum of its parts". Yes, reddit would be nothing without the contributions, but without reddit's infrastructure, there wouldn't be anything to contribute to. Most of the content would exist somewhere, but it wouldn't be shared, organized and curated in a single place where everyone can meet and add to it.

      I think some kind of default GPL license for public content would be best. You can use people's creative output for free as long as your derivatives are also free. OpenAI could either pay people to generate proprietary training data and keep their products locked behind corporate bars, or they could use public training data and provide free and open access to everyone.

      Everyone is standing on the shoulders of giants. Sharing is what made us so successful as a species.

      4 votes
  2. [20]
    feanne

    The data rebellion that we’re seeing across the country is society’s way of pushing back against this idea that Big Tech is simply entitled to take any and all information from any source whatsoever, and make it their own

    I agree that we should question the legitimacy of Big Tech's claims over all published data as training material.

    Big corporations should not be able to harvest, en masse, what is essentially a public resource and then turn around and privatize the monetization of that in a way that actually harms the public (by reducing work opportunities or otherwise disrupting markets on a mass scale).

    This isn't just about copyright. We're already living in a world wherein unfettered data harvesting is perpetuating systemic oppression:

    companies like RELX exploit a lack of data privacy laws to make millions of dollars building data products to sell to cops, your employer, your landlords, your insurance companies, and all sorts of other institutions and overlords. These companies and institutions use RELX’s “risk” products to make decisions about whether you should get hired for a job, have custody of your children, have access to certain types of medication, and even whether you will be detained or arrested. RELX’s LexisNexis products have helped the government spy on protesters’ social media accounts and surveil immigrants. Police have abused LexisNexis systems to spy on exes and even to blackmail women using the personal information the company’s policing products provide.

    P.S. Stable Diffusion's lawyer is Mark Lemley, co-founder of legal data analytics company Lex Machina which was acquired by LexisNexis in 2015.

    30 votes
    1. [2]
      Comment deleted by author
      1. feanne

        Re ToS: I'm arguing that they shouldn't be allowed to use it in this way, because it's exploitative.

        This other comment has already explained how only big corporations can afford to train LLMs.

        Allowing unfettered data harvesting will only strengthen big corporations and help them continue to extract value out of public resources and privatize/consolidate it for themselves.

        Based on the principle that public resources should primarily serve the public good, I would support regulations such as:

        • allowing scraping w/o consent for non-commercial research purposes only (and usage cannot be changed to commercial later on, which is what happened with OpenAI DALL-E 2, for example)
        • requiring all corporate-owned AI tech to be open source
        6 votes
    2. [12]
      the9tail

      The argument is that research through websites in itself cannot and should not be prevented, and since research can’t be stopped, neither can a machine that is essentially researching the entire internet at once.

      If the data presented is an amalgam of different data sources, then the only thing separating them is referencing, which LLMs can provide when asked. Since there is no ownership of interpretation, LLMs aren’t doing anything different from a pen-and-paper researcher.

      So what’s the answer? Well, it’s obvious: if people are complaining that tech companies are stealing their data, then they need to paywall it. If it’s free to access, it’s free, and they need to acknowledge that and move away from the free, advertising-funded model.

      3 votes
      1. [11]
        feanne

        As an artist, I genuinely don't want to charge people a fee just to view my art online. I also want to differentiate between research and commercial use. When I participated in closed beta testing of OpenAI DALL-E 2 mid last year, I remember being told by the DALL-E 2 team that this was strictly all for non-commercial research purposes. And it turned out it wasn't. I don't think it's ok to use creators' works without consent for "research" and then turn around and commercialize it.

        As for "what's the difference between a human doing it vs generative AI", this comment in another post discusses that comparison very thoughtfully. I'll just add that a human can do it without having to exploit laborers who do the content moderation work required to train AI.

        11 votes
        1. [10]
          the9tail

          Don’t know if you realise that a lot of research published online has to be paid for, right? Universities and businesses pay for access and then give access to their students and employees.

          All research is ultimately commercial if it isn’t a student writing an essay. Things not behind a paywall are free for use; once you make that decision, you don’t get to dictate their use because of a moral objection.

          A computer learning and using patterns is the same way anyone else learns art. The objection that the scale has increased because a computer can do it doesn’t change what activity is being done or how the rules apply.

          Of course this is all moot. We see things through the lens of a responsible government that can make rules and possibly restrict company actions - but the reality is that LLMs are such a buzz now that Meta made theirs open source. Which means LLMs are in the hands of everyone, including companies that ignore copyright or rules. Combined with the fact that the scraping has already begun, that means anything older than today is already processed.

          Welcome to the future. It’s going to get more complicated as actors try to copyright their look, voice, posture and facial expressions. Where art and written expression are boiled down to owning an idea, then selling that idea to be put into LLMs for scripts, books and dirty limericks. Where prompt engineer is an entry-level job at every workplace. A world where you can go to a website, “write Terminator 2, replace Arnie with Stallone, and set it in the Old West and replace robots with molemen in disguise” and wait 5 mins as the bar on the screen fills.

          It’s coming, and not later but soon - whatever reasonable guess you want to give, expect it sooner. Expect that in the offices at Adobe right now, they are laughing their asses off watching an episode of Futurama made realistic but with amazing cinematography.

          1 vote
          1. [3]
            feanne

            All research is ultimately commercial if it isn’t a student writing an essay.

            There's such a thing as publicly funded research, which is then used to benefit the public. That's non-commercial research.

            What I experienced with OpenAI was not that. They developed the technology under the guise of "non-profit research" and then turned it into a for-profit commercial product.

            Things not behind a paywall are free for use; once you make that decision, you don’t get to dictate their use because of a moral objection.

            There's plenty of content that isn't behind a paywall but is still governed by terms of use. And I don't think you get to dictate what is and isn't acceptable use of a work that you don't have the rights to.

            A computer learning and using patterns is the same way anyone else learns art. The objection that the scale has increased because a computer can do it doesn’t change what activity is being done or how the rules apply.

            Why shouldn't the scale of impact affect how we evaluate something? It's pretty normal and sensible for there to be a difference in how we treat, for example, theft of a loaf of bread vs. theft of a whole truckload of bread. And, quoting Cory Doctorow re. thinking about the morality of technology: it's not about what the technology does, but who it does it for and who does it to.

            Of course this is all moot.

            I agree that figuring out fair and ethical usage of AI is extremely difficult. I just don't agree with the fatalism of "it's going to happen anyway so we might as well just go along with it and not bother with figuring out the ethics of it".

            6 votes
            1. [2]
              the9tail

              It’s not that “it’s happening anyways”, it’s that it’s already happened.

              The problem here is that computing and drugs share a common legal advantage: when a new discovery is found, it gets to exist without any rules for a long enough period of time that its existence can’t simply be squashed before it’s everywhere.

              People can cry out “No” as loud as they want but this stuff exists and it doesn’t matter if your government deems it illegal - it’s still going to happen and someone is gonna tweak it just enough to get around rules to create a second wave.

              If Adobe is banned from making AI art, another website will offer it instead, The Pirate Bay will offer desktop software, Reddit will provide a tool to remove any tags that say it’s AI, and Amazon will resell the art piece for $50 delivered.

              It’s over before it began.

              1 vote
              1. feanne

                I still don't agree with "it's already happened anyway, so we might as well just go along with it and not bother with figuring out an ethical way to address it".

          2. [3]
            sparksbet

            A computer learning and using patterns is the same way anyone else learns art.

            This is a very naive view of machine learning. It's very impressive that we can train models on large quantities of unsupervised data like this, which can definitely give the impression to a layperson that it "learns" like a human does. But it does not learn in ways that map onto how humans learn.

            One of the easiest-to-understand ways in which its learning differs from a human artist's is that it does not choose what it's trained on in any way. The datasets used for training these giant models are collected and curated by those designing/training the model (and for many of the models at issue, they were collected with an apparent total disregard for the licenses of the included works). Even without getting into how totally dissimilar training these models is to how a human learns something like art or language, it's simply not true that the way humans research and acquire these works is the same as how a machine learning model does.
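
            To make the curation point concrete, here is a minimal, hypothetical sketch - the records and the license field are invented for illustration, not any real dataset's schema:

            ```python
            # Hypothetical sketch: the "choosing" happens in a curation pipeline
            # written by the model's developers, not inside the model itself.
            crawled = [
                {"url": "https://example.com/a.png", "license": "CC-BY-4.0"},
                {"url": "https://example.com/b.png", "license": "all-rights-reserved"},
                {"url": "https://example.com/c.png", "license": None},  # unknown
            ]

            def curate(records, respect_licenses):
                """Return the subset of scraped records that goes into training."""
                if not respect_licenses:
                    return records  # many large scrapes effectively did this
                permissive = {"CC-BY-4.0", "CC0-1.0"}
                return [r for r in records if r["license"] in permissive]

            print(len(curate(crawled, respect_licenses=False)))  # 3: take everything
            print(len(curate(crawled, respect_licenses=True)))   # 1: licensed only
            ```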

            5 votes
            1. [2]
              the9tail

              But you are splitting hairs here.

              The computer can do cubism based on cubism of other artists. The new artist learns cubism by looking at the work of other artists.

              Sure, a computer doesn’t learn in the same manner as a human specifically, but it is looking at images, looking for similarities to other images, grouping them together, and using the structures to create different structures.

              1 vote
              1. sparksbet

                You're anthropomorphizing a bit much. Machine learning models apply a bunch of complex math to the images they're trained on and tweak some of the variables based on how well they perform on some metric. They don't look for similarities to other images, group them together, or use the structures to create different structures. They do a bunch of math on the images and tweak variables until that math is a very good statistical representation of the domain they were trained on. This statistical representation can get really fucking good, which is why these models are so impressive. But the type of conscious "looking at images and looking for similarities to other images" you describe is nothing like how these models are trained.
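
                The "doing math and tweaking variables" loop fits in a few lines. A toy sketch fitting a single parameter by gradient descent - the same mechanism, at an incomparably smaller scale, that trains image models:

                ```python
                # Toy training loop: no "looking for similarities", just adjusting a
                # variable to shrink an error metric over the training data.
                data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (input, target) pairs

                w = 0.0    # the model's single "variable"
                lr = 0.01  # learning rate

                for _ in range(2000):
                    # Gradient of mean squared error for the model y = w * x.
                    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
                    w -= lr * grad  # tweak the variable against the metric

                print(round(w, 2))  # ~2.04: a statistical summary of the data
                ```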

                You're also ignoring my point that when a human learns about art, they are going out and looking at art themselves and making choices about what to view and learn. Models like this have no ability to make conscious choices about their training data and are just fed images one after another (often the same image more than once) until a human decides it's good enough and stops training. The way in which "artistic influences" are acquired for human artists vs machine learning models could not be more different.

          3. [3]
            public

            LLMs are such a buzz now that Meta made theirs open source. Which means LLMs are in the hands of everyone, including companies that ignore copyright or rules

            In the art world, Stable Diffusion did this, too. They are the ones I respect for being a digital Prometheus and freeing the ML "magic" for the rest of us.

            A world where you can go to a website, “write Terminator 2, replace Arnie with Stallone, and set it in the Old West and replace robots with molemen in disguise” and wait 5 mins as the bar on the screen fills.

            I'll never stop finding it sad how people use similar examples as a negative future. It's conceptual creativity unbounded by monetary (and, hopefully, copyright) constraints.

            2 votes
            1. [2]
              the9tail

              Don’t know if you are picking up what I am putting down :)

              I am all for this. Why should art be exempt from the advancement of human society? It shouldn’t, just like every other job that’s been lost to innovation in the past and will be lost in the future (robots will replace us all, and it’s about time).

              People are doomsaying but don’t see that humans aren’t lost in the process. We are just moving from ownership of what we make to ownership of the idea behind it, and letting something else do the making.

              1. public

                I advocate for transcending the idea of ownership of ideas and processes altogether. Force them into the common aether.

    3. [6]
      Greg

      I’d strongly argue that reducing the quantity of work required by humans is an underlying good for society, not a harm. Admittedly the majority of those companies are also the ones lobbying for the status quo in terms of working patterns and wealth distribution, so I’m under no illusion about their reasoning or the likely short term outcomes, but I think it’s dangerous to frame the issue as “jobs” or “work” being something inherently valuable and worth protecting when it’s actually a resource allocation question.

      If we’re not careful about the framing, we just go further down the bullshit jobs rabbit hole, and I feel like we’d already hit the point of absurdity there even before ML and remote working came along to shake things up. I’m not exactly an optimist either way, but that’s exactly why I don’t see maintaining the status quo as a viable possibility: either those currently in power exploit new tech for further gain, or the population as a whole takes the opportunity to renegotiate the social contract. Standing still while there’s this much opportunity on the table doesn’t seem like something those who stand to gain would ever entertain.

      3 votes
      1. [5]
        feanne

        I’d strongly argue that reducing the quantity of work required by humans is an underlying good for society, not a harm.

        I agree with this and would love to see more AI taking on dangerous work, such as repairing deep-sea cables, sorting garbage, and handling toxic waste. Not too keen on AI reducing work opportunities in the creative industries. I thought the whole idea of AI reducing human workload was so that we'd be free to do more of the fun stuff, not less.

        I mean, I guess it would be fine for AI to replace human workers if we already lived in a society where humans generally don't need to work to live. But since we don't, and since systemic change doesn't happen overnight, I still see "work opportunities" as worth protecting because people need that to live.

        I agree with your concern re. those in power exploiting new tech, and also agree with you that the main issue is resource allocation. We definitely have to grapple with how technology is often used by the powerful to maintain the status quo.

        6 votes
        1. [4]
          Greg

          I thought the whole idea of AI reducing human workload was so that we'd be free to do more of the fun stuff, not less.

          Wait, there was a goal this whole time?! In all seriousness, I do actually think that a lot of tech comes into existence because either a researcher finds it exciting or it's needed for a niche scientific purpose, and it's only later that the wider uses and implications get figured out. To me, that's important, because it suggests that the genie is out of the bottle before there's a realistic opportunity for anyone to realise it, and that in turn means that progress and change - even risky, potentially destructive progress and change - is effectively inevitable.

          Through all that change, I can say with confidence that machines will never take art from humans, because art is inherently about self expression - the act of creation is a goal in and of itself. As I see it the only way to free up humans' time to do more of the fun stuff is to decouple the doing of it from the getting paid for it.

          I mean, I guess it would be fine for AI to replace human workers if we already lived in a society where humans generally don't need to work to live.

          My thoughts exactly - and while I know we're hardly close to that point, it frustrates me that we have enough resources that we could be. The number of conversations I've seen with people even advocating for things like banning self checkouts because they'd literally rather force a human to unnecessarily spend the precious hours of their life standing and scanning items all day rather than just giving them the fucking money anyway is... depressing. It's not something I'm accusing you of doing, at all, it's just the whole tenor of the wider conversation and I really struggle with that.

          It seems like we're so far down the road into an economy where some enormous percentage of working hours are broadly unnecessary already, and so few people are acknowledging it that I almost feel like I'm the crazy one. I feel the need to say it loudly because I worry that we'll never make the change if we don't first shift the conversation.

          But since we don't, and since systemic change doesn't happen overnight, I still see "work opportunities" as worth protecting because people need that to live.

          I think the subtext to what I'm saying is I don't see that as particularly possible. Call it productive cynicism, if you will: there's value to be gained from these new tools, and that means standing still (in this case by preserving the jobs in the style and volume that currently exist) is not an option. I genuinely don't see a realistic near-term framework by which we could compel businesses to keep employing humans for things a machine could do some combination of faster, cheaper, and/or better.

          If I'm right in that assertion (and I accept that's the big "if" here), then the only thing to be decided is how the value is distributed: does it return to the workers, in the form of shorter hours for similar standard of living, or does it collect at the top at the expense of everyone else?

          2 votes
          1. [3]
            feanne

            Through all that change, I can say with confidence that machines will never take art from humans, because art is inherently about self expression - the act of creation is a goal in and of itself. As I see it the only way to free up humans' time to do more of the fun stuff is to decouple the doing of it from the getting paid for it.

            I agree with both of these points. The only caveat being the possibility of AI becoming sentient, and actually expressing itself. And that's a whole other philosophical discussion :))

            The number of conversations I've seen with people even advocating for things like banning self checkouts because they'd literally rather force a human to unnecessarily spend the precious hours of their life standing and scanning items all day rather than just giving them the fucking money anyway is... depressing.

            I'm sorry that you've seen a lot of this type of sentiment. What's their reasoning for wanting to force someone into a tedious job? It seems unnecessarily mean. While I personally wouldn't advocate banning self checkouts, I would voice concern in case the advent of any tech drastically causes a lot of people to lose jobs without alternative opportunities or safety nets. With generative AI for example, I'm concerned about the call center industry here in the Philippines. There are entire economic zones here that are dependent on this industry. It's not that I want people to be stuck in draining work handling difficult phone calls, but I would want them to have other options or safety nets in case generative AI causes them to lose their jobs.

            I genuinely don't see a realistic near-term framework by which we could compel businesses to keep employing humans for things a machine could do some combination of faster, cheaper, and/or better... the only thing to be decided is how the value is distributed: does it return to the workers, in the form of shorter hours for similar standard of living, or does it collect at the top at the expense of everyone else?

            If a business treats humans as disposable and easily replaceable by machines, then it's unlikely to distribute value fairly to its workers.

            1 vote
            1. [2]
              Greg

              What's their reasoning for wanting to force someone into a tedious job?

              This is a very good question - from what I've seen I'd say it's a combination of people literally just not considering that there's enough productivity for some people to exist comfortably without a job (or with a 2-3 day workload), people who think jobs are economically valuable in and of themselves, and people who think "hard work" is a moral good in and of itself. I just hope that there are more of the first than the second two.

              As to the rest, again I think we're pretty much on the same page; if anything I'm just jumping ahead to the practicalities. If we do nothing, the jobs go away - and like you said, the owners who do that will just take the profits for themselves.

              Keeping the jobs even though it costs the business more would inherently mean legislating, and I really don't even know what that would look like, let alone have a good idea of how to construct a law that wouldn't be outdated in a matter of months at the rate things are changing. The government would be going head to head with the business leaders, and doing so on a shifting foundation that the lawyers could barely keep up with, only to maintain the status quo. If we're in a world where the government is willing to do that anyway, it seems like they may as well just place a high windfall tax on the automation and use it to fund the safety net? If we're in a world where the government aren't willing to do that, I'm not seeing how the jobs would be protected.

              1. feanne

                This is a very good question - from what I've seen I'd say it's a combination of people literally just not considering that there's enough productivity for some people to exist comfortably without a job (or with a 2-3 day workload), people who think jobs are economically valuable in and of themselves, and people who think "hard work" is a moral good in and of itself. I just hope that there are more of the first than the second two.

                I see, thanks for sharing. Personally I don't think hard work is inherently a moral good, but I do think that humans deserve to have opportunities to do meaningful work.

                If we're in a world where the government is willing to do that anyway, it seems like they may as well just place a high windfall tax on the automation and use it to fund the safety net?

                Yes, something like this has been suggested by information law scholar Ben Sobel in his paper Artificial Intelligence’s Fair Use Crisis, page 90 under section III "What Can Be Done?", part A "Levies". (The rest of his paper is really insightful too, I highly recommend it to anyone interested in generative AI and copyright law.)

                Personally I would try not to focus too much on the question "how can we compel businesses to retain their employees who might be easily replaced by AI".

                I would rather look for multiple approaches to "how can we prevent big corporations from consolidating too much power?" and "how can we expand social safety nets to support more people in general, and how can we make more opportunities accessible to more people?". Re. mitigating corporate power, I've already commented previously re. examples of the types of AI-related regulations I would support for this.

                1 vote
  3. [3]
    ourari

    Fed up with A.I. companies consuming online content without consent, fan fiction writers, actors, social media companies and news organizations are among those rebelling.

    Mirror in case of paywall: https://archive.is/KRHVh

    15 votes
    1. [2]
      phoenixrises

      mark this as a joke pls but i feel like there's a certain form of irony here

      9 votes
      1. Trauma

        The first thing you do after you clawed your way into the Palace of Plenty is to close the door behind you.

        7 votes
  4. gpl

    I highly recommend the book “Who Owns the Future?” by Jaron Lanier (if you don’t know him, he’s definitely an interesting figure to look up), which makes a very interesting argument along these lines. The details are a bit fuzzy for me now, but I remember being struck by his explanation of the problem and his creative ideas for solutions. One was “data unions” (akin to labor unions) that people can join and which allow them to control how their data is used and how they are compensated for that. The big caveat being that for this to work, data about someone must be inseparably linked to that person, which would require a restructuring of many of our existing networks (two-way links similar to what Berners-Lee originally intended for the web).

    Even if this sounds unworkable I recommend checking the book out, since I am definitely not doing the argument justice. But it’s certainly a different perspective than I was used to, and it made me rethink a lot of my beliefs regarding data and data privacy (should information really be ‘free’ when the average person doesn’t have the computational resources to compete with massive corporations? Think along the lines of arguments concerning money in politics).
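
    A rough sketch of the two-way-link idea described above, with every name invented for illustration: content carries permanent back-links to its sources, so attribution (and, in Lanier's proposal, compensation) can be traced through derivatives:

    ```python
    # Conceptual sketch only: content that can't be separated from its origin.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Content:
        author: str          # provenance travels with the content
        text: str
        sources: tuple = ()  # back-links to what this derives from

    def credit_chain(item):
        """Walk the back-links to find everyone owed attribution."""
        owed = {item.author}
        for src in item.sources:
            owed |= credit_chain(src)
        return owed

    post = Content("alice", "original essay")
    remix = Content("bob", "summary of the essay", (post,))
    print(credit_chain(remix))  # {'alice', 'bob'} (set order may vary)
    ```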

    13 votes
  5. [5]
    blindmikey

    If you're open to being crawled for SEO, you should be open to being scraped for AI. Preventing AI from learning is not going to end well. Besides, (in the case of Reddit at least) your users made that content to be shared; it's not yours to gatekeep.

    9 votes
    1. TallUntidyGothGF

      If you're open to being crawled for SEO, you should be open to being scraped for AI.

      Why? SEO directs people to my content (and perhaps I was aware that it supports business analytics, use in research studies, learning of databases, though I'll admit I'm far from the average user). Meanwhile, an LLM strip-mines it for a contribution to its knowledge base, where my content and its provenance will be forever lost in the convolution of its weights. The LLM will recruit it into a system that will contemporaneously profit from it and render it redundant. It will contribute to a system that will concentrate capital, alienate people even more from each other and the things that they produced, and I don't consent to that.

      Moreover, and whether you agree or not with the previous: I was aware of the former use when I made the content and put it online, I was not aware of the latter. Legal or not, I don't think it's fair to expect that by consenting to some public use of the content we produce, we consent to all future uses in perpetuity.

      Besides, (in the case of Reddit at least) your users made that content to be shared; it's not yours to gatekeep.

      Whatever is the legal reality: I fully agree with you that the users made that content, and that the users should own it, and decide what should be done with it - not Reddit (or whoever). But I suppose the horse has already circumnavigated the globe on that one!

      Preventing AI from learning is not going to end well.

      Would love to hear more of your thoughts on this. I personally see it as essentially inevitable that LLMs will learn from all retrospective data, either directly or indirectly through input from current models that were already trained on the data, or will be trained by people who don't need to be publicly legally liable (e.g. via anonymity or protection on account of the political system they happen to be governed by) once the means of producing LLMs becomes more widely available. The LLM companies have been very wise to train their models and make them public and useful before giving anyone the chance to ask these questions publicly (with knowledge of the enormity of the implications). The time for the solutions @gpl mentioned in their comment, data unions and the like, was before these models were trained. Of course, the next best time is now.

      17 votes
    2. [3]
      ourari

      Being crawled by search engines is different from being crawled for AI training sets. The understanding is that search engines help people to find your stuff where you put it for them to see. Scraping for AI boils down to taking your content and stripping it of its owner and context.

      https://searchengineland.com/crawlers-search-engines-generative-ai-companies-429389
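
      That distinction is already partly machine-readable: a site's robots.txt can admit search crawlers while refusing AI-training crawlers. GPTBot (OpenAI) and CCBot (Common Crawl) are real crawler names; the policy below is only an example, checked with Python's standard library. Compliance is voluntary, though - it only binds crawlers that choose to honor it:

      ```python
      # Example policy: allow everyone except known AI-training crawlers.
      from urllib import robotparser

      rules = [
          "User-agent: GPTBot",
          "Disallow: /",
          "",
          "User-agent: CCBot",
          "Disallow: /",
          "",
          "User-agent: *",
          "Allow: /",
      ]

      rp = robotparser.RobotFileParser()
      rp.parse(rules)

      print(rp.can_fetch("Googlebot", "https://example.com/post"))  # True
      print(rp.can_fetch("GPTBot", "https://example.com/post"))     # False
      ```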

      Yes, people have shared things on Reddit, with the understanding that they were sharing it on Reddit and specifically with the communities of particular subreddits.

      Would you really want to live in a world where anyone could take whatever they see? Better close your curtains and hide anyone and everything you care about.

      12 votes
      1. [2]
        public

        Being crawled by search engines is different from being crawled for AI training sets.

        Or exactly the same, if the crawler is Googlebot. They probably already had the data from search crawling and didn't need a second scrape for AI training.

        1 vote
        1. ourari

          Not the same, as the difference I pointed out is about the intention behind it, the implicit agreement.

          3 votes
  6. [2]
    Macil

    I'm surprised by the sentiment. I like knowing my posts might help inform or influence people or AI, and I like things that increase the reach and chance of this. I don't feel cheated if someone learned something or got financial success after reading my posts. Maybe in a really fair world where everything was tracked and quantified, it would be arranged so I get some kickback for that, but in the case of AI any single person's fair payout is probably much smaller than a penny so I can't feel too strongly about the fact I don't live in that world. I like knowing that many people in the world and GPT might have a tiny slice of me through my writing.

    It feels apparent to me that the outcome of trying to prevent AI scraping is a more closed internet, with more moves like Reddit blocking off public API access.

    7 votes
    1. feanne

      I think that a data harvesting framework based on respecting creators' consent would actually facilitate your desire to have your work be scraped, too. Like there should be an easy way for creators to indicate that they consent to data scraping, and an easy way for AI tech builders to find and use this content. Spawning is one such initiative that's trying to set up this kind of system. In this system, your desire to allow scraping of your own content does not interfere with others' desire to opt out, and vice versa.
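
      As a sketch of what such a consent system could look like mechanically - the registry URL and response format here are invented for illustration and are not Spawning's actual API:

      ```python
      # Hypothetical consent-aware ingestion: ask a registry before using a work.
      import json
      import urllib.parse
      import urllib.request

      REGISTRY = "https://consent-registry.example/check?url="  # invented URL

      def creator_consents(work_url):
          """True only if the creator affirmatively opted in; fail closed."""
          try:
              query = REGISTRY + urllib.parse.quote(work_url, safe="")
              with urllib.request.urlopen(query) as resp:
                  return json.load(resp).get("consent") == "opt-in"
          except OSError:
              return False  # no answer means no consent

      def ingest(work_url, dataset):
          if creator_consents(work_url):
              dataset.append(work_url)  # only consented work enters the set

      training_set = []
      ingest("https://example.com/artwork.png", training_set)
      print(training_set)  # stays empty unless the registry affirms consent
      ```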

      7 votes