44 votes

Google updates its privacy policy to clarify it can use public data for training AI models

24 comments

  1. [2]
    Deimos
    (edited )
    Link

    This seems awfully sensationalized when the reality is that they changed a few words in their privacy policy, and it's just a clarification of something that was already there, not any kind of new capabilities. The diff between versions is linked in the article, but this whole article is based on them changing:

    For example, we may collect information that’s publicly available online or from other public sources to help train Google’s language models and build products and features like Google Translate.

    to

    For example, we may collect information that’s publicly available online or from other public sources to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.

    32 votes
    1. soymariposa
      Link Parent

      Exactly and this is the key point. What Google is doing isn’t new at all, instead they are just putting out the legalese to avoid lawsuits etc.

      Between 2017-2019, I worked for Lionbridge doing content analysis of sites and Google home recordings. For the site analysis, it was grading the alignment between the query and the result/the quality of the result, and for the recordings it was grading how well the speech was recognized into something useful (and boy howdy are there a lot of folks out there with some crazy and sloppy diction). At no point was I told what company Lionbridge was contracted by (although it was obvious if you thought about it) nor what the purpose of the work was. It’s only in retrospect that I realize the work was about much more than improving search algorithms. I sort of feel like a chump actually that I didn’t see the point.

      8 votes
  2. [3]
    ourari
    Link

    Google updated its privacy policy over the weekend, explicitly saying the company reserves the right to scrape just about everything you post online to build its AI tools. If Google can read your words, assume they belong to the company now, and expect that they’re nesting somewhere in the bowels of a chatbot.

    I guess even Tildes is now just a hive of worker bees to and for Google?

    15 votes
    1. [2]
      Bonooru
      Link Parent

      Presumably, it's "only" the things that are being indexed. Though it is concerning either way.

      5 votes
      1. [2]
        Comment deleted by author
        Link Parent
        1. R51
          Link Parent

          It should absolutely be noindex'd. AI is death.

          3 votes
  3. [7]
    bioemerl
    Link

    And that's how it should be. As long as Google's not trying to lock it all behind a paid API, they're perfectly in the right to scrape information from the web and use it to train bots.

    It's on the web, it's public, if you don't want people to be scraping it don't put it on the open internet.

    The internet is an open place, people; you should expect it to be used for things like training AI. Google is kind of a good guy here, and they would especially be the good guy if they made their dataset open, because they are not acting like OpenAI or Reddit or Twitter and trying to lock away your information so that they can make a profit.

    Instead they're going to take that information and turn it into a useful tool. They're advancing humanity by doing so, and we should be happy with it.

    Although I would prefer they make any data they do scrape an open data set, so that other people can benefit from it as well and it stays in the open spirit.

    I also would prefer that they make any AI models they train open so that other people can run them and use them.

    13 votes
    1. [4]
      ourari
      (edited )
      Link Parent

      It's on the web, it's public, if you don't want people to be scraping it don't put it on the open internet.

      That amounts to victim blaming, imo. Yes, it happens. Yes, we know that's how it works in practice, but it doesn't seem legal or ethical for Google to do that.

      When I post something to Reddit, I grant Reddit the license to use it. I don't grant Google anything. Scraping everything is not fair use, either.

      Scraping for their search engine is different, because that exists to point people to the original source. But taking anything and everything to use it without license or consent to train AI would probably run afoul of data protection and copyright laws.
      Unfortunately, by the time a judge has had the chance to say anything about it, the damage will have been done, billions will have been made, and the inevitable fines will just be the cost of doing business. Google's past privacy and antitrust violations have shown as much.

      10 votes
      1. nacho
        Link Parent
        • Exemplary

        It's on the web, it's public, if you don't want people to be scraping it don't put it on the open internet.

        That amounts to victim blaming, imo. Yes, it happens. Yes, we know that's how it works in practice, but it doesn't seem legal or ethical for Google to do that.

        I want to push back against this instinct. These are not reasonable views to hold in a liberal western society. There is absolutely nothing shady, unexpected or illegal about scraping anything and everything that's public information.


        Scraping everything is not fair use, either.

        Per the Google Books case, in the US that's exactly what it is.

        In the EU, the Directive on Copyright in the Digital Single Market (DSM) from 2019 explicitly codified the legality of text and data mining. This was not revolutionary new law but, for most EU jurisdictions, a clearing-up of common precedent.


        "Browsewrap agreements", those would be terms of use for a website that you don't actually agree to before use, probably aren't legally binding. So sites that expose content to the public have to expect to be scraped.

        The whole model of our societies is that through a flow of public information, like business registers, property registers, news, economic data, research, books, public records of all sorts, court judgements, and so on in pretty much every field, businesses and people can make the most informed decisions about their lives.

        The whole idea is that everyone in a (regulated) free market will use all public information; that's how we ensure fair competition, effective markets, and societal progress.

        Copyrights are rights that are seriously, seriously limited by fair uses. Many types of fair use or indirect use of copyrighted material allow me to systematically profit from, or even build businesses around, the copyrighted IP of others. There are several industries where markets do not expect that cutting-edge technology will be made public (for instance through patents) until a business has moved ahead by at least a couple of generations of improvement. That is because registering/securing IP rights in itself makes competitively advantageous data available to others publicly.

        Many actions we perform are by definition public actions, whether we want them to be or not. Those are the foundations of our society.

        In this context it would be absolutely absurd to hold that online content would somehow play by totally different rules.


        Now I don't doubt many have the same intuition that "I don't like others scraping all the stuff I've shared online!"

        I think that, especially at the start of social media, a lot of people did not actually understand that they were publishing something publicly in ways that could garner way, way more viewership than being on the front page of the local paper, or an announcement over the PA system at school.

        Now in hindsight, I think a lot of people just don't like that others can make money off things they posted in public for free, or they somehow feel the right to a cut.

        That's not how it works in the rest of society, so it's an unreasonable expectation online too.

        9 votes
      2. unkz
        Link Parent

        Reddit lets Google scrape. They are perfectly capable of disabling scraping — Google very clearly makes this possible via robots.txt and other mechanisms. In fact, Reddit very much wants Google scraping their content.

        4 votes
      3. bioemerl
        Link Parent

        but it doesn't seem legal or ethical for Google to do that.

        It's both legal and ethical. Copyright doesn't protect the use of material for training; it only protects the ability to reproduce and redistribute, or to make "derivative works" (which implies things like sequels and modifications, not AI models, which serve a totally different purpose and exist in a totally different category).

        When you post to Reddit, you give them permission to redistribute what you post, and they produce a copy which they give to Google. Google doesn't violate copyright.

        The only time a copyright claim might be valid here is for those who can prove their works are generated by the AI directly in a case of overfitting, in which case they have a valid argument that Google is actually redistributing their stuff without permission.

        by the time a judge has had the chance to say anything about it,

        In all certainty, the judge will say it's fine.

        1 vote
    2. [2]
      rish
      (edited )
      Link Parent

      Google is kind of a good guy

      For now. When they can get away with it they'll stop playing the good guy. They always do.

      Look at Google Chrome, and Android now. They used Manifest V3 to kill ad blockers. They are trying their best to make Android a walled garden like iOS.

      Edit.

      Wrong edit removed

      6 votes
      1. bioemerl
        Link Parent

        This is true. But I'll take for now over the companies actively locking things down and keeping people out.

        I would happily take any other alternative that doesn't do this but unfortunately we don't have that much choice right now.

  4. skybrian
    Link

    You can see the diffs here. These look like minor updates. Previously, they gave Google Translate as an example, and now Bard and Cloud AI too.

    The lawyers are making explicit something Google has done since it was founded. PageRank could be seen as an early form of machine learning where article popularity is learned from links. Early web search engines didn't ask for permission before downloading web pages. There is an opt-out standard using robots.txt. Google will stop if a website asks, and Mastodon even has a checkbox where you can opt out of being indexed, which is a nice feature.

    Using robots.txt or maybe marking links with "nofollow" is the way to go as far as Google is concerned.
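
    For anyone curious what that opt-out actually looks like, here's a minimal sketch (the robots.txt rules and URLs are invented for illustration, not taken from any real site) of a site keeping Google's crawler out of one section, and of how a well-behaved crawler checks the file before downloading anything, using Python's standard urllib.robotparser:

        import urllib.robotparser

        # Hypothetical robots.txt a site could publish to keep Google's crawler
        # out of one section while leaving the rest open.
        ROBOTS_TXT = """\
        User-agent: Googlebot
        Disallow: /private/

        User-agent: *
        Allow: /
        """

        parser = urllib.robotparser.RobotFileParser()
        parser.parse(ROBOTS_TXT.splitlines())

        # A crawler that respects the standard asks before fetching each URL.
        print(parser.can_fetch("Googlebot", "https://example.com/private/post"))  # False
        print(parser.can_fetch("Googlebot", "https://example.com/public/post"))   # True

    The Mastodon checkbox and "nofollow" work at different layers, but the idea is the same: the site states a preference, and compliant crawlers are expected to honor it.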

    I think Google's results getting worse is a bad thing, assuming it's really happening. (I'm not sure about that.) I don't write for search engines or AI's, but I don't feel the need for any compensation if the writing I publish makes searches a tiny bit better, and that goes for making Bard or ChatGPT a tiny bit better too.

    (I've been very well compensated by Google already, though, so I'm biased.)

    6 votes
  5. [5]
    Bipolar
    Link

    When I first started using the internet it was understood that everything you posted to the internet was on here forever. That's not really true, and the internet is very different now, but I used to feel that if it's indexable by Google or any other crawler, you are in a public space and public-space rules apply, so don't say or post anything you wouldn't say in public or want to have recorded.

    5 votes
    1. [4]
      ourari
      Link Parent

      When I first started using the internet it was understood that everything you posted to the internet was on here forever

      I think that became a popular concept a few years after I dialed in for the first time. But something being on here 'forever' as historical record doesn't necessarily grant a company permission to lay claim to all of it to use for profit.

      you are in a public space and public space rules apply so don't say or post anything you don't wouldn't say in public or want it to be recorded.

      There are cultural and jurisdictional differences about privacy in public. Here in Europe several countries have laws that govern what can and cannot be recorded in public. It's not a lawless free-for-all.

      3 votes
      1. Bipolar
        Link Parent

        True, it's a very Americentric perspective, but it's one of the few things I think we do better than Europe; then again, that's a matter of preference and culture really. However, it's the way I use the internet: I try to limit what I post to things I wouldn't care about someone else reusing. I see why people are mad, and I honestly don't know what the solution is. I was just reading about how Gizmodo is doing away with human writers for AI-written articles...

        I also just don't see how US and even EU companies can compete in the AI space if they don't do this. I mean, I don't know anything about how these models are trained and/or whether they really need all that data, so maybe there is a solution that works and also doesn't use all the indexable data on the internet. But all that data will be used in China, and in the US by the alphabet soup of US intelligence agencies, anyway; they are probably already training models in their massive data centers.

        1 vote
      2. [2]
        Kitahara_Kazusa
        Link Parent

        Wouldn't that be kind of a meaningless distinction if Google's AI division is based in the US and thus doesn't have to follow European law? I'm not an expert on how Google is structured but these multinationals generally have branches that only need to obey local laws.

        I.e., Apple's smartphone factories in China don't need to follow EU safety standards.

        1 vote
        1. ourari
          (edited )
          Link Parent

          If they use data of people who reside in the EU, they need to be in compliance with whatever the current EU-U.S. data sharing pact is (it's been struck down twice by European courts thanks to Max Schrems/NOYB.eu). (related news)

  6. stu2b50
    Link

    Wasn't this always the case? It's not like Google still uses PageRank. It's well known that the search algorithm uses machine learning these days (which is such a broad category that it's not particularly saying much; k-means clustering is considered machine learning), and if you're not using the contents of the sites, what are you using for features?

    "Artificial intelligence" isn't a real definition with meaning. Presumably they just mean for an LLM instead of the search algorithm.

    5 votes
  7. the9tail
    Link

    I mean, that’s what all LLMs are, right? Scraping the internet for language patterns?

    Google is just modifying what they already do for a search engine and attaching a brain to put it all together.

    3 votes
  8. [2]
    seanbon
    Link

    Time to create a non-gmail account, sigh.

    Think I'll be checking out Proton Mail: https://www.privacyguides.org/en/email/

    2 votes
    1. rickartz
      Link Parent

      You'll have to get out of Tildes too, because we are being indexed by Google. But it would be very sad for you to leave a lovely site like this one just because everyone is scraping information all over the internet.

      I have heard good things about Proton, so that could be a good thing. Just don't overdo it and stop using all the services you like because some megacorp is ruining everything. Don't let them ruin it for you.

      4 votes
  9. SuperJerms
    Link

    I really don't understand the negative reaction to LLMs using other people's work, particularly invoking the question of copyright.

    Copyright exists "to promote the progress of science and useful arts." It's tough to think of a better example of this progress...not just because of the direct output of the machine itself, but also the "force multiplier" aspect of what happens when such a productive tool becomes widely available. On this explosion day, who better to speak to this idea than the king of marches, John Philip Sousa, and his essay, "The Menace of Mechanical Music."

    Time and again throughout history, creatives fear that technology will run them out of business. Time and again, technology that makes reproduction easier, better, and more accessible has led to infinitely more creation than before that technology existed.

    But say one disagrees. From a legal standpoint, we've got a reasonably clear path to knowing if copyright is either violable (e.g., satire and education) or inapplicable (e.g., transformative works). A publicly discovered dataset being used to train something and not being published seems like the very definition of an educational use. Output of a machine that creates something new from many sources seems like the definition of transformative.

    From a "profit" standpoint, there are also plenty of cases defining lines where a derivative work does or doesn't infringe. For one, there's the question of potential markets. Oversimplifying a bit here, but making money from someone's work explicitly isn't wrong unless it's depriving them of potential income. If I gave those words to Twitter for free, I'm probably not harmed by an AI using them later, and whether OpenAI or Google made money from it along the way is moot.

    But it's not like AI just spits out those old tweets wholesale, anyway. They synthesize, remix, and ultimately create something new based on patterns in style and old ideas. I'm skeptical that any ideas expressed within the last 20,000 years are actually new, but even if they were, the relevant government body would say, "copyright does not protect facts, ideas, systems, or methods of operation."

    1 vote
  10. kingthrillgore
    Link

    I'm already moving off Google for everything, am I gonna haveta use robots.txt at this point?

    3 votes
  11. Comment removed by site admin
    Link