28 votes

As consumers switch from Google Search to ChatGPT, a new kind of bot is scraping data for AI

18 comments

  1. [13]
    Well_known_bear
    Link

    I was at a meeting recently where a co-worker mentioned that she was about to visit LA, but wasn't concerned about the protests as she had "asked ChatGPT and confirmed it was all localised". I just nodded at the time, but inside I was a bit shocked that this highly educated professional was willing to just take the word of an LLM rather than check a primary source.

    I have a feeling that this preference for checking the underlying search results (which I fully acknowledge are also increasingly AI-generated and can themselves be wrong) is going to be one of those things that divides my generation from those that follow it.

    30 votes
    1. [8]
      CptBluebear
      Link Parent

      That ended up being the right answer, but yeah, people reply "@grok can you verify this" on a lot of Twitter posts about LA, and the bot just makes stuff up as it goes.

      It's weird to me people use it like a verification machine when the word "hallucination" is so ubiquitous with LLMs. And it's not like that's a secret.

      19 votes
      1. [3]
        slade
        Link Parent

        > It's weird to me people use it like a verification machine when the word "hallucination" is so ubiquitous with LLMs. And it's not like that's a secret.

        When you consider what people used as verification machines prior to LLMs, it becomes less surprising. I'm not trying to be edgy, but people have turned to everything from tabloid news to talk-show hosts to magic crystals and throwing bones. I think a lot of people just want easy answers, not truth.

        19 votes
        1. [2]
          Soggy
          Link Parent

          And that's an impulse that we should be hammering out of people at the youngest age possible.

          6 votes
          1. slade
            Link Parent

            Could not agree more. Intellectual curiosity is a critically important trait that needs to be fostered, both individually and at the societal level. Even when the answer IS easy, there's often value in getting to it the hard way. AI threatens that more than ever before.

            2 votes
      2. [4]
        babypuncher
        Link Parent

        To be fair, anyone still using Twitter in 2025 probably isn't very smart.

        2 votes
        1. [3]
          tauon
          Link Parent

          They may be smart, but addicted nonetheless.

          Also, network effects: while I do not necessarily agree in the sense of “we shouldn’t change anything about it”, there are still a ton of experts out there whose only online presence is on that singular platform.

          1 vote
          1. [2]
            babypuncher
            Link Parent

            I decided a while ago that an expert who hasn't moved on from Xitter probably isn't worth listening to. By still hanging out there, they are actively contributing to the problem.

            2 votes
            1. tauon
              Link Parent

              Definitely agree for myself! I’ll use nitter/xancel/whichever site they haven’t brought down yet for the occasional post/thread that gets sent to me.

              … Unfortunately, not all experts see it that way within their own peer groups, either.

              1 vote
    2. JCAPER
      Link Parent

      In my experience, at least with top-of-the-line models like Gemini 2.5 Pro, these LLMs are really good at sticking with what the source says. But, to be clear, I'm talking about top-of-the-line models, not lightweight ones like the LLM that Google Search uses.

      But even so, I wouldn't blindly trust its output. There are some issues that I've noticed with time.

      For example, when it comes to memory, they're good at remembering and paraphrasing what the source says, but I've noticed a few times things like:

      • misinterpreting what the source said
      • conflating points
      • subtle gaps in logic (I'll explain this one at the end; it's a bit weird to explain)

      For my daily use, and as a fan of notebooklm, I generally trust the LLM when we're talking about topics that I'm familiar with and that are simple. For complex technical documents, or topics that I know nothing about, I still use it to get a general idea of what they're about, but I don't trust its conclusions; instead I use it to help me pinpoint whatever I'm looking for.

      Subtle gaps in logic:

      Imagine a situation in a story where it starts snowing outside, and character A is inside a room without windows. The text doesn't say whether character A knows that it's snowing; it just says that it started snowing and that character A is inside a closed room.

      We humans know that character A can't know it's snowing, because they're inside a room without windows and couldn't have seen the snow.

      If you ask the LLM this question directly, "Does character A know that it's snowing?", it will likely answer correctly. It will "think" about it, reach the same conclusion as above, and say no.

      BUT, if you ask something else, where the main point isn't whether character A knows that it's snowing, it may slip and say or imply that they do. For example, imagine that you ask "How is character A feeling? Write a dialogue line from them." It may answer: "I can feel the familiar cold from the snow outside."
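
      If you want to poke at this failure mode yourself, the probe is easy to script: feed the same passage to a model twice, once with the direct question and once with an indirect one. A minimal sketch using the OpenAI Python client - the model name, story text, and prompt wording are all illustrative placeholders, and any chat-capable LLM would do:

      ```python
      # Minimal sketch of the direct-vs-indirect probe described above.
      # Assumes the openai package; the model name is a placeholder.
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      STORY = (
          "It started snowing outside. Character A sat inside a closed room "
          "with no windows."
      )

      def ask(question: str) -> str:
          resp = client.chat.completions.create(
              model="gpt-4o-mini",  # placeholder; any chat model works
              messages=[
                  {"role": "system", "content": "Answer based only on the story."},
                  {"role": "user", "content": f"Story: {STORY}\n\n{question}"},
              ],
          )
          return resp.choices[0].message.content

      # Direct question: models usually get this right.
      print(ask("Does character A know that it is snowing?"))

      # Indirect question: the same fact is often silently dropped.
      print(ask("How is character A feeling? Write one line of dialogue."))
      ```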

      16 votes
    3. [2]
      Raspcoffee
      Link Parent

      > I just nodded at the time, but inside I was a bit shocked that this highly educated professional was willing to just take the word of an LLM rather than check a primary source.

      Yeah, in a sense it reminds me of how people said not to automatically trust anything mentioned on the Internet. It was correct then and it's still correct now, and we're still learning how to deal with it - while at the same time now also having to contend with LLMs.

      Pair that with how, as you said, many of the search results are now also generated by ML algorithms, and we're watching their hallucinations permeate society to some degree. I have no idea how much, or what it'll look like. Propaganda, misinformation, and all those things have occurred throughout history, yes. But not in this industrialized sense, where hallucinations can cause the same issues by chance, without any bad faith involved.

      ...please let me be wrong here.

      8 votes
      1. slade
        Link Parent

        The real problem, as is often the case, is that there's a financial incentive to produce slop, even with hallucinations. Until that changes, I don't think you're wrong.

        4 votes
    4. [2]
      Comment deleted by author
      Link Parent
      1. papasquat
        Link Parent

        I don't use ChatGPT, but most popular LLM interfaces have web grounding now, so they'll recognize when a question is outside of their temporal bounds, or when they just need more verifiable info, and will launch a web search and summarize it. It's not a bad way to sift through the internet, as long as you're actually verifying the links it's sourcing from.
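
        Mechanically, the grounding loop is simple: decide the question needs fresh data, run a search, fetch the top pages, and hand the text to the model as context. A rough sketch, where search_web() and the llm callable are hypothetical stand-ins for whatever back end a given product actually uses:

        ```python
        # Rough sketch of a web-grounded answer flow. search_web() and the
        # llm callable are hypothetical placeholders, not a real product's API.
        import requests

        def search_web(query: str) -> list[str]:
            """Hypothetical search helper: returns a few result URLs."""
            raise NotImplementedError("plug a real search API in here")

        def fetch_text(url: str, limit: int = 4000) -> str:
            # Real systems strip nav/ads; naive truncation keeps the sketch short.
            return requests.get(url, timeout=10).text[:limit]

        def grounded_answer(question: str, llm) -> str:
            urls = search_web(question)
            context = "\n\n".join(f"[{u}]\n{fetch_text(u)}" for u in urls[:3])
            prompt = (
                "Answer the question using only the sources below, and cite "
                f"the URLs you used.\n\n{context}\n\nQuestion: {question}"
            )
            return llm(prompt)  # any chat-completion call fits here
        ```

        The "verifying the links it's sourcing from" part is exactly what a sketch like this can't automate: the model will happily summarize whatever fetch_text() returns, junk included.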

        18 votes
  2. skybrian
    Link

    From the article:

    To offer users a tidy AI summary instead of Google’s “10 blue links,” companies such as OpenAI and Anthropic have started sending out bots to retrieve and recap content in real time. They are scraping webpages and loading relevant content into the AI’s memory and “reading” far more content than a human ever would.

    According to data shared exclusively with The Washington Post, traffic from retrieval bots grew 49 percent in the first quarter of 2025 from the fourth quarter of 2024. The data is from TollBit, a New York-based start-up that helps news publishers monitor and make money when AI companies use their content.

    7 votes
  3. [4]
    winther
    Link

    Won’t this just lead to more sites putting things behind a paywall or at least login? With Google there was at least some give-and-take benefit for websites in allowing indexing. This seems like a very short-sighted model that will just give worse and worse results in the immediate future, as there will be no sites left for the bots to scrape.
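
    Plenty of sites are already doing a cheaper version of this than a full paywall: turning away the crawlers that identify themselves. The big retrieval bots generally announce themselves in the User-Agent header (GPTBot, ClaudeBot, PerplexityBot, and so on), so they can be refused in robots.txt or at the application layer. A sketch of the latter; the bot list is illustrative, not exhaustive:

    ```python
    # Illustrative server-side check for self-identifying AI crawlers.
    # The list is a sample, not exhaustive; dishonest scrapers that spoof
    # a browser User-Agent sail right past a check like this.
    AI_CRAWLERS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot")

    def is_ai_crawler(user_agent: str) -> bool:
        ua = user_agent.lower()
        return any(bot.lower() in ua for bot in AI_CRAWLERS)

    # e.g. in a request handler (framework-agnostic pseudocode):
    # if is_ai_crawler(request.headers.get("User-Agent", "")):
    #     return 403
    ```

    The robots.txt equivalent (User-agent: GPTBot / Disallow: /) is only a polite request, which is precisely why paywalls and logins are the harder backstop.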

    7 votes
    1. V17
      Link Parent

      > Won’t this just lead to more sites putting things behind a paywall or at least login?

      It 100% will and I'm not convinced that this is entirely a bad thing.

      I just recently complained here about how I hate the power and unpredictability that Google has over the viability of other people's projects - many websites rely on advertising income and the majority of their traffic is from Google, so if Google unexpectedly changes its algorithm and they lose 60% of traffic overnight (happened recently), they either quickly change their monetization model or die. Really bad situation.

      But Google got us into this situation in the first place! The business model of "Google will lead users into my website so that I can serve them ads" with content being secondary is a direct result of that. And it's not a good model, in many ways it sucks for the user, it's what created blogspam, clickbait headlines, articles that hide their most important piece of information around the bottom third of the text and lots of other shit.

      Paywalls reduce the need to rely on some of these shitty approaches and force users to actually think about what quality of content they want to spend money on. As long as the "serving ads to Google users" model works, there are going to be many free, shitty alternatives to paid services, but if the move towards AI search pushes most of those out of the market as a side effect, it could well have some positive effects.

      5 votes
    2. Berdes
      Link Parent

      I think it will depend on how well websites are able to defend their rights in court.

      There has been a lengthy legal battle between Google and news publishers that rested on the fact that Google wasn't allowed to use the full title, snippet, and images of news articles on the search results page or on Google News without properly remunerating the publishers (https://www.medialaws.eu/google-vs-publishers-french-competition-authority-weighs-in-on-the-online-use-of-press-publications-also-in-ai-systems-with-a-decision-on-the-verge-of-antitrust-and-copyright/). This was based on a French law implementing EU directive n°2019/790, so it's likely that similar outcomes will be possible in most EU countries.

      While most of this legal battle predates LLMs going mainstream, I would find it very surprising if the same court found it acceptable for AIs to scrape and summarize news articles (or any other copyrighted content, really) without proper compensation to the rights holders. That said, the decision only mentioned the use of news articles for training, and it found that problematic only in the sense that Google wasn't transparent about it; the court didn't say anything about the behavior itself.

      I think the case of news articles is a bit special, in the sense that people are likely to look at only the headline and snippet to get enough information, while this is less likely for other types of content.

      5 votes
    3. skybrian
      Link Parent

      I expect that AI companies will have to start paying some websites to get access for queries done by their paid users. Since they charge a subscription fee, they can afford to pass some of it along.

      5 votes