As consumers switch from Google Search to ChatGPT, a new kind of bot is scraping data for AI
Link information:
- Title: 'This is coming for everyone': A new kind of AI bot takes over the web
- Published: Jun 12, 2025
- Word count: 777 words
I was at a meeting recently where a co-worker mentioned that she was about to visit LA, but wasn't concerned about the protests as she had "asked ChatGPT and confirmed it was all localised". I just nodded at the time, but inside I was a bit shocked that this highly educated professional was willing to just take the word of an LLM rather than check a primary source.
I have a feeling that this preference for checking the underlying search results (which, I fully acknowledge, are also increasingly AI-generated and can themselves be wrong) is going to be one of those things that divides my generation from those that follow it.
That ended up being the right answer, but yeah, people reply "@grok can you verify this" to a lot of Twitter posts about LA, and the bot just makes stuff up as it goes.
It's weird to me that people use it like a verification machine when the word "hallucination" is so strongly associated with LLMs. And it's not like that's a secret.
When you consider what people used as verification machines prior to LLMs, it becomes less surprising. I'm not trying to be edgy, but people turned to everything from tabloid news to talk-show hosts to magic crystals and throwing bones. I think a lot of people just want easy answers, not truth.
And that's an impulse that we should be hammering out of people at the youngest age possible.
Could not agree more. Intellectual curiosity is a critically important trait that needs to be fostered, both individually and at the societal level. Even when the answer IS easy, there's often value in answering it the hard way. AI threatens that more than ever before.
To be fair, anyone still using Twitter in 2025 probably isn't very smart
They may be smart, but addicted nonetheless.
Also, network effects: while I don't necessarily agree in the sense of "we shouldn't change anything about it", there are still a ton of experts out there whose only online presence is on that one platform.
I decided a while ago that an expert who hasn't moved on from Xitter probably isn't worth listening to. By still hanging out there, they are actively contributing to the problem.
Definitely agree for myself! I'll use nitter/xcancel/whichever mirror they haven't brought down yet for the occasional post/thread that gets sent to me.
… Unfortunately, not all experts see it that way among their own peer groups, either.
In my experience, at least with top-of-the-line models like Gemini 2.5 Pro, these LLMs are really good at sticking to what the source says. But, to be clear, I'm talking about top-of-the-line models, not lightweight ones like the LLM that Google Search uses.
But even so, I wouldn't blindly trust its output. There are some issues that I've noticed with time.
For example, when it comes to memory, they're good at remembering and paraphrasing what the source says, but I've noticed things like:

* Subtle gaps in logic: imagine a situation in a story where it started snowing outside, and character A is inside a room without windows. The text doesn't say whether character A knows that it's snowing; it just says that it started snowing and that character A is inside a closed room.

We, as humans, know that character A can't know it's snowing: they're inside a room without windows, so they couldn't see the snow outside.

If you ask the LLM directly, "Does character A know that it's snowing?", it will likely answer correctly. It will "think" about it, reach the same conclusion as above, and say no.

BUT, if you ask something where the main point isn't whether character A knows it's snowing, it may slip and say or imply that they do. For example, if you ask "How is character A feeling? Write a dialogue line from them.", it may answer: "I can feel the familiar cold from the snow outside."

For my daily use, and as a fan of NotebookLM, I generally trust the LLM on topics that I'm familiar with and that are simple. For complex technical documents or topics I know nothing about, I still use it to get a general idea of what they're about, but I don't trust its conclusions; instead, I use it to help me pinpoint whatever I'm looking for.
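If you want to reproduce that snow probe yourself, here's a minimal sketch, assuming the OpenAI Python client; the story text, model name, and prompts are all illustrative stand-ins, not from any real benchmark:

```python
# Minimal sketch of the direct vs. indirect probe described above.
# Assumes the OpenAI Python client; story, model name, and prompts
# are made-up stand-ins.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STORY = (
    "It started snowing outside. Character A sat reading in a closed "
    "room with no windows."
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[
            {"role": "system", "content": f"Story so far: {STORY}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Direct probe: the model usually reasons this out and answers "no".
print(ask("Does character A know that it is snowing? Answer yes or no, then explain."))

# Indirect probe: the snow isn't the focus, so the answer may leak
# knowledge the character can't have (e.g. mentioning the cold snow).
print(ask("How is character A feeling? Write one line of dialogue from them."))
```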
Yeah, in a sense it reminds me of how people were told not to automatically trust anything mentioned on the Internet. That was correct then and it's correct now, and we're still learning how to deal with it, while now also having to contend with LLMs.
Pair that with how, as you said, many of the search results are now also generated by ML algorithms, and their hallucinations are going to permeate society to some degree. I have no idea how much, or what it'll look like. Propaganda, misinformation, and all those things have occurred throughout history, yes. But not at this industrial scale, where hallucinations, with no bad faith involved at all, occasionally cause the same issues by chance.
...please let me be wrong here.
The real problem, as is often the case, is that there's a financial incentive to produce slop, even with hallucinations. Until that changes, I don't think you're wrong.
I don't use ChatGPT, but most popular LLM interfaces have web grounding now: they'll recognize when a question falls outside their training data's time range, or when they just need more verifiable info, and will launch a web search and summarize the results. It's not a bad way to sift through the internet, as long as you actually verify the links it's sourcing from.
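For what it's worth, the loop behind that grounding is simple enough to sketch. Everything below is hypothetical (the helper names, the heuristic, and the canned search results are all made up), but it's roughly the shape of what these interfaces do internally:

```python
# Rough sketch of a web-grounding loop. search_web() and
# needs_fresh_info() are hypothetical stand-ins; real products
# wire this up internally via search APIs and tool calls.
from dataclasses import dataclass

@dataclass
class SearchResult:
    url: str
    snippet: str

def search_web(query: str) -> list[SearchResult]:
    # Hypothetical: a real implementation would call a search API here.
    return [SearchResult("https://example.com/article", "...relevant snippet...")]

def needs_fresh_info(question: str) -> bool:
    # Hypothetical heuristic; in practice the model itself decides
    # whether to invoke the search tool.
    return any(word in question.lower() for word in ("latest", "today", "2025"))

def answer(question: str) -> str:
    if not needs_fresh_info(question):
        return "(answer generated from model weights alone)"
    # Fetch sources and put them in the prompt so the model summarizes
    # *from them* -- and so the user can click through and verify.
    results = search_web(question)
    sources = "\n".join(f"[{i}] {r.url}: {r.snippet}" for i, r in enumerate(results, 1))
    return f"(summary generated from these sources)\n{sources}"

print(answer("What are the latest developments in the LA protests?"))
```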
Won’t this just lead to more sites putting things behind a paywall or at least login? With Google there was at least some give and take benefit for websites to allow indexing. This seems like a very short sighted model that will just give worse and worse result in the immediate future as there will be no sites for the bots to scrape.
It 100% will and I'm not convinced that this is entirely a bad thing.
I just recently complained here about how much I hate the power and unpredictability that Google has over the viability of other people's projects. Many websites rely on advertising income, and the majority of their traffic is from Google, so if Google unexpectedly changes its algorithm and they lose 60% of their traffic overnight (which happened recently), they either quickly change their monetization model or die. Really bad situation.
But Google got us into this situation in the first place! The business model of "Google will lead users into my website so that I can serve them ads", with content being secondary, is a direct result of that. And it's not a good model: in many ways it sucks for the user. It's what created blogspam, clickbait headlines, articles that hide their most important piece of information in the bottom third of the text, and lots of other shit.
Paywalls reduce the need to rely on some of these shitty approaches and force users to actually think about what quality of content they want to spend money on. As long as the "serving ads to Google users" model works, there are going to be many free, shitty alternatives to paid services; if the move towards AI search pushes most of those out of the market as a side effect, it could well have some positive effects.
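A milder version of that lockdown is already happening in robots.txt files. A sketch of the kind of rules sites are adding (these user-agent tokens are the ones the crawlers themselves document; honoring them is entirely voluntary on the scraper's side):

```
# Opt out of AI scraping/training crawlers, keep ordinary search crawlers.
User-agent: GPTBot          # OpenAI's crawler
Disallow: /

User-agent: CCBot           # Common Crawl (widely used for training data)
Disallow: /

User-agent: Google-Extended # Google's AI-training opt-out token
Disallow: /

User-agent: *
Allow: /
```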
I think it will depend on how well websites are able to defend their rights in court.
There has been a lengthy legal battle between Google and news publishers, resting on the fact that Google wasn't allowed to use the full title, snippet, and images of news articles on the search results page or on Google News without properly remunerating the publishers (https://www.medialaws.eu/google-vs-publishers-french-competition-authority-weighs-in-on-the-online-use-of-press-publications-also-in-ai-systems-with-a-decision-on-the-verge-of-antitrust-and-copyright/). This was based on a French law implementing EU directive n°2019/790, so similar outcomes are likely possible in most EU countries.
While most of this legal battle predates LLMs becoming mainstream, I would find it very surprising if the same court found it acceptable for AIs to scrape and summarize news articles (or any other copyrighted content, really) without proper compensation to the rights holders. The decision only mentioned the use of news articles for training, and found it problematic only in the sense that Google wasn't transparent about that behavior; the court didn't say anything about the behavior itself.
I think the case of news articles is a bit special in the sense that people are likely to look only at the headline and snippet to get enough information, while this is less likely for other types of content.
I expect that AI companies will have to start paying some websites for access for queries made by their paid users. Since they charge a subscription fee, they can afford to pass some of it along.