35 votes

Perplexity AI is using stealth, undeclared crawlers to evade website no-crawl directives

19 comments

  1. [16]
    Wes
    Link

    This isn't necessarily new information, as it's been documented before, and is even something we've discussed on Tildes.

    Perplexity respects robots.txt specifically for running their crawler, and not when following user directives. This is the correct thing to do, as robots.txt does not apply to human requests. When you visit a website, your browser isn't checking robots.txt either.

    We often consider browsers and user agents to be the same thing, but they're not. The idea of a user agent goes back to the early days of the web. A user agent is a piece of software that acts on the user's behalf, retrieves information, interprets it, and displays it to the user. A browser is a user agent, but so is an AI acting on a request.

    This is not the same as a crawler that is used to train a model. Crawlers are automated tools that should follow robots.txt. They contribute to a significant amount of online traffic, and websites have every right to block them. If a crawler is ignoring robots.txt, that's a problem.
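
    As a concrete illustration of the distinction, here's a minimal sketch of what "following robots.txt" looks like for an automated crawler, using only Python's standard library; the bot name and URLs are placeholders rather than any vendor's real values.

    # Minimal sketch of a polite crawler checking robots.txt before fetching.
    # "ExampleBot" and the URLs are placeholders, not real crawler identities.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/hidden-page"
    if rp.can_fetch("ExampleBot", url):
        print("robots.txt permits crawling this URL")
    else:
        print("robots.txt disallows it; a well-behaved crawler skips the page")

    A browser, or an agent fetching a single page because a human asked for it, typically never performs this check at all, which is exactly the distinction being drawn here.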

    Cloudflare sells a service to stop these crawlers, and they're incorrectly extending it to requests that users are directly invoking. In the article, the user specifically asks Perplexity to fetch the "hidden page", and it's dutifully doing so. Again, this is the correct action. It doesn't matter if Perplexity is using unpublished IPs here, because they're not acting as an autonomous crawler. It's the same as a user directly fetching the page.

    At the moment, there is a bad actor that is strangling web servers with constant crawling for AI training. They are hitting way too often, using residential IPs, and ignoring robots.txt. We don't know who it is. I asked the lead Anubis dev if they had any guesses, but they didn't. It's seemingly not one of the big players though, and that's not what Perplexity is doing here.

    20 votes
    1. [12]
      GoatOnPony
      (edited )
      Link Parent

      Edit: I just read Perplexity's response, which indicates that this is tool use! I'll leave the rest of my comment in place for reference purposes, but the first paragraph is inaccurate. I obviously put too much trust in Cloudflare having ruled that possibility out.

      I expect that if the answer were as simple as 'Perplexity makes a request to the website as part of some tool use in the immediate response to a query about the site', then Cloudflare would have said as much - it would be both obvious to them (from the timing and number of requests) and easy for Perplexity to offer as a defense. Cloudflare claims that blocking the undisclosed crawler caused a reduction in the quality of Perplexity's answers about their honeypot sites, so that seems like a causal link I'd not just hand-wave away. They estimate that the undisclosed, badly behaved crawler is making about 1/10 as many requests as the well-behaved crawler (3-6 million queries per day) - that seems like enough traffic to complain about too. While Cloudflare is attempting to sell a product, I also think they've presented a reasonable theory that Perplexity is indeed running a badly behaved crawler and not just doing normal user agent things.

      Separately, I don't personally think that all user agents are fine/should have equal access to sites. I think websites should be respected in their choice to filter and block AI-based user agents, regardless of whether the requests come from crawlers or tool use. Given that, using robots.txt as a weak signpost seems reasonable, even if the RFC only talks about crawlers (it does reference user agents as the way crawlers declare themselves, though, FWIW). So even if this does turn out to be tool use in response to a user query, I think Perplexity should still respect robots.txt, given that it's currently the easiest way for website operators to express intent about whether AI access is acceptable or not. If a new specification comes along which supplants robots.txt for the purpose of informing user agents about acceptable behavior, then perhaps Perplexity can ignore robots.txt and only look at the new spec during tool use, but in the meantime, AFAIK, robots.txt is the best operators have.

      8 votes
      1. [11]
        Wes
        Link Parent

        I feel Cloudflare is intimating that, but their examples don't really show it. Their smoking gun is asking the agent to fetch a specific page, and the agent complying. I agree the traffic numbers seem odd, but my biggest issue is that their primary argument conflates two different issues.

        To be clear, if Perplexity is using any form of cloaking for automated crawling, then I'd absolutely agree that they're being poor netizens. Scraping is a legal activity, but it shouldn't be done deceptively or at large scales that can impact hosting.

        Regarding websites blocking specific client agents, I'm not a big fan of that idea. I feel the open web thrives best when it's left accessible. Even when a website is designed for a specific experience (e.g. using a new API that Firefox hasn't implemented yet), I'd still disagree with outright blocking Firefox users. You never know when Fx will get that capability, and even if it can't display it fully, it might still work in a lesser form.

        AI agents can't currently do everything that users can, but they're progressing rapidly. It's not hard to imagine a future where we have agents running tasks for us like filling forms, scheduling appointments, and updating our calendars automatically. Even if you're not into the idea of that yourself, try to imagine how useful it will be for people with limited vision or dexterity. For many, simply interacting with a calendar application can be a difficult task.

        Honestly, I fear the web is already too siloed with services like Discord and social media dominating things. I don't want to see it locked down further. I'm a big fan of ideas like the semantic web, and of open data feeds like RSS.

        Regarding successor specs, you may be interested in the Client Hints API for declaring client identity. The user-agent string has definitely been mangled over the years (largely in response to server-side UA sniffing), so Client Hints is meant as a clean, ground-up replacement for it. The CF post also mentioned a new form of request signing that I was unfamiliar with until now.
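
        If you're curious what that looks like in practice, here's a rough sketch of the server side of Client Hints using only Python's standard library; the header names come from the spec, but the handler itself is just an illustrative stand-in, not anyone's real setup.

        # Sketch of a server opting in to User-Agent Client Hints.
        # Supporting browsers send low-entropy hints like Sec-CH-UA by default;
        # higher-entropy hints must be requested via the Accept-CH response header.
        from http.server import BaseHTTPRequestHandler, HTTPServer

        class ClientHintsHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                ua_hint = self.headers.get("Sec-CH-UA", "(no client hints sent)")
                self.send_response(200)
                # Ask the client to include these hints on subsequent requests.
                self.send_header("Accept-CH", "Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Mobile")
                self.send_header("Content-Type", "text/plain; charset=utf-8")
                self.end_headers()
                self.wfile.write(f"Sec-CH-UA hint received: {ua_hint}\n".encode())

        if __name__ == "__main__":
            HTTPServer(("localhost", 8000), ClientHintsHandler).serve_forever()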

        5 votes
        1. [3]
          GoatOnPony
          Link Parent

          I'll agree that the blog post could use some more evidence that tool use is not happening, but I don't think the article is conflating? My read is that they are pretty sure crawling is happening as part of an indexing stage to feed data into the model - indexing that happens regardless of whether any particular user has asked about the site. If it didn't, that would be extremely obvious: they can check whether the badly behaved crawler hit the site before they issued any queries to the LLM, or whether requests continued after the LLM query. The lack of any evidence that they checked is concerning, but I'd still be quite surprised to hear that they didn't look into that possibility.

          I too like the open web and semantic data, but I think user agents need to strike a fine balance between respecting the autonomy of the user and the relationship that the website wants to have with that user. That relationship could be one of artistic intent, whimsy, user access controls, money (ads), etc. Most forms of user autonomy I have no problem with (accessibility, script blocking, JavaScript on/off, esoteric browsers, reader mode, etc), and in general I come down on the side of more user autonomy where the two sides come into conflict.

          However, I think AI agents occupy a very different point on the autonomy vs relationship spectrum. They are currently heading down the path of their own form of siloing and large corporate interest - they want to intermediate all interactions between users and the current web in a way which obliterates any relationship a website can have with its users. I don't think AI agents or the companies creating them are neutral actors like browsers (which at least have a historical status quo, inertia, and arguably lower barriers to entry), and that non-neutral power of intermediation is terrifying. I think it will be very destabilizing to the web ecosystem (particularly when it comes to monetization) and disincentivizes the creation of small web content (I want people to read and interact with me, not an AI summary of what I made). AI agents may end up forcing more content into silos than before, so I support websites who don't want to partake in this particular experiment.

          TLDR, I like the semantic web, but I think it should be opt-in rather than effectively forced on the web by big tech in ways which could ultimately remove user autonomy.

          4 votes
          1. [2]
            Wes
            Link Parent

            Fair and well-measured points. Thanks for sharing your view.

            I can understand your concern about creating new silos, and that's definitely not a future I want either. I'm a big proponent of running local models over renting them from cloud-hosting companies, and I hope that work continues in that space to make them easier and more accessible for end users.

            2 votes
            1. GoatOnPony
              Link Parent

              Well, it turns out your analysis was correct! Perplexity's blog post response claims it was indeed tool use initiated in response to the user request. I've left an edit on my original reply as well. I think Cloudflare (and I, for defending their analysis) have some egg on our faces :)

              I think a world where the majority of AI agents are local models (and locally trained, or at least tuned) would be a significant improvement. It would probably still have some not-great consequences, and the point about relationship management still applies, so I'm not sure where my personal opinion would ultimately fall, but I probably wouldn't advocate against them the way I do against centrally controlled models.

              3 votes
        2. [7]
          zod000
          Link Parent

          Based on my traffic stats, their crawlers are not actually following robots.txt properly, as I have seen them hitting things that they should not, and that I am positive no user could meaningfully direct them to. I had to start making explicit firewall rules to make them behave. They are not unique in this; I don't believe a single AI-feeding crawler follows robots.txt anymore. Google's documentation pretty clearly states that they no longer do, and even states that, I kid you not, if you don't want your site to be crawled then it should not be accessible from the internet. I ended up going as aggressive as I could on some proactive rules that wouldn't completely kill our SEO.
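
          For anyone curious what that kind of proactive blocking can look like at the application layer (as opposed to actual firewall rules), here's a rough sketch; the user-agent substrings are examples of publicly documented bot names, not an exhaustive or authoritative blocklist.

          # Sketch of a WSGI app that refuses requests from self-identified AI crawlers.
          # The substrings are examples of publicly documented bot names; a real
          # blocklist would be broader and kept up to date.
          from wsgiref.simple_server import make_server

          BLOCKED_UA_SUBSTRINGS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider")

          def app(environ, start_response):
              user_agent = environ.get("HTTP_USER_AGENT", "")
              if any(bot in user_agent for bot in BLOCKED_UA_SUBSTRINGS):
                  start_response("403 Forbidden", [("Content-Type", "text/plain")])
                  return [b"Automated crawling is not permitted.\n"]
              start_response("200 OK", [("Content-Type", "text/plain")])
              return [b"Welcome.\n"]

          if __name__ == "__main__":
              make_server("localhost", 8001, app).serve_forever()

          Of course, this only catches crawlers that identify themselves honestly, which is exactly why stealth crawlers hiding behind generic browser user agents are such a problem.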

          1 vote
          1. [6]
            skybrian
            Link Parent

            That's news to me. Where do they say this in Google's documentation?

            1 vote
            1. [5]
              zod000
              Link Parent

              I don't recall exactly which documentation page I was reading, but this one covers a lot of the same ground: https://developers.google.com/search/docs/crawling-indexing/robots/intro

              It even has a big red warning section that starts with "Don't use a robots.txt file as a means to hide your web pages (including PDFs and other text-based formats supported by Google) from Google search results."

              1 vote
              1. [4]
                skybrian
                Link Parent

                Sure, but the second part is:

                If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.

                So, they’re not saying that their crawler ignores robots.txt. They’re saying that it’s not a reliable way to keep URLs out of Google Search results, because they do link to URLs that they haven’t crawled.

                That’s not new. Google has worked that way as long as I can remember. Paying attention to incoming links is how PageRank works. If many web pages link to the same URL, then that’s a clue that it’s a useful search result. (Though of course, it can be gamed.)

                This is a warning to people who rely on security by obscurity. Sharing “private” links with no access control other than a random number in the URL is something everyone does. It’s quite practical for sending a link to friends and family. But if that URL gets published somewhere on the web, then all bets are off.

                1 vote
                1. [3]
                  zod000
                  Link Parent

                  Well, the issue is that many of these "private" links are only available in the first place because of crawlers that did ignore robots.txt. The search engine crawlers feed off of each other, and many of them use fuzzy logic to try to "guess" possible pages (I have hundreds of thousands of examples in my logs; Baidu is a major offender here, as are some Tencent-owned datacenter IP ranges that use bogus random user agents).

                  Google's advice for removing content from search results has three suggestions in order of preference:

                  1. Delete the content
                  2. Password-protect the content
                  3. Add a noindex meta tag

                  What I find funny about #3 is that they must crawl the page in order to see whether that tag exists, so essentially they 100% DO crawl pages that are linked, despite claiming elsewhere that they do not.

                  1. [2]
                    skybrian
                    Link Parent

                    Yes, they recommend a noindex tag, but you might have to modify robots.txt for their crawler to see it? I don't think we can tell from reading this document.

                    1 vote
                    1. Wes
                      Link Parent

                      Yes, you need to allow crawling of a page for Google to see a noindex tag. Otherwise they can only infer information from outside sources. The noindex and nofollow directives (or robots.txt) serve different purposes.
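
                      To make that concrete, here's a minimal sketch of why the page must be crawlable for noindex to work: the crawler has to fetch and parse the page to see the tag at all. The URL is just a placeholder (Google also honors an equivalent X-Robots-Tag HTTP header).

                      # Sketch: a crawler can only see <meta name="robots" content="noindex">
                      # by fetching and parsing the page, so blocking the crawl also hides
                      # the directive. The URL is a placeholder.
                      from html.parser import HTMLParser
                      from urllib.request import urlopen

                      class RobotsMetaParser(HTMLParser):
                          def __init__(self):
                              super().__init__()
                              self.noindex = False

                          def handle_starttag(self, tag, attrs):
                              attrs = dict(attrs)
                              if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                                  if "noindex" in (attrs.get("content") or "").lower():
                                      self.noindex = True

                      with urlopen("https://example.com/some-page") as resp:
                          parser = RobotsMetaParser()
                          parser.feed(resp.read().decode("utf-8", errors="replace"))

                      print("noindex requested" if parser.noindex else "indexing allowed")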

                      Their Search Console tool, for example, lists both indexing and crawling as separate flags that the webmaster can set. Here's an example page I just pulled up:

                      Last crawl: 28 Jul 2025, 01:29:42
                      Crawled as: Googlebot smartphone
                      Crawl allowed?: Yes
                      Page fetch: Successful
                      Indexing allowed?: Yes
                      
                      1 vote
    2. [3]
      ShroudedScribe
      Link Parent

      Perplexity respects robots.txt specifically for running their crawler, and not when following user directives. This is the correct thing to do, as robots.txt does not apply to human requests. When you visit a website, your browser isn't checking robots.txt either.

      I disagree here. Here are a couple of scenarios that explain my perspective in more depth.

      Scenario 1: A human accesses a website with their browser. This is legitimate behavior, not just because it produces an output that aligns with the website owner's intent, but also because it is for their personal use.

      Scenario 2: A human runs a script or command (something like wget) on a web page. This is also legitimate behavior, because it is human-directed, and it is storing the page for the human's future use. From a technical perspective, it uses a similar amount of resources to visiting the website in a browser (or less, depending on whether you're loading images, etc).

      Scenario 3: A human uses a proxy website to load a website in their browser. This too is legitimate behavior, because similar to Scenario 1, it's for the human's own use.

      Scenario 4: A human asks an externally hosted AI questions about a website. The AI/LLM then scrapes not just one, but multiple pages from the website. It then stores the information from the website to respond to similar questions from different users, while also maintaining the capability of scraping again. This is not legitimate behavior: it is not just for the current user's benefit, it eliminates visits to the website itself, it can misrepresent the content on the site, and it intentionally discourages human visitors from navigating to the site.

      There have been a few articles on this, but the most succinct way I've seen it explained is that Google once aimed to be the website with the shortest on-page time. The intent was to provide search results and move the user away from Google and into one of the resulting sites. Now, with AI summaries, the intent has changed into keeping users on Google as long as possible, and reducing the number of visits to other sites.

      I like how @GoatOnPony said it:

      I want people to read and interact with me, not an AI summary of what I made

      3 votes
      1. [2]
        papasquat
        Link Parent

        Now, with AI summaries, the intent has changed into keeping users on Google as long as possible, and reducing the number of visits to other sites.

        I think this yet again illustrates a fundamental flaw with how content on the internet is monetized. The ad-based funding model has been stretched so far past its breaking point that we're no longer just in danger of obtrusive ads, clickbait articles, and deceptive SEO techniques; we're in danger of the whole house of cards falling down.

        We've needed, and still need, a frictionless, interoperable way for website owners to ask for compensation for their work if high-quality content has any chance of continuing to exist on the internet. It's one of the reasons why proprietary walled gardens like YouTube, Instagram, and Twitch have become where most content that people seek out lives. Creators use these platforms because they're the only way to make money creating content on the internet. Hosting their own website is completely infeasible, not just because of the technical considerations needed to scale, but because of the lack of easy monetization and discoverability.

        We're now facing the same issue with text based content as well. If blog writers, reporters, and so on had a secure way to get micropayments each time a request was made against their site, not only would they be totally fine with their content being scraped by LLMs, the amount of LLMs scraping their sites would drastically decrease.

        Web3 was supposed to solve these issues, but predictably that entire effort has been virtually all grifters and sleazeballs trying to scam people out of money.

        I don't think this issue gets solved without strong international governance putting something into place, but I have very little faith there either.

        I think we'll instead start seeing the death of the independent web as we know it, and end up with something similar to Twitter or YouTube but for all types of long-form blogging/news/whatever, since that's going to be the only way that people or groups of people can make a living posting things on the internet.

        4 votes
        1. ShroudedScribe
          Link Parent

          If blog writers, reporters, and so on had a secure way to get micropayments each time a request was made against their site, not only would they be totally fine with their content being scraped by LLMs, the amount of LLMs scraping their sites would drastically decrease.

          I made a comment about giving micropayments to content creators a while ago. I wasn't thinking about being paid per crawl/scrape, but that would be interesting.

          1 vote
  2. ShroudedScribe
    Link

    I'm proud of Cloudflare for naming and shaming.

    When website and content owners say they don't want their content included in AI responses, that needs to be respected.

    9 votes
  3. [2]
    skybrian
    Link

    For comparison, here's what a different AI crawler is doing:

    ChatGPT agent’s user-agent:

    At first glance it looks like ChatGPT is being dishonest here by not including its bot identity in the user-agent header. I thought for a moment it might be reflecting my own user-agent, but I’m using Firefox on macOS and it identified itself as Chrome.

    Then I spotted this header:

    Signature-Agent: "https://chatgpt.com"

    ...

    These [signature headers] turn out to come from a relatively new web standard: RFC 9421 “HTTP Message Signatures”, published February 2024.

    The purpose of HTTP Message Signatures is to allow clients to include signed data about their request in a way that cannot be tampered with by intermediaries.

    So, crawlers can sign all their requests and then other crawlers can't really spoof them. Maybe that will become more common? And then websites could block any crawlers that don't sign their requests, and choose which ones they want to allow.
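
    As a very rough sketch of the idea (heavily simplified compared to RFC 9421's actual serialization rules; the key, key id, and header values are placeholders, and the third-party cryptography package is assumed):

    # Simplified sketch of RFC 9421-style message signing: build a canonical
    # "signature base" from covered request components, sign it, and send the
    # signature in headers that the origin can verify against a published key.
    import base64
    import time
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    key = Ed25519PrivateKey.generate()  # in practice, a long-lived published key

    method, authority, path = "GET", "example.com", "/article"
    params = f'("@method" "@authority" "@path");created={int(time.time())};keyid="demo-key"'

    signature_base = (
        f'"@method": {method}\n'
        f'"@authority": {authority}\n'
        f'"@path": {path}\n'
        f'"@signature-params": {params}'
    )

    signature = base64.b64encode(key.sign(signature_base.encode())).decode()

    request_headers = {
        "Signature-Input": f"sig1={params}",
        "Signature": f"sig1=:{signature}:",
        # The Signature-Agent header quoted above comes from newer bot-auth
        # proposals layered on top of RFC 9421; the value here is a placeholder.
        "Signature-Agent": '"https://example-crawler.test"',
    }
    print(request_headers)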

    They could still pretend to be regular users, though.

    9 votes
    1. ShroudedScribe
      Link Parent

      They could still pretend to be regular users, though.

      That is the issue. There have been a lot of posts about AI crawlers overloading websites. I only anticipate that getting worse as more and more services compete to be the best. And what better way to be the best than to scrape content your competitors won't?

      7 votes