14 votes

Content Independence Day: No AI crawl without compensation!

4 comments

  1. [3]
    skybrian

    From the blog post:

    Google still copies creators’ content, but over the last 10 years, because of the changes to the UI of “search” it's gotten almost 10 times more difficult for a content creator to get the same volume of traffic. That means it's 10 times more difficult to generate value from ads, subscriptions, or the ego of knowing someone cares about what you created.

    And that's the good news. It’s even worse with today’s AI tools. With OpenAI, it's 750 times more difficult to get traffic than it was with the Google of old. With Anthropic, it's 30,000 times more difficult. The reason is simple: increasingly we aren't consuming originals, we're consuming derivatives.

    How do they measure this? Apparently there’s a new metric on Cloudflare’s dashboard. From the blog post explaining that:

    Visitors to Cloudflare Radar can now review how often a given AI model sends traffic to a site relative to how often it crawls that site. We are sharing this analysis with a broad audience so that site owners can have better information to help them make decisions about which AI bots to allow or block and so that users can understand how AI usage in aggregate impacts Internet traffic.

    As HTML pages are arguably the most valuable content for these crawlers, the ratios displayed are calculated by dividing the total number of requests from relevant user agents associated with a given search or AI platform where the response was of Content-type: text/html by the total number of requests for HTML content where the Referer header contained a hostname associated with a given search or AI platform.
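The ratio described in that quote can be sketched in a few lines. This is a hypothetical illustration of the calculation, not Cloudflare's implementation: the log schema, user-agent substrings, and referrer hostnames are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Illustrative identifiers for one AI platform (assumed, not Cloudflare's lists).
CRAWLER_UA_SUBSTRINGS = ("GPTBot", "OAI-SearchBot")
PLATFORM_REFERER_HOSTS = {"chatgpt.com", "chat.openai.com"}

def crawl_to_refer_ratio(log_records):
    """Divide crawler requests for HTML by HTML requests referred
    from the platform, mirroring the blog post's description.

    log_records: iterable of dicts with 'user_agent', 'referer',
    and 'response_content_type' keys (an assumed log schema).
    """
    crawls = refers = 0
    for rec in log_records:
        # Per the quote, only responses of Content-type: text/html count.
        if not rec.get("response_content_type", "").startswith("text/html"):
            continue
        ua = rec.get("user_agent", "")
        if any(s in ua for s in CRAWLER_UA_SUBSTRINGS):
            crawls += 1
        ref_host = urlparse(rec.get("referer", "")).hostname
        if ref_host in PLATFORM_REFERER_HOSTS:
            refers += 1
    return crawls / refers if refers else float("inf")
```

A ratio well above 1 would mean the platform crawls a site far more often than it sends visitors to it, which is the imbalance the dashboard is meant to surface.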

    Back to the first blog post:

    That changes today, July 1, what we’re calling Content Independence Day. Cloudflare, along with a majority of the world's leading publishers and AI companies, is changing the default to block AI crawlers unless they pay creators for their content. That content is the fuel that powers AI engines, and so it's only fair that content creators are compensated directly for it.

    It’s unclear which default they changed. Maybe it’s only for new Cloudflare customers? Looking at this page, it seems to be opt-in?

    There are more trends in this blog post:

    Googlebot’s share rose from 30% to 50%, supporting search indexing, but potentially also having AI-related purposes (such as new AI Overviews in Google Search). And GoogleOther (the crawler introduced in 2023) also increased in crawling traffic, 14%. Other Google crawlers not in the top 20, like Googlebot-News, also grew significantly (+71% in requests). There’s a clear trend of growth in these Google-related web crawlers at a time when the company is investing heavily in combining AI with search.

    Also in the search category, Bingbot’s share (from Microsoft) declined slightly from 10% to 8.7% (-1.3 pp), though its raw requests still grew modestly by 2%.

    These trends show that web crawling is increasingly dominated by bots from Google and OpenAI, reflecting clear shifts over the course of a year. Google also appears to be adapting how it collects data to support both traditional search and AI-driven features.

    9 votes
    1. [2]
      balooga

      That new metric link is interesting... from what I can glean from its explanation of how they identify AI traffic, they're mainly looking at user agent strings and Referer headers. Both of which are completely optional and arbitrary. Those are about as meaningful as robots.txt files or Do Not Track headers, i.e., well-intentioned but unenforceable conventions.
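To make the "completely optional and arbitrary" point concrete: any HTTP client can claim whatever User-Agent and Referer it likes. A minimal sketch (the URL and header values are made up for illustration):

```python
import urllib.request

# A client that is actually a bot can trivially present itself as a
# browser arriving from a search click-through.
req = urllib.request.Request(
    "https://example.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # claims to be a browser
        "Referer": "https://www.google.com/",  # claims a search referral
    },
)
# urllib.request.urlopen(req) would send these headers verbatim;
# the server has no way to verify either claim.
```

So any classification built on these headers only describes traffic that chooses to identify itself honestly.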

      There's just too much money at stake for me to imagine any other outcome than an arms race where AI traffic becomes increasingly indistinguishable from real human activity.

      9 votes
      1. skybrian

        I think that's too focused on a worst-case scenario. The incentives aren't all in one direction.

        Why is Cloudflare investing all this work in blocking AI crawlers? Because they're paid by website owners. They sell services like this to their customers, and those services are more appealing if they work better. That's an incentive too. I think they're in a position to improve their bot detection if they need to?

        Also, Google has an incentive to make sure real users who follow links from its sites get a good experience, which means it will want to vouch for them. That's why it provides captchas and has an incentive to keep improving them.

        (I suspect that a strong signal for their captchas is "is this person logged into a Google account that doesn't seem to be a bot" and that's why some people have more trouble with captchas than others.)

        3 votes