20 votes

Bot web traffic has overtaken human web traffic

I've been seeing this claim repeated across social media, blogs, and various online communities these days. However, I haven't yet found a discussion that digs into the evidence behind it or provides reliable sources.

Where can I learn more about this topic?

I'm increasingly skeptical of mainstream media coverage and a lot of what I encounter online, so I'm looking for sources that are as rigorous and unbiased as possible. I'd especially appreciate:

  • Academic papers and research studies
  • Industry reports with transparent methodologies
  • Independent analyses that critically examine the claim
  • Any insights from people who work in web infrastructure, cybersecurity, search, analytics, or related fields

If you know of high-quality resources, I'd love to read about them.

13 comments

  1. [3]
    post_below
    Link
    I don't have any links for you, but a clarification: Bot traffic very likely first started outpacing human traffic at least a decade ago. Scrapers, search engine spiders, bots sniffing for common vulnerabilities, social media link info grabbers, even SMS and messaging apps send bot hits. A single bot instance can hit multiple pages a second for tiny fractions of a penny in electricity costs; humans never stood a chance.

    So bot traffic volumes versus human volumes aren't really that interesting; bots were always going to have higher volume. The interesting question is about content that people are actually consuming/interacting with. How much of that is automated?

    Almost definitely bots are already ahead of humans on content creation volume, but traditionally the majority of that is SEO spam that people only interact with accidentally and rarely stay long. Humans are still way ahead when it comes to content that people actually engage with.

    Now with LLMs there's a real possibility that could change, but it hasn't happened yet.

    19 votes
    1. [2]
      ToteRose
      Link Parent
      I assumed as much, my main worry is about the content and/or commenting accounts. Video is still kind of manageable to spot when it's AI generated, but how do you manage to identify if a text is written by a bot or a human?

      I failed to mention in the original post that this came to me from a silly meme I read, where someone pointed out that they're finding less and less AI "slop". At first that sounds positive, because you think there's a decrease in AI content, and then you realise that maybe not being able to find it just means they're getting better at making realistic content. Is there any way to get estimates of how much content nowadays is made by real humans and how much isn't?

      1 vote
      1. post_below
        Link Parent
        At the moment the majority of AI content is pretty easy to spot once you've seen the patterns enough times. And hard to spot until you have.

        Here's a good place to start:
        https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

        I haven't seen any credible estimates of the percentage of content that's AI generated. It can really only be a wild estimate at this point since there aren't any perfect detection tools.

        The volume of LLM posts and articles is really high though. I come across multiple examples every time I visit a big aggregator like HN.

        3 votes
  2. [9]
    DeaconBlue
    Link
    Just as anecdata, my little blog running on a Raspberry Pi on my desk gets between 8,000 and 20,000 hits a day from bots.

    I get maybe half a dozen hits from actual people.

    Depending how you want to count "bot" traffic, I would be alarmed if the scales didn't tip over a decade ago.
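For anyone who wants to reproduce this kind of count on their own server, here is a minimal sketch. It assumes access logs in the common "combined" format, where the user agent is the last quoted field, and it uses a hypothetical list of user-agent substrings to decide what counts as a bot. The sample log lines are made up for illustration. Since many bots spoof browser user agents, a tally like this should be read as a lower bound on bot traffic.

```python
import re
from collections import Counter

# Hypothetical marker substrings (an assumption for illustration).
# Bots that pretend to be browsers will be miscounted as human.
BOT_MARKERS = ("bot", "crawler", "spider", "curl", "python-requests")

# In the combined log format the user agent is the last quoted field.
USER_AGENT_PATTERN = re.compile(r'"([^"]*)"\s*$')


def classify(line: str) -> str:
    """Return 'bot', 'human', or 'unknown' for one access-log line."""
    match = USER_AGENT_PATTERN.search(line)
    if match is None:
        return "unknown"
    user_agent = match.group(1).lower()
    # Assumption: an empty or "-" user agent is treated as automated.
    if user_agent in ("", "-"):
        return "bot"
    if any(marker in user_agent for marker in BOT_MARKERS):
        return "bot"
    return "human"


def tally(lines) -> Counter:
    """Count how many log lines fall into each class."""
    return Counter(classify(line) for line in lines)


# Made-up sample lines using documentation-reserved IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2025:13:55:36 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Oct/2025:13:55:40 +0000] "GET /post/1 HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"',
    '192.0.2.9 - - [10/Oct/2025:13:55:41 +0000] "GET /wp-login.php HTTP/1.1" 404 0 '
    '"-" "python-requests/2.31.0"',
]

if __name__ == "__main__":
    for label, count in tally(SAMPLE_LOG).most_common():
        print(f"{label}: {count}")
```

How you define "bot" drives the result, which is the point made above: a stricter or looser marker list will move the numbers a lot.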

    13 votes
    1. [8]
      post_below
      Link Parent
      To add to this, if you put a page/site online today that isn't linked anywhere on the internet (thus effectively invisible) and you just put a security certificate on it, within days you'll be getting hundreds of bot hits per day. Once it's indexed, multiply that by 10, double it if there's a contact form or comment boxes, triple it if it's running software that's a popular target (like WordPress), and multiply by 10 again if it becomes a reasonably popular social site with user-generated content. [1]

      [1] Napkin math based on real-world sites.
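To make the napkin math concrete, here is a small sketch that simply chains the multipliers from the comment above. The baseline of 100 hits per day is an assumption standing in for "hundreds", and the function and parameter names are hypothetical. This is rough estimation, not a measurement.

```python
def estimate_daily_bot_hits(
    base_hits: int = 100,            # assumed stand-in for "hundreds" per day
    indexed: bool = False,           # site has been indexed
    has_forms: bool = False,         # contact form or comment boxes
    popular_software: bool = False,  # runs a commonly targeted platform
    popular_social: bool = False,    # popular site with user-generated content
) -> int:
    """Chain the rough multipliers from the comment into one estimate."""
    hits = base_hits
    if indexed:
        hits *= 10
    if has_forms:
        hits *= 2
    if popular_software:
        hits *= 3
    if popular_social:
        hits *= 10
    return hits


if __name__ == "__main__":
    # Unlinked site with only a certificate: the baseline.
    print(estimate_daily_bot_hits())
    # Every multiplier applied: 100 * 10 * 2 * 3 * 10.
    print(estimate_daily_bot_hits(
        indexed=True, has_forms=True, popular_software=True, popular_social=True
    ))
```

With every multiplier applied, the assumed baseline of 100 becomes 60,000 hits per day.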

      7 votes
      1. [2]
        DeaconBlue
        Link Parent
        It absolutely does not have to be running WordPress. My site isn't running WordPress and bots sure do try hitting all of the standard WordPress endpoints.
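A quick way to see this in your own logs is to count requests for well-known WordPress paths on a site that doesn't run WordPress. The sketch below assumes you have already extracted the request targets from your logs. The path list is a small, non-exhaustive sample, and the function names and sample requests are hypothetical.

```python
from urllib.parse import urlsplit

# A few well-known WordPress paths; not an exhaustive list.
WORDPRESS_MARKERS = (
    "/wp-login.php",
    "/xmlrpc.php",
    "/wp-admin",
    "/wp-content",
    "/wp-includes",
)


def is_wordpress_probe(request_target: str) -> bool:
    """True if the request path contains a known WordPress path."""
    path = urlsplit(request_target).path.lower()
    return any(marker in path for marker in WORDPRESS_MARKERS)


def count_wordpress_probes(request_targets) -> int:
    """Count how many request targets look like WordPress probes."""
    return sum(1 for target in request_targets if is_wordpress_probe(target))


# Made-up request targets for illustration.
SAMPLE_REQUESTS = [
    "/",
    "/wp-login.php",
    "/post/hello",
    "/xmlrpc.php",
    "/blog/wp-admin/setup-config.php?step=1",
]

if __name__ == "__main__":
    probes = count_wordpress_probes(SAMPLE_REQUESTS)
    print(f"{probes} of {len(SAMPLE_REQUESTS)} requests look like WordPress probes")
```

On a site that doesn't run WordPress, every hit this finds is almost certainly automated scanning.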

        5 votes
        1. post_below
          Link Parent
          True, then once they find a WP endpoint and add the site to a list that gets passed around, they start coming more often.

          4 votes
      2. [5]
        lostwax
        Link Parent
        This isn't true in my very limited experience. I run a couple of services like Nextcloud for my own use, with two logged-in users and perhaps 50 people who have been emailed links to documents on the server. It's been up for years on the public net and I see very little in the logs that isn't me. That wasn't true until I changed the SSH port to something that isn't port 22, but the Nextcloud login page seems to get left alone.

        1 vote
        1. post_below
          Link Parent
          That's great, may it never change! If the login is a publicly accessible page: to be safe, don't get a public certificate for it or visit it in Chrome. Or Edge.

          1 vote
        2. [3]
          TaylorSwiftsPickles
          Link Parent
          > That wasn't true until I changed the SSH port to something that isn't port 22

          It's sort of "security by obscurity" that way. Which isn't exactly a solid way to do security, but it does weed out a bunch of the dumb bots only scanning for very low-hanging fruit. These tend to have pre-scripted subdomains and ports they'll check, and the more your own setup differs from those, the fewer hits you'll get until they eventually figure it out. E.g. if you use a wildcard certificate they won't know which subdomains you actually use, but then again, wildcard certs can cause their own problems when misconfigured. Of course, this will do nothing against a more advanced or more targeted attack, but that's also not something you have to worry a lot about when it comes to projects like yours.

          1 vote
          1. [2]
            lostwax
            Link Parent
            Changing the port stopped my logs filling up with rubbish, which was a large proportion of what I was aiming at.

            I'm well aware a properly targeted attack would likely get in, but I'm not sure what anyone would do with a lot of cold, backed-up construction documentation. My threat model largely consists of why would anyone bother. I make an effort to keep things up to date enough to deal with those that aren't really bothering.

            1. TaylorSwiftsPickles
              Link Parent
              > My threat model largely consists of why would anyone bother

              Yeah, the chances are fairly low, albeit non-zero, as I stated above.

              Myself, I've seen my fair share of attacks on "relatively unimportant websites", but those were generally either geopolitically motivated attacks by hacktivists or failed attempts to perform supply chain attacks against another entity. But even then, they used known vulnerabilities in unpatched software, every time.

  3. overbyte
    Link
    The claims likely come from a combination of Cloudflare's live dashboard on bot traffic, from when it tipped over, and Imperva's Bad Bot Report.

    Cloudflare also has a glossary explaining the terms on the page. Given their position on the internet as the front door of many online services, they have unique insight into this metric (which is also one of the reasons they run 1.1.1.1, which doubles as a large-scale research endpoint).

    So if it were me, I'd start with companies like Cloudflare that bear the brunt of the problem and mitigate it head-on. Other heavyweights in the field have similar press releases and the usual reports seguing into selling their service, but it is a needed service. As a starting mix:

    • The OWASP Automated Threats to Web Applications project.
    • F5 Networks' case study on an airline fighting fare scrapers.
    • Akamai, the king of live events, which has managed an equally massive CDN since before Cloudflare came along.
    • Imperva, a popular security product deployed by many enterprises like banks. Its report, combined with the Cloudflare dashboard, is where most of the soundbites originate. This is the 2025 version if you don't want to put up with the sales form.
    • Fastly, another popular CDN.
    3 votes