33 votes

Nepenthes: a tarpit intended to catch AI web crawlers

23 comments

  1. [4]
    post_below

    "There is not currently a way to differentiate between web crawlers that are indexing sites for search purposes, vs crawlers that are training AI models."

    Huh? All the mainstream crawlers identify themselves pretty consistently and respect robots.txt. At least do the bare minimum and identify the main ones! Better yet, only punish crawlers that don't respect robots.txt.

    10 votes
    1. [3]
      GOTO10

      "identify the main ones"

      How? Anyone can claim to be the google crawler.

      1. skybrian

        Google makes it pretty easy to figure out if they're faking, though, because it publishes the IP addresses it crawls from.
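A minimal sketch of the check described above, assuming the published list can be loaded as a set of CIDR prefixes. The dictionary shape, the field names, and the function names are assumptions for illustration, and the prefixes are documentation-only placeholders rather than real Google ranges.

```python
import ipaddress

# Hypothetical sample in the shape of a published crawler IP list.
# These prefixes are reserved documentation ranges, NOT real Google ranges.
PUBLISHED_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}


def load_networks(published):
    """Turn the published prefix entries into ip_network objects."""
    networks = []
    for entry in published["prefixes"]:
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_from_published_range(client_ip, networks):
    """True if the client address falls inside any published prefix."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


def claims_googlebot_but_unverified(user_agent, client_ip, networks):
    """Flag requests whose User-Agent says Googlebot but whose IP is not listed."""
    return "Googlebot" in user_agent and not is_from_published_range(client_ip, networks)
```

In practice the list would be refreshed periodically from wherever the crawler operator publishes it, since the ranges can change.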

        11 votes
      2. post_below

        Like I said, you really just need a robots.txt that disallows everything. Legit bots would pass by.

        For general info purposes though, there are good ways to tell if a bot claiming to be from a major search engine is an imposter. Some can't handle javascript, or handle it differently than the real crawlers. Some handle redirects differently. Some have different request rate limiting (or none) than you usually see from the big bots. Some don't respond to crawl-delay. Others don't respect robots.txt or implement it poorly (for example, if a bot requests a file in a disallowed honeypot subdirectory it's not googlebot). So the claim the author(s) made isn't entirely accurate.
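The honeypot idea above can be sketched roughly as follows. The `/trap/` directory, the sample log entries, and the function names are all hypothetical; the sketch only assumes that the site's robots.txt disallows the trap directory and that request logs expose the client IP, User-Agent, and path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is allowed except the honeypot directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def find_rule_breakers(requests, parser):
    """Return the set of client IPs that fetched a path robots.txt disallows."""
    offenders = set()
    for client_ip, user_agent, path in requests:
        if not parser.can_fetch(user_agent, path):
            offenders.add(client_ip)
    return offenders


# Hypothetical log entries: (client IP, User-Agent, requested path).
SAMPLE_LOG = [
    ("203.0.113.9", "Googlebot/2.1", "/trap/page-1"),
    ("198.51.100.7", "Googlebot/2.1", "/articles/hello"),
]
```

Any client that shows up in `find_rule_breakers(SAMPLE_LOG, build_parser(ROBOTS_TXT))` requested a disallowed path, so whatever its User-Agent claims, it is not behaving like a crawler that respects robots.txt.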

        5 votes
  2. [3]
    Jordan117

    LLMs aren't the only use cases for web crawlers -- they're also essential for standard search engines, as well as projects like the Internet Archive. A more responsible person would target this specifically at the various chatbot spiders or at least allowlist known benign crawlers instead of leaving it wide open.

    7 votes
    1. qob

      It's easy for an LLM crawler to identify itself as a search engine crawler or a normal browser. The web service has no way to tell the difference.

      8 votes
    2. heraplem

      I suspect that LLMs could represent the end of the open Web for this reason (among others).

      5 votes
  3. [14]
    Comment deleted by author
    1. 0xSim

      "Yes, I know, there are copyright and morality issues with corporations just taking everything they can get their hands on to train LLMs"

      (...)

      "Oh, no, new technology! Indiscriminately bad! Let's break it for no reason other than we don't like it!"

      Pick one or the other, but you can't say in the same post that you know LLMs have copyright and morality issues, and in the following paragraph, pretend that those who oppose them do it "for no reason other than (they) don't like it".

      Edit:

      Not to mention that it's one thing to not want your own work included, and another thing entirely to try to "dismantle" or "poison" the entire project.

      I won't bother installing Nepenthes (or any other similar tool) on my website, but if the only business model of LLMs is selling stolen copyrighted content, it's truly not my problem if they poison themselves while stealing my content.

      17 votes
    2. [11]
      lynxy

      Ironically, using the Luddites as a comparison to the current day is apt in almost the opposite way to the one I believe you intended. IIRC, the Luddites weren't breaking new technology because they didn't understand it, or were scared of it, or even just because they "did not like it". They were tradesmen whose livelihoods were at risk due to increasing efforts to automate work, and the issues they had were not with the machines themselves, but with the increased financial inequality that was enabled by the use of them. They were fighting against the dehumanising effects of industrialisation, and I think one could argue that scraping human-generated content en masse to feed your LLM datasets, in order to generate more money without original creative input, is similarly dehumanising and destructive.

      17 votes
      1. [10]
        sparksbet

        "They were tradesmen whose livelihoods were at risk due to increasing efforts to automate work, and the issues they had were not with the machines themselves, but with the increased financial inequality that was enabled by the use of them"

        And, in a point that's sorely neglected by most people who bring up the Luddites in a positive way, their approach didn't work. Destroying new technology to prevent the way capitalists use it to harm the working class doesn't work. It has never worked and could never have worked. Anyone who actually cares about the plights of those put out of work or exploited by new technological developments should put their efforts towards developing actual class consciousness rather than ineffectually trying to harm capitalists by failing to destroy their tools.

        5 votes
        1. [2]
          lynxy

          I mean absolutely no ill will with this question, but I'd love you to elaborate on what you mean by "develop class consciousness". My initial impression is that you believe that progress will only be made as a whole, not through small / isolated groups performing particularly loud, potentially violent actions.

          6 votes
          1. sparksbet

            That's part of it, yes, but also that they should look beyond merely the risks to their own jobs of a given technology and unite with other members of the working class to overthrow the system that oppresses them all.

        2. [7]
          JackA

          Given the current state of the world I'm not sure "developing class consciousness" has a great track record either to be fair.

          6 votes
          1. boxer_dogs_dance

            Attempts to form and impose top down socialist utopias have been disastrous. However, forming class consciousness is responsible for the forty hour week in the United States and current forms of social democracy in Europe.

            Labor rights history is fascinating.

            5 votes
          2. [5]
            sparksbet

            It's got a better one than the Luddites.

            2 votes
            1. [4]
              DawnPaladin

              Does it, though? The Luddites destroyed a few machines. "Members of the working class overthrowing the system that oppresses them all" gave us communist Russia and China. If you're going to advocate for socialist revolution, you need a really solid reason why this time it's not going to degenerate into authoritarianism.

              When you say "developing class consciousness", do you have a plan for that beyond posting on social media? It's easy to get into a habit of doing stuff that feels good but doesn't ultimately accomplish anything.

              Looking at the recent MAGA victory, it seems to me that an important element of their victory was that they went beyond filter-bubble social media and developed very popular broadcast media, like for example the Joe Rogan show. (I don't think Rogan self-identifies as MAGA, but I have heard he's been very effective at pushing the younger generation right.) Conservative billionaires are happy to pour money into conservative media because it creates a steady supply of marks, increasing their wealth and creating a positive feedback loop. It's much more difficult for the Left to fund such projects because creating wealth for rich donors is not a goal of leftist media. You could try crowdfunding it, but I predict such a project would be subject to leftist purity testing and infighting.

              I think you are right to criticize the Luddites for doing stuff that doesn't work. I am sympathetic to your goals of stopping oppression. But the "developing class consciousness" project has been going on for some time now and it doesn't seem to be working in the United States.

              2 votes
              1. [3]
                sparksbet
                (edited)

                I'm not going to claim that every workers' revolution achieved its goals or that every state aligned with communism is flawless. But the flawed achievements of workers' revolutions in the past still beat out the absolute uselessness of the Luddites. I'm not trying to mount a robust defense of communism here (and I'm not the best equipped person to do so), but it's inarguable that workers' revolutions accomplished some things and had concrete effects on the world that we still feel. The Soviet Union has plenty about it worth criticizing, but it also accomplished a lot while it existed. There are even still states out there that are aligned with communism, flawed though they may be. By contrast, you're almost certainly wearing something knit by more advanced versions of the technology the Luddites opposed, because they accomplished absolutely nothing. They did not even succeed at delaying the adoption of machine textiles. I think that attempts to oppose modern technology nowadays are similarly futile and fail to address the inequality that those technologies can exacerbate, and I bring up class consciousness because the people defending the Luddites in this way typically approach the issue from a workers' rights perspective.

                2 votes
                1. [2]
                  DawnPaladin

                  100% agree that trying to destroy or boycott new technologies is a completely ineffective strategy.

                  I also agree that AI has the potential to centralize power and increase inequality. That's why I'm studying math in the evenings, working toward getting a degree in AI. If we want good outcomes from this technology, we're going to need people using this technology to produce good outcomes.

                  1. sparksbet

                    I think that's a good perspective! I work in AI (NLP specifically) and it's definitely an added struggle (but an important one!) to find places to work that are making the world better rather than worse.

                    1 vote
    3. Lexinonymous

      "Oh, no, new technology! Indiscriminately bad! Let's break it for no reason other than we don't like it! It's giving luddites of the 19th century, and I think we can do better than that."

      I would suggest an edit pass on your post. There are much better ways of presenting your argument than reducing opponents of AI into a ridiculous strawman, as well as making unfavorable comparisons to a movement that is itself commonly strawmanned.

      12 votes
  4. [3]
    unkz

    This will have partial utility versus intelligent crawlers. They already know about this trick, and defeat it by hitting randomly generated URLs to see if they present content (because many sites customize their 404 pages to return a 200 status code + random links for SEO purposes). Then they will largely ignore your site, which is maybe a good enough outcome -- the poisoning of training data will probably not work though.
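As a rough illustration of the probing trick described above: a crawler can request a few paths that should not exist and treat the site as a link farm if they all come back as real-looking pages. This is only a sketch under stated assumptions. The `fetch` callable, the thresholds, and the function names are hypothetical, and actual crawlers may use different heuristics.

```python
import secrets


def random_probe_path(length=24):
    """Build a path that is overwhelmingly unlikely to exist on a normal site."""
    return "/" + secrets.token_hex(length // 2)


def looks_like_link_farm(fetch, probes=3, min_body_bytes=200):
    """Probe random paths; report True only if every probe returns a full page.

    `fetch` is a hypothetical callable supplied by the crawler:
    fetch(path) -> (status_code, body_text).
    """
    hits = 0
    for _ in range(probes):
        status, body = fetch(random_probe_path())
        # A 200 with substantial content for a nonsense URL suggests
        # generated pages rather than a genuine resource.
        if status == 200 and len(body) >= min_body_bytes:
            hits += 1
    return hits == probes
```

A site that answers nonsense URLs with 404s passes the check, while a tarpit that generates a page for any URL would be flagged and, as the comment says, largely ignored.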

    3 votes
    1. [2]
      Protected

      "hitting randomly generated URLs to see if they present content"

      I'm not sure I understand this. Can you explain in greater detail? Do LLM spiders not crawl locations for which randomly generated URLs don't generate content?

      1 vote
      1. unkz

        It drastically affects the crawling of a site. This system is identical to an automatic keyword content farm, and crawlers want to avoid that kind of thing.

        3 votes