I'm in IT. Although I do think LLMs taken on their own are fascinating, I want to emphasize that the article linked is not hyperbole. It really is that bad, and it's getting worse. If this scraping behavior keeps up (and I see no signs of anything other than acceleration) we're in for a sea change in how the Internet operates, in an almost completely negative way. I wish I had better news to share, but it really is that bad.
This is not at all surprising to me. If you make a public resource, there are always going to be bad actors who will ruin it for everyone else. That's especially true if the resource is something money can be made from. It seems AI has made these resources a monied target, so rather than a few greedy individuals, it's giant corporations.
I see this going the way of email where self-hosting becomes harder due to all the bad actors and the countermeasures to limit their damage that result in brownouts or blocking.
Websites will allow crawling by well-known, well-behaved crawlers from known IP addresses. Everyone else is treated with suspicion. It will be harder to self-host a crawler, which means you might need to use a service to do it for you.
For self-hosted websites you’ll at least need something like Cloudflare in front.
Eventually, captcha might become more strict and many websites harder to use without having an account somewhere vouching for you.
A lot of this already exists.
From the article, it sounds like this doesn't fully encapsulate the issue. The crawlers are using a vast number of IP addresses, many of them part of residential IP blocks. The crawlers are trying to appear to be legitimate visitors, and are not hitting a ton of resources from the same IP blocks.
In many ways the change is long overdue. Security and rights management have always been an issue with networking and the internet (HTTP as an easy example), but the need for more has always outstripped the "hold on, how do we identify and control bad actors" part.
That's just always been "your problem": you put in some basic protections and eventually pay a company like Cloudflare so it's actually "their problem", because you can't handle it when you accidentally make the front page of Reddit or whatever and your site gets blown up by traffic and bad actors.
Now all sites are basically constantly in a state of "blown up", and I'm really wondering if we're going to see the can kicked into the court of someone like Cloudflare again (not only stopping things like DDoS, but requiring some level of registration to access ANYTHING hosted by them to weed out, or profit from, bots?) or just a full shutdown of anything being accessible without some sort of paid account.
I'm sure we'll also see some more creative anti-bot measures (I've seen a few new captchas that try to get obtuse and fuck with bots), but if a request costs 0.01 cents (or more likely some fraction of that), the average user could have a metro-card-style account they load up every now and then, and mass-data-harvesting LLMs would have to pay through the nose.
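A rough back-of-the-envelope illustration of that asymmetry (the per-request price and the traffic volumes below are invented, just to show the scale difference):

```python
# Hypothetical per-request micropayment math; all numbers are made up.
price_per_request = 0.0001                   # dollars, i.e. 0.01 cents
human_views_per_month = 2_000                # a fairly heavy human reader
crawler_requests_per_month = 1_000_000_000   # a large-scale scraping operation

print(f"human:   ${price_per_request * human_views_per_month:,.2f} per month")
print(f"crawler: ${price_per_request * crawler_requests_per_month:,.2f} per month")
# human:   $0.20 per month
# crawler: $100,000.00 per month
```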
I dunno... I think there are solutions, but it does involve some sort of "locks on our doors" approach to the web, and it's been living in a weird and unsustainable state of security and profitization for a long time now.
Edit-
Annnd having read slightly farther @teaearlgraycold already succinctly said most of this.
His conclusion "I will remember which side you picked when the bubble bursts" doesn't seem to align with the reality I see.
I don't think generative AI is a fad which is about to go away. LLMs may be replaced by newer architectures but I suspect the data scraping problem isn't going to change if it does.
So I guess I am trying to say that the problem he describes doesn't sound like it will pass soon. If anything, I expect having an up-to-date snapshot of all public knowledge to feed to a model is going to increase in value as the applications of generative AI grow in capability and value.
I don't know the solution to the problem. Maybe having a fast cache for changes or publishing diffs so a full rescan isn't continually needed? This is a complex problem.
I don't think it will be easy to reliably distinguish AI from human visitors on a web server, so maybe technical solutions that merely mitigate the problem are what could work.
I do think there is a bubble that will burst, but it's not LLMs in general. The bubble I think will burst is the massive amount of VC funding being shoveled into it. The reason I think that is that the companies training models and such are seemingly lighting money on fire while just praying that they'll stumble into a sustainable business model.
yeah, the dot-com bubble burst but we still have the Internet and online shopping. I suspect it'll be similar for LLMs.
I suspect it just collapses down to the few players who can afford to really do it. Everyone winds up paying for GPT/CoPilot/whatever and those are your options. In the meantime, every time someone throws an if statement in their code it gets an AI tag and a 40% price bump.
Agreed on this for sure. For the longest time the idea that "I COULD HAVE ALL THIS DATA" didn't matter, because it was mostly fucking useless, or used in ways that most people didn't care about. Scraping Reddit posts for hints on someone's creditworthiness wasn't worth the time or effort or whatever.
LLMs, for better and worse, are the first real thing that you can just keep throwing data at and in theory see returns (diminishing as hell imo, but it's there). The bubble isn't going to burst, but things are going to change.
we're talking git repos from what I understand, the diffs are available, it's just that the bots are not making use of them for whatever reason.
The problem is the financial incentive to lie about the traffic. Reddit's API changes are a good example. If there's a free way for real humans to get the data, any company that wants to scrape the data for free has a financial incentive to try that first before considering any paid APIs.
If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
Well, it looks like my research contributions to transformer positional encoding have put me on his permanent blacklist. Possibly the demonstration I did of a video synthesis model trained purely on Creative Commons data, too, depending how broad that "et al" is.
I know it doesn't really matter in the scheme of things (I'll probably never cross paths with DeVault, outside the possibility of a chance encounter at a conference), and I honestly don't know whether that was genuinely intended to be a total, uncompromising statement of position or if it was a bit of deliberate hyperbole meaning "Meta, OpenAI, and other similar companies have serious ethical issues I object to", but either way it irritates me.
It's a flippant, overly broad, overly aggressive take on a nuanced situation and I can pretty much guarantee he wouldn't be as measured as this in his response if the tables were turned. It's almost an "I'm not mad, just disappointed" situation: he's more than capable of taking a reasoned stance on the balance between an incredibly powerful underlying technology that has huge potential benefits on the one hand, and the unmitigated assholery of many companies driving the field on the other. Bonus points for a tangent about the grey areas involved in maintaining one's own ethical integrity and where to draw lines when choosing a path to earn a living in a world dominated by amoral behemoths. That's an interesting and important conversation to have, but once again he's gone with two giant middle fingers to everyone involved instead.
Since he linked to his posts on the Go module mirror situation from near the top of the post, I'll drop in my own thoughts on how misleading he was in his representations of that at the time as well.
If I am being honest, looking also at your previous links, this strongly gives off the vibe of two similarly strong personalities clashing, with the truth somewhere in the middle. I honestly can see your points; at the same time, your takes aren't exactly nuanced or measured either.
Is the way they communicate their frustration in these situations overly antagonistic? Possibly. Does that invalidate their points entirely? I don't think so.
You're certainly not wrong! I was thinking after I wrote that about why this guy in particular gets under my skin so much, and the conclusion I came to was that I probably have a good amount in common with him.
I'll readily admit to strong opinions, not always expressed so well, although I do at least make a concerted effort to consider all possible sides of a complex issue - if there are points of nuance you think I'm missing I'd be genuinely interested to know, but on the rest I'll willingly say mea culpa.
Yeah, agreed - that's kind of what I was going for with "I'm not mad, just disappointed" in the comment above, and talking about wanting the same outcomes in that other thread. A decent amount of the time I'd like to consider myself on the same side as him, and more often than not I end up frustrated instead.
Not often that I see genuine self reflection in online conversations. Just a positive I wanted to acknowledge :)
Well since you asked, taking hyperbole as truth and then adding to it is how I would put it. I get that it feels personal to you given your direct involvement, then again you are not the one creating LLMs by crawling the internet. Even if it wasn't outright stated by Drew it is clear that his ire is very much aimed at those.
Focussing so heavily on that and then just briefly on what should have been the focus, to me, comes across as doing effectively the same.
Meanwhile, it seems to me that another way of approaching it would be to recognize that it actually isn't aimed at you. Possibly recognize that companies involved in this "unmitigated assholery" are poisoning the well for the rest of the industry, or at least for people doing research in the field.
Because, yes, there are incredible potential benefits to be gained. But you also don't have to look far to see that a huge part of the current "AI" industry isn't moving toward that goal or contributing to it, and is potentially harming future adoption.
This is some very worthwhile food for thought and I appreciate it, thank you for taking the time! It's tricky sometimes to see these things clearly from the inside looking out.
I don’t think that he is trying to be as antagonistic as the literal interpretation of that statement reads, but he is also a fairly polarizing figure already so I’m not really going to spend the time to defend his stance.
That's an interesting and important conversation to have, but once again he's gone with two giant middle fingers to everyone involved instead.
Hell, even tangentially involved. I'd make his list of mortal enemies on multiple counts: working on developing adjacent tech, using Copilot, and legitimizing them. My research would probably make most of the problems he's talking about a lot better (less data- and energy-hungry models), but apparently I'm still on the list.
I'm not sure how I feel about the body of the article, but the last two paragraphs are the only ones where I have confidence in my assessment, and my assessment of them is poor.
We might need a standard protocol for micropayments towards previously free online services.
Seems to me that it would mostly serve to exclude people less fortunate, and maybe exclude a few of these scrapers, while others with oodles of VC money will just throw that cost into the mix.
So, it would maybe solve the issue of cost for free services but at a degraded experience.
I've long thought that governments, who already have control over the economy in terms of currency, should have a means of digital currency exchange paid for by taxes. And this doesn't have to just be for buying access to API calls or things like that. For instance, I won't pay for a subscription to a news service because it's a lot of money to ask for when I just want to read one article. If I could just spend $0.25 or even a full dollar for access to an article, I would pay it rather than trying to find ways to bypass the paywall.
HTTP already has a "payment required" status code. It would be interesting if browsers could slap a friendly frontend on that, where you wouldn't have to deal with the page you're trying to view in terms of entering payment info. Maybe the browser just says "this page is $0.25, continue?" and that's it. Or you could even set it to automatically pay if the price is under a certain amount and show a little icon in the address bar (like the TLS lock).
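For illustration, here's a minimal sketch of what the server side of a 402-based flow could look like. There's no standardized way to advertise a price or prove payment today, so the header names below are invented for the example:

```python
# Toy HTTP 402 paywall; "X-Price-USD" and "X-Payment-Token" are invented
# headers, not part of any standard.
from http.server import BaseHTTPRequestHandler, HTTPServer

PRICE_USD = "0.25"

class PaywalledHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("X-Payment-Token"):
            # A real implementation would verify the token with a payment provider.
            status, body = 200, b"<html><body>The article text.</body></html>"
        else:
            status, body = 402, b"Payment required."
        self.send_response(status)
        if status == 402:
            self.send_header("X-Price-USD", PRICE_USD)  # tells the browser what to offer
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8402), PaywalledHandler).serve_forever()
```

A browser could then turn the 402 response into a "this page is $0.25, continue?" prompt instead of rendering the body.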
I can't wait to pay just to see if a page might contain what I'm looking for.
And more BuzzFeed style lists where every item is on a new page charged separately. The extra cost for viewing first place will surprise you!
Imagine it on a tutorial where every page is free except the final step and that step is both difficult to guess and insanely expensive, but now you're stuck hours down this tutorial and really want to finish it.
Maybe there could be a way to get your money back, like returning your cart at the grocery store.
I don't think that really works in this situation. The cart is a physical, scarce item. Information isn't like that. If people can get their money back, so can LLMs, and also so can people who genuinely did find the page helpful but just want their money back.
I think there would have to be some sort of exit barrier. As a terribly shitty example, a user pays money to access the page, and then gets their money back by completing a captcha or something. Bots scraping your site are then paying you to access it, but humans are free to get their money back and leave.
Of course that introduces different challenges/problems, but I'm just spitballing for the sake of discussion.
Why not just... have a captcha?
Hey, I said it was a terribly shitty example didn't I?
Edit: Plus, the site owner is financially compensated when bots access the site.
Still better than a paywall where you have to subscribe for a year to find out if the page is of interest to you.
Also, thinking it through and reading the comments, the UI would definitely have to ask you to approve, but maybe you could whitelist sites you trust.
Paying for things to help prevent abuse of them seems like a good idea to me. I used to think this would be needed to stop email spam: if there was a higher cost per email then it would stop the spam. But then other methods of reducing the spam to a reasonable level were implemented, so charging per email sent never took off.
So there will probably be better countermeasures against mining and scraping eventually. I assume that there will be a technical arms race between the mining and scraping bots and internal bots/algorithms to keep them out.
I've used APIs to get data from bitbucket before and it already has measures to prevent a single host from pulling too much data. But this mostly blocks accidental and weak attempts, more serious actors must be spoofing the origin to get around it but I think there will be methods to detect that better.
I know the Interledger Foundation is making some effort in this area, like the Web Monetization API. But it's only a proposed standard at the moment; I remember looking into this 5 years ago and I'm not sure how much has changed since. But I guess both web and payments standards are slow-moving areas.
At least there does seem to be some real money behind Interledger; they have grants for people working to implement their open standards, so that hopefully means it's not just some lofty ideals without effort behind them, or vaporware or similar. The future (though perhaps far off) will tell!
It might be nice, but I don't see that as likely to happen when it didn't for other web APIs?
Maybe there will be another level of indirection: a web crawler service that many AI companies use because it knows how to behave and therefore doesn’t get blocked?
And they might pay for access.
So like…blockchain?
Not really?
You can do micropayments with modern banking. The only thing blockchain does is outsource the transaction ledger by saying "well this many people agree you got paid, so you did".
This could be useful if you have a shitton of transactions, which this could generate, but it might still just make more sense to have the same kind of server farms companies like Cloudflare run simply log and handle the transactions.
Ironically, I don't think there's a coin out there that can actually handle anything near the volume you'd be dealing with in something like this. I believe SOL was making claims about it forever ago, but I know that's a very hit or miss project and a touchy one to boot. Bitcoin/ether certainly wouldn't make sense.
Anything's possible, but it's definitely a non-trivial problem at best.
Having to have a consensus mechanism at all is always going to cause a slowdown. Maybe someone'll be able to get it within a few orders of magnitude of a non-blockchain database, but that's still not going to be enough to handle the kind of transaction volumes you'd need for a modern mainstream payment system, much less something orders of magnitude larger still, like micropayments for HTTP access.
Where's the value-add? We don't need a public ledger of who paid for which sites, in fact, that's probably something we literally don't want (for privacy reasons). What seems more practical is something like Venmo where they maintain their own ledger of who has how much money, so you avoid transaction fees and improve performance. Then you pay with the "fake" money that can be easily turned into "real" money.
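The operator-held ledger that comment describes really is just balance bookkeeping; a toy sketch (accounts and amounts are made up):

```python
# Toy internal ledger: balances live in one place, so a micropayment is two
# updates with no per-transaction processor fee and no public record of who
# read what.
balances_cents = {"alice": 1000, "example-news-site": 0}

def pay(payer: str, payee: str, amount_cents: int) -> bool:
    """Move amount_cents from payer to payee if the payer can cover it."""
    if balances_cents.get(payer, 0) < amount_cents:
        return False
    balances_cents[payer] -= amount_cents
    balances_cents[payee] = balances_cents.get(payee, 0) + amount_cents
    return True

pay("alice", "example-news-site", 25)   # one $0.25 article view
print(balances_cents)                   # {'alice': 975, 'example-news-site': 25}
```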
Little update: Seems like Sourcehut has added Anubis to certain pages. Commit pages seem to trigger it for example:
https://git.sr.ht/~bptato/chawan/commit/e1139d4
And Xe also launched a dedicated website and repo for Anubis under a company called "Techaro", described as "the anti-AI AI company based in Canada":
https://anubis.techaro.lol/
https://github.com/TecharoHQ/anubis
Techaro has "existed" for a fair bit now, but until Anubis its sole purpose has been as the archetypal "modern tech" company in xe's speculative fiction stories. I'm very amused that now they have an "actual product".
Not in IT so understanding the terminology here is challenging at best, but what I could grasp points to an awful sounding problem that I'm betting my organization is having a hard time with, too.
It's also a reminder of just how long it's taking the broader public to realize how bad today's internet is for them. That's a values based judgement, of course, but I'm legitimately surprised how the flash and fun of it all continues to obscure the nastier truth behind today's web.
...and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure...
That sounds like a botnet of hacked computers and IoT devices. No company owns such a set of IP addresses. I suspect there are a couple of layers to 'whitewash' this data. Tech giants or other companies don't do the scraping themselves, they get it from a specialized company. And maybe that specialized company gets it from another. All the way down to criminal gangs of hackers that sell access to hacked consumer devices.
Our company gets this constantly; I'd estimate it is now over 60% of our overall traffic. All the standard bot detection tools haven't been able to detect most of it, which makes me think their heuristics suck and they are relying heavily on user agent strings. Between this and the regular search engine crawlers, we're getting millions of requests a day that are generally useless.
So, if it's so close to legitimate behavior, how are you sure it isn't legitimate?
The traffic stands out like a sore thumb. The traffic comes in bunches, usually groups of 5 or so residential IPs.
Each IP will make requests with random user agents (likely truly random, sometimes it's a handheld game console), making between 80-200 requests that are all over the place (like 120 pages into a search result, but none of the others) over the space of 5 minutes. Then it's gone and rarely comes up again. It's the oddest, but definitely deliberate, distributed scraping nonsense I've ever seen.
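If you wanted to surface that pattern from your own access logs, one crude heuristic is "many requests plus many distinct user agents from one IP in a short window." A sketch, with invented thresholds and log parsing left out:

```python
# Flag IPs that make lots of requests with many different user agents in a
# short window -- roughly the burst pattern described above. Thresholds are
# arbitrary; tune them against your own traffic.
from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(minutes=5)
MIN_REQUESTS = 80
MIN_DISTINCT_UAS = 10

def suspicious_ips(log_entries):
    """log_entries: iterable of (timestamp, ip, user_agent), timestamp as a datetime."""
    by_ip = defaultdict(list)
    for ts, ip, ua in log_entries:
        by_ip[ip].append((ts, ua))

    flagged = set()
    for ip, events in by_ip.items():
        events.sort()                      # order by timestamp
        start = 0
        for end in range(len(events)):
            while events[end][0] - events[start][0] > WINDOW:
                start += 1                 # shrink window back to 5 minutes
            window = events[start:end + 1]
            if (len(window) >= MIN_REQUESTS
                    and len({ua for _, ua in window}) >= MIN_DISTINCT_UAS):
                flagged.add(ip)
                break
    return flagged
```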
Why are people doing this? And why are they doing it like this?
I understand that you want data to train your model. So some company training a big model, like Google or Meta or OpenAI, scrapes everything. But these are big operations; you wouldn't expect them to be running around evading IP bans and ignoring robots.txt.
Who is in the intersection of "owns a jillion GPUs to train models" and "acts like an asshole script kiddie with a botnet"? Who is going to buy a crawl dataset from a cybercriminal and use it as a core part of their business proposition? Why don't they freakin' cooperate with each other like Common Crawl and not fetch the same data over and over again?
Or is this just cyber-colonialism coming home, where the software has run out of world to eat, and starts creating economic incentives to eat other software with the same rule-breaking disruptiveness?
I wish I knew the answers to this as well. One of the distributed scraper groups comes from a few data centers owned by Tencent according to the IPs; they also do the random user-agent behavior, but at least they aren't residential IPs, though many of them are in different countries. It is so irritating, but we wouldn't even block them entirely if we could identify them, we'd simply rate limit them to a sane amount of traffic so they couldn't momentarily crush us with more than 30 times our normal traffic that then vanishes minutes later.
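That kind of "rate limit rather than block" policy can be as simple as a token bucket keyed per source network. A minimal sketch; the IPv4 /24 grouping and the numbers are arbitrary assumptions:

```python
# Token-bucket rate limiting per /24 (IPv4 assumed); RATE and BURST are arbitrary.
import time
from ipaddress import ip_network

RATE = 5.0     # replenished requests per second per subnet
BURST = 50.0   # maximum bucket size

buckets = {}   # subnet -> (tokens, last_update)

def allow(client_ip: str) -> bool:
    """Return True if the request should be served, False if it should get a 429."""
    subnet = ip_network(f"{client_ip}/24", strict=False)
    now = time.monotonic()
    tokens, last = buckets.get(subnet, (BURST, now))
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        buckets[subnet] = (tokens, now)
        return False
    buckets[subnet] = (tokens - 1.0, now)
    return True
```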
I could see this being a pivot for the crypto mining farms. Perhaps they have made a judgement call that this will be more valuable in the long run. Or, they are using just a subset of their farm for this.
And technically, data collection is a separate process from AI training. So it could be multiple parties at play.
Tech companies absolutely do scraping themselves and have for a while.
If there are botnets of residential IPs out there scraping websites, there are almost assuredly botnets commenting/voting/retweeting etc. I know this is already obvious but another reason to not bother engaging with larger social media sites other than via the content itself.
Yep, there most certainly are. And not just on remote servers utilizing residential IPs either. E.g.
A phone bot farm in action
p.s. ignore the video description, that's actually a MaxPhoneFarm located in Vietnam operated by MIN Software. Pics: https://maxphonefarm.com/du-an-trien-khai/
Hi I don't work in tech but this seems important/significant and I'd like to understand what's going on... Can someone please ELI5?
Artificial intelligence (AI) needs a lot of data. That data is collected by crawling through every website and its sub-pages. Depending on how this collection is done, it can use a lot of computing resources and thereby incur a big cost for the entity hosting the websites.
There's an arms race between the websites and the data collectors to detect and evade detection respectively.
Not really. Those collectors are just super lazy. In this case, they could have performed a regular git clone and gotten all the data, with complete history, efficiently, and then ingested it locally. Once in a while, they could git fetch to retrieve any updates. Again, super efficient. Instead the collectors opted for using the website to interpret the data for them, in the form of a webpage, which is brutally inefficient. The guy running the website is rightfully angry.
They're not lazy regarding evasion. They're just lazy about adapting their generic collection logic to specific sites to be less resource hungry, e.g. using the repositories instead of the web interfaces of code forges.
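For reference, the "clone once, then fetch updates" approach described above is roughly this (the repo URL and paths are placeholders):

```python
# Keep a local bare mirror of a repository: full clone on first run, cheap
# incremental fetches afterwards, and all further processing happens locally.
import subprocess
from pathlib import Path

def sync_repo(url: str, dest: Path) -> None:
    if dest.exists():
        # Only new objects are transferred on subsequent runs.
        subprocess.run(["git", "-C", str(dest), "fetch", "--prune"], check=True)
    else:
        # One-time full copy, including complete history.
        subprocess.run(["git", "clone", "--mirror", url, str(dest)], check=True)

sync_repo("https://git.sr.ht/~someuser/somerepo", Path("mirrors/somerepo.git"))
```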
Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.
[...] and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
(emphasis mine)
What puzzles me is how exactly they are able to use end-user ranges. Perhaps they pay someone who has a large install base to do the collection for them, but who only supports simple HTTP GETs?
Like some shady antivirus software company or something like that.