In the corners of the tech culture venues I follow I've noticed a lot of murmuring about Google generally just becoming less and less effective. Like, it's still very good at finding results about something current or topical, but it's gotten much worse about anything more than a few months old. Even in cases where you can remember specific sentences from an article, it's just getting harder and harder to find it. This is a far cry from when I could roughly describe a famous picture and Google would usually return it.
Honestly I lay the blame partly with Google itself. Some of the technical improvements they've encouraged, like using HTTPS, have been for the better, but for the longest time they've allowed the SEO industry to game their algorithms in very overt ways, and they never took as hard a stance against it as they should have.
This article speaks to many of the issues I have with web-capitalism. I spend a lot of time editing Wikipedia and other wikis, which naturally demands that I be able to reference accessible online material. So it is extremely frustrating when a site will obliterate its archives for effectively no reason beyond SEO, or laziness. I would not be able to do the work that I do without the use of the Wayback Machine and other archival websites. They are an incredible resource that I recommend everyone take advantage of.
"Content pruning" is an effective SEO tactic on large, established websites. Rather that archiving old content with historical significance, many websites will delete it from their servers and return a 410 status code. Gone. The goal is to optimize "crawl budget," keeping Google focused on the content that matters now. The result is a web without institutional memory or accountability.
I'm not convinced that deleting old content from one's website actually helps SEO in a meaningful way. At the very least, "optimizing the crawl budget" does not necessitate the deletion of content. I personally think that this term is misleading, placing too much emphasis on the number of pages that can be indexed "within the budget," and not enough on the way that their content is presented. You're significantly better off cutting out all the useless JavaScript bloating your site and just using more internal links to connect all of your pages together. MediaWiki sites, for instance, have plenty of maintenance pages helping editors keep track of these things (example). If a search engine doesn't have to waste time and resources trying to meticulously navigate your site to begin with, it doesn't matter if you have a lot of pages, even if not all of them are necessarily topical.
The idea of using 410 for crawl budget reasons is silly. This is rarely a problem outside of misbehaving scripts (e.g. calendar plugins that generate infinite pages).
If you really have so many legitimate pages that you need to worry about crawl budget, then use the robots directives via meta tag or robots.txt instead. Using 410 is only going to hurt you.
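A rough sketch of those two robots mechanisms, written out as Python string constants so it stays self-contained; the paths are hypothetical.

    # Two ways to keep crawlers out of old sections without deleting anything.

    # 1. robots.txt: tells crawlers not to fetch the archive section at all.
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /archive/
    """

    # 2. A robots meta tag: the page can still be fetched and read, but it is
    #    kept out of the index. (noindex only works if the page is NOT also
    #    blocked in robots.txt, since the crawler has to fetch the page to see the tag.)
    NOINDEX_META = '<meta name="robots" content="noindex, follow">'

    with open("robots.txt", "w", encoding="utf-8") as f:
        f.write(ROBOTS_TXT)
    print(NOINDEX_META)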
I didn't read the article; I just want to make something other posters have speculated about entirely clear: there's no such thing as a crawl budget.
Modern search engines want to crawl everything. They're only going to stop crawling if they determine they're stuck in a loop, or they've crawled past a certain depth, or there are strong signals of poor quality (see the sketch after this comment).
However there was, once upon a time, a crawler people called Google FreshBot that would hit certain pages more often, independent of normal indexing. There hasn't been any credible information about it for a long time, but all signs suggest that aspects of the behavior still exist in some form. It's safe to assume that it still has the ability to identify frequently updated pages (like your front-page news feed) and hit them (and the new pages they link to) more often.
So even if there was a budget, deleting archives would have no benefit, because the budget in question would be related to the freshness part of the crawl system, rather than the more methodical part that wants everything.
Bottom line is that if you're not getting crawled, you definitely have problems to solve, but all that deleting things will accomplish is reducing the amount of unique content you have, which is essentially throwing away free traffic.
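To make the loop/depth/quality stopping conditions mentioned a few paragraphs up concrete, here is a purely illustrative sketch of a depth-bounded crawler. The helpers (fetch_page, extract_links, looks_low_quality) are placeholders for whatever fetching, parsing and quality heuristics a real crawler uses, not anything Googlebot actually does.

    from collections import deque
    from urllib.parse import urljoin, urlparse

    def crawl(seed_url, fetch_page, extract_links, looks_low_quality, max_depth=10):
        # Breadth-first crawl with three stopping conditions:
        # a visited set (loop detection), a depth bound, and a quality check.
        seen = {seed_url}
        queue = deque([(seed_url, 0)])
        site = urlparse(seed_url).netloc
        while queue:
            url, depth = queue.popleft()
            if depth > max_depth:           # stop past a certain depth
                continue
            html = fetch_page(url)
            if looks_low_quality(html):     # strong signals of poor quality
                continue
            for link in extract_links(html):
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == site and absolute not in seen:
                    seen.add(absolute)      # never revisit: avoids crawl loops
                    queue.append((absolute, depth + 1))
        return seen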
Nobody actually wants to go to these "optimized" sites, though, right? It must be possible to build a better search engine that can tell the difference between a useful web site and the top of someone's sales funnel, seeing as any human who's been on the Internet for a while can do the same thing.
If you have a search engine, then you have search engine optimization. You need only look at the Yellow Pages for examples of that. An alphabetical listing results in companies naming themselves "AAA Towing" and such.
That's debatable with Google at least. It's still possible to game Google's SEO rankings, but its algorithm has been shifting more and more towards Google Brain black-boxy kind of stuff and less of the deterministic statistical variety that PageRank and its ilk were. I'm sure there will always be some statistical PageRank stuff in there, and thus it will be possible to game it a little, but it'll get harder and harder as time goes on.
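For reference, the "deterministic statistical variety" being gamed is things like PageRank, which at heart is just repeated averaging over the link graph. A toy power-iteration sketch over a made-up three-page graph:

    # Toy PageRank by power iteration; links maps each page to the pages it links to.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    damping = 0.85
    rank = {page: 1.0 / len(links) for page in links}

    for _ in range(50):
        rank = {
            page: (1 - damping) / len(links)
            + damping * sum(rank[p] / len(outs) for p, outs in links.items() if page in outs)
            for page in links
        }

    print(rank)  # pages that important pages link to end up ranked higher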
I’d think that their index would be so small as to be useless. The whole utility of a search engine is that it searches the whole web (or at least the non-dark-web that allows itself to be indexed). The web is decentralized by design—no authority is supposed to decide who is "listed" and who isn’t.
There is tons of noise on the web.
Sure, but noise to me may not be noise to you, and giving a central entity the power to make that determination is problematic (this is one reason not to like Google’s monopoly on search). There have been attempts to do this in a decentralized way, like DMOZ and WOT. They all fell short because they were largely manually curated and couldn’t scale.
I'm not sure DMOZ really fell short. It just became outdated. It stopped making sense to find websites by category then alphabetical listing when you could simply ask for what you wanted.
It fell short because it couldn’t compete with Google (and Lycos, Yahoo!, ...). It makes total sense. It’s just less convenient.
How does opting in solve the problem? Why would any site owner not opt in?
How is this different than a centralized version of robots.txt?
Because unless you are a noncommercial entity (like a nonprofit organization or just a person who has made a website as a hobby pretty much) it would cost money like I mentioned above.
How do you determine and enforce this, other than checking the application for each site manually? Are you going to check each site on an ongoing basis to see if it changed ownership? This just sounds like a lot of work, and the fees for commercial sites would have to be enormous to pay for all that work.
No other service on the internet purports to police the rest of the web, though. It’s too big a job for a centralized entity to take on. Let’s say some startup in Tibet makes a website and applies to register. The CCP files a complaint saying they are a scam site. Does your legal department have anyone who reads Tibetan? Is your legal department going to fight the CCP every time they file a complaint against sites that they don’t like for political reasons? Replace the CCP with any other entity with corrupt intentions, and Tibetan with any other minority language. You’d basically just be paying huge sums of money to lawyers and consultants to get things right, or you’d end up with a broken system that nobody would use.
I don't understand how this would be any more "policing the web" than Google punishing certain websites for gaming their algorithm in various ways.
It’s not much different except that Google is doing it all automatically, not manually.
People would be reporting web sites for violating your terms of service.
And what do you do when bad actors effectively DDOS your reporting system with spurious claims?
This is all really quite sensible.
I’m not arguing it’s not sensible. I’m saying it’s not feasible.
You ignore them.
How do you determine which to ignore?
And "feasible" and "sensible" mean almost the same exact thing.
Absolutely not! Feasibility has to do with the realities of how difficult something is. Sensibility has to do with whether something makes sense. There are plenty of sensible, infeasible things. E.g., curing cancer, sending humans to explore other planets, stopping pollution on earth, etc.
There are also plenty of nonsensical, feasible things. E.g., running into traffic, intentionally clogging a sink, eating feces, etc.
By your logic, internet journalism is a giant contradiction since anyone with a botnet (like people who don't want the truth to get out) can literally DDoS your website.
There are solutions to this, like CloudFlare. But your proposed system needs to be able to take in claims and classify them. Without throwing the proverbial baby out with the bath water, this will be as difficult as bad actors decide to make it. Compare to YouTube’s takedown claims system. It’s been largely automated, because it’s infeasible to manually check every claim or upload that gets automatically flagged. Even so, it’s very expensive, not particularly useful to users, and to my knowledge YouTube is basically still subsidized by Google.
Edit: By DDOS here I don’t mean deny service on a technical level. I just mean that bad actors will make so many spurious claims that you won’t be able to find the legitimate claims amongst all the noise.
craigslist.org must blow your mind.
Craigslist is nice, but it’s small and all manually driven. The web is several orders of magnitude larger in scale than Craigslist. People joke about “web-scale” in the technosphere, but it’s actually a thing that few organizations are capable of dealing with.