[2]

NaraVara

April 6, 2020

In the corners of the tech culture venues I follow, I've noticed a lot of murmuring about Google generally becoming less and less effective. Like, it's still very good at finding results about something current or topical, but it's gotten much worse at anything more than a few months old. Even in cases where you can remember specific sentences from an article, it's getting harder and harder to find it. This is a far cry from when it felt like I could roughly describe a famous picture and Google would usually return the picture.

20 votes

Death
April 7, 2020

Honestly, I lay the blame partly with Google itself. Some of the technical improvements they've encouraged, like using HTTPS, have been for the better, but for the longest time they've allowed the SEO industry to game their algorithms in very overt ways, and they never took as hard a stance against it as they should have.

5 votes

[2]

Atvelonis

April 6, 2020 (edited April 7, 2020)

This article speaks to many of the issues I have with web-capitalism. I spend a lot of time editing Wikipedia and other wikis, which naturally demands that I be able to reference accessible online material. So it is extremely frustrating when a site will obliterate its archives for effectively no reason beyond SEO, or laziness. I would not be able to do the work that I do without the use of the Wayback Machine and other archival websites. They are an incredible resource that I recommend everyone take advantage of.

"Content pruning" is an effective SEO tactic on large, established websites. Rather than archiving old content with historical significance, many websites will delete it from their servers and return a 410 status code. Gone. The goal is to optimize "crawl budget," keeping Google focused on the content that matters now. The result is a web without institutional memory or accountability.
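
To make the status codes concrete, here is a minimal sketch in Python of what the pruning choice amounts to on the server side. The paths and the `status_for` helper are hypothetical and not taken from any real site; the only standard pieces are the `http.HTTPStatus` values.

```python
from http import HTTPStatus

# Hypothetical site contents, for illustration only.
LIVE = {"/news/today"}
ARCHIVED = {"/2009/old-story"}     # old, but still served to readers
PRUNED = {"/2008/deleted-story"}   # deliberately deleted ("content pruning")

def status_for(path: str) -> HTTPStatus:
    """Pick the HTTP status a server might return for a requested path."""
    if path in LIVE or path in ARCHIVED:
        return HTTPStatus.OK          # 200: the page is still available
    if path in PRUNED:
        return HTTPStatus.GONE        # 410: removed on purpose, not coming back
    return HTTPStatus.NOT_FOUND       # 404: never existed, or unknown

if __name__ == "__main__":
    for path in ("/2009/old-story", "/2008/deleted-story", "/no-such-page"):
        status = status_for(path)
        print(path, int(status), status.phrase)
```

Keeping a page in the archived set costs only storage; moving it to the pruned set is what makes it disappear for readers and for anyone citing it.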

I'm not convinced that deleting old content from one's website actually helps SEO in a meaningful way. At the very least, "optimizing the crawl budget" does not necessitate the deletion of content. I personally think that this term is misleading, placing too much emphasis on the number of pages that can be indexed "within the budget," and not enough on the way that their content is presented. You're significantly better off cutting out all the useless JavaScript bloating your site and just using more internal links to connect all of your pages together. MediaWiki sites, for instance, have plenty of maintenance pages helping editors keep track of these things (example). If a search engine doesn't have to waste time and resources trying to meticulously navigate your site to begin with, it doesn't matter if you have a lot of pages, even if not all of them are necessarily topical.

15 votes

Wes
April 7, 2020

The idea of using 410 for crawl budget reasons is silly. Crawl budget is rarely a problem outside of misbehaving scripts (e.g., calendar plugins that generate infinite pages).

If you really have so many legitimate pages that you need to worry about crawl budget, then use the robots directives via meta tag or robots.txt instead. Using 410 is only going to hurt you.
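
As a rough sketch of the robots.txt route, the snippet below uses Python's standard `urllib.robotparser` to check which URLs a well-behaved crawler would skip. The `/archive/calendar/` rule, the example.com URLs, and the `is_crawlable` helper are all assumptions made up for illustration. Note that robots.txt only controls crawling; keeping an already-known page out of the index is the job of the meta tag directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of an auto-generated calendar,
# but leave the rest of the archive online and crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/calendar/
"""

def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for url in (
        "https://example.com/archive/2009/old-story.html",
        "https://example.com/archive/calendar/2035/01/",
    ):
        print(url, "->", "crawl" if is_crawlable(url) else "skip")
```

Either way the old pages keep returning their content to readers and archivers, which a 410 does not.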

3 votes

post_below

April 7, 2020

I didn't read the article; I just want to make something other posters have speculated about entirely clear: there's no such thing as a crawl budget.

Modern search engines want to crawl everything. They're only going to stop crawling if they determine they're stuck in a loop, or they've crawled past a certain depth, or there are strong signals of poor quality.
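
The loop and depth conditions can be sketched with a toy breadth-first crawler. This is not how any real search engine is implemented; the `SITE` link graph, the `crawl` function, and the depth limits are all made up for illustration, and the quality signals are left out entirely.

```python
from collections import deque

# Hypothetical in-memory site: each page maps to the pages it links to.
SITE = {
    "/": ["/news", "/archive"],
    "/news": ["/", "/news/today"],
    "/news/today": ["/news"],
    "/archive": ["/archive/2009"],
    "/archive/2009": ["/archive/2009/story"],
    "/archive/2009/story": ["/archive"],  # links back up: a loop
}

def crawl(start: str, max_depth: int) -> dict[str, int]:
    """Breadth-first crawl; returns every page reached with its click depth."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] >= max_depth:
            continue  # depth cutoff: do not follow links from this page
        for link in SITE.get(page, []):
            if link not in depth_of:  # visited check: avoids looping forever
                depth_of[link] = depth_of[page] + 1
                queue.append(link)
    return depth_of

if __name__ == "__main__":
    print(sorted(crawl("/", max_depth=2)))   # the deep story is not reached
    print(sorted(crawl("/", max_depth=10)))  # every page is reached
```

In this toy model the archived story is missed only when it sits deeper than the cutoff, so linking to it from a shallower page fixes the problem without deleting anything.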

However, there was, once upon a time, a crawler people called Google FreshBot that would hit certain pages more often, independent of normal indexing. There hasn't been any credible information about it for a long time, but all signs suggest that aspects of the behavior still exist in some form. It's safe to assume that it still has the ability to identify frequently updated pages (like your front page new feed) and hit them (and the new pages they link to) more often.

So even if there were a budget, deleting archives would have no benefit, because the budget in question would be related to the freshness part of the crawl system, rather than the more methodical part that wants everything.

Bottom line is that if you're not getting crawled, you definitely have problems to solve, but all that deleting things will accomplish is to reduce the amount of unique content you have, which is essentially throwing away free traffic.

8 votes

[22]

PendingKetchup

April 6, 2020

Nobody actually wants to go to these "optimized" sites, though, right? It must be possible to build a better search engine that can tell the difference between a useful web site and the top of someone's sales funnel, seeing as any human who's been on the Internet for a while can do the same thing.

4 votes

[21]
bloup
April 6, 2020

What would you think of a nonprofit organization that provides a search engine service, but where you have to apply to be listed on the search engine? Commercial listings could cost money. SEO could basically just be explicitly prohibited. And no ads.

2 votes
1. [3]
  Wes
  April 7, 2020
  
  If you have a search engine, then you have search engine optimization. You need only look at the Yellow Pages for examples of that. An alphabetical listing results in companies naming themselves "AAA Towing" and such.
  
  9 votes
  1. sigma
    April 7, 2020
    
    That's debatable with Google at least. It's still possible to game Google's SEO rankings, but its algorithm has been switching more and more towards Google Brain black-boxy kind of stuff and less of the deterministic statistical variety that PageRank and its ilk were. I'm sure there will always be some statistical PageRank stuff in there, and thus it will be possible to game a little, but it'll get harder and harder as time goes on.
    
    4 votes
  2. bloup
    April 7, 2020 (edited April 7, 2020)
    
    Doesn't mean you couldn't ban search engine optimization. Google has literally punished people for trying to game results in certain ways. Remember this? https://genius.com/Genius-founders-rap-genius-is-back-on-google-annotated
    
    I don't really think that "people are going to find a way to do SEO no matter what you do" is a compelling reason to not explicitly ban it. I mean, pitchers are always going to find a way to throw spit balls.
    
    1 vote
2. [17]
  onyxleopard
  April 7, 2020
  
  I’d think that their index would be so small as to be useless. The whole utility of a search engine is that it searches the whole web (or at least the non-dark-web that allows itself to be indexed). The web is decentralized by design—no authority is supposed to decide who is "listed" and who isn’t.
  1. [16]
    bloup
    April 7, 2020
    
    There is tons of noise on the web. I am not saying that it should be hard to be listed on a search engine, only that you should have to take active measures to make it happen, instead of the engine pointlessly indexing parked domains and other sites that have no robots.txt. It would literally just be a way to filter out scammers and noise.
    
    3 votes
    
    [15]
    onyxleopard
    April 7, 2020
    
    There is tons of noise on the web.
    
    Sure, but noise to me may not be noise to you, and giving a central entity the power to make that determination is problematic (this is one reason not to like Google’s monopoly on search). There have been attempts to do this in a decentralized way, such as DMOZ and WOT. They all fell short because they were largely manually curated and couldn’t scale.
    
    [2]
    Wes
    April 7, 2020
    
    I'm not sure DMOZ really fell short. It just became outdated. It stopped making sense to find websites by category then alphabetical listing when you could simply ask for what you wanted.
    
    onyxleopard
    April 7, 2020
    
    It stopped making sense to find websites by category then alphabetical listing when you could simply ask for what you wanted.
    
    It fell short because it couldn’t compete with Google (and Lycos, Yahoo!, ...). It makes total sense. It’s just less convenient.
    
    [12]
    bloup
    April 8, 2020
    
    Again, I'm not talking about "manually curating" anything. My idea is pretty much just an "opt-in" search engine instead of opt-out. Being listed would not make any guarantees about the site.
    
    [11]
    onyxleopard
    April 8, 2020
    
    How does opting in solve the problem? Why would any site owner not opt in?
    
    How is this different than a centralized version of robots.txt?
    
    [10]
    bloup
    April 8, 2020
    
    Because unless you are a noncommercial entity (like a nonprofit organization or just a person who has made a website as a hobby pretty much) it would cost money like I mentioned above. Nothing prohibitively expensive, just enough to make it so it would be unproductive to try and make money off of scams and parked domains and the like.
    
    [9]
    onyxleopard
    April 8, 2020
    
    Because unless you are a noncommercial entity (like a nonprofit organization or just a person who has made a website as a hobby pretty much) it would cost money like I mentioned above.
    
    How do you determine and enforce this, other than checking the application for each site manually? Are you going to check each site on an ongoing basis to see if it changed ownership? This just sounds like a lot of work, and the fees for commercial sites would have to be enormous to pay for all that work.
    
    [8]
    bloup
    April 8, 2020
    
    A terms of service, volunteer reporting, and a legal department, like pretty much every other service on the internet.
    
    [7]
    onyxleopard
    April 8, 2020
    
    No other service on the internet purports to police the rest of the web, though. It’s too big a job for a centralized entity to take on. Let’s say some startup in Tibet makes a website and applies to register. The CCP files a complaint saying they are a scam site. Does your legal department have anyone who reads Tibetan? Is your legal department going to fight the CCP every time they file a complaint against sites that they don’t like for political reasons? Replace the CCP with any other entity with corrupt intentions, and Tibetan with any other minority language. You’d basically just be paying huge sums of money to lawyers and consultants to get things right, or you’d end up with a broken system that nobody would use.
    
    [6]
    bloup
    April 8, 2020
    
    I don't understand how this would be any more "policing the web" than Google punishing certain websites for gaming their algorithm in various ways.
    
    You also wouldn't have to "fight" the CCP. People would be reporting web sites for violating your terms of service. You could just ignore every single complaint without issue if you wanted to (and in fact straight up ignoring suspected bad faith actors altogether is probably a smart move). You don't owe anyone anything, here. On the contrary, for each legitimate complaint you do find where there is someone clearly violating the agreed terms of usage and abusing the public service, you have grounds to file a lawsuit. This also does not just include scammers, but people trying to circumvent paying for a commercial listing.
    
    Like seriously, this would just be yet another public internet platform with some simple rules you have to follow if you want to use it. Only it would be run by a public organization instead of a commercial business. We are just talking about indexing websites, here. This is all really quite sensible.
    
    [5]
    onyxleopard
    April 8, 2020
    
    I don't understand how this would be any more "policing the web" than Google punishing certain websites for gaming their algorithm in various ways.
    
    It’s not much different except that Google is doing it all automatically, not manually.
    
    People would be reporting web sites for violating your terms of service.
    
    And what do you do when bad actors effectively DDOS your reporting system with spurious claims?
    
    This is all really quite sensible.
    
    I’m not arguing it’s not sensible. I’m saying it’s not feasible.
    
    [4]
    bloup
    April 8, 2020 (edited April 8, 2020)
    
    what do you do when bad actors effectively DDOS your reporting system with spurious claims?
    
    you ignore them.
    
    And "feasible" and "sensible" mean almost the same exact thing.
    
    By your logic, internet journalism is a giant contradiction since anyone with a botnet (like people who don't want the truth to get out) can literally DDoS your website.
    
    Also, I don't really understand why this organization couldn't just automate processes too, just like Google is. Like seriously, how did we get to a point where this theoretical organization is so successful that literal states are devoting resources to hassling it and influencing what it does, but also somehow does not have the resources to develop its own automated tools?
    
    [3]
    onyxleopard
    April 8, 2020 (edited April 8, 2020)
    
    you ignore them.
    
    How do you determine which to ignore?
    
    And "feasible" and "sensible" mean almost the same exact thing.
    
    Absolutely not! Feasibility has to do with the realities of how difficult something is. Sensibility has to do with whether something makes sense. There are plenty of sensible, infeasible things. E.g., curing cancer, sending humans to explore other planets, stopping pollution on earth, etc.
    
    There are also plenty of nonsensical, feasible things. E.g., running into traffic, intentionally clogging a sink, eating feces, etc.
    
    By your logic, internet journalism is a giant contradiction since anyone with a botnet (like people who don't want the truth to get out) can literally DDoS your website.
    
    There are solutions to this, like CloudFlare. But your proposed system needs to be able to take in claims and classify them. Without throwing the proverbial baby out with the bath water, this will be as difficult as bad actors decide to make it. Compare to YouTube’s takedown claims system. It’s been largely automated, because it’s infeasible to manually check every claim or upload that gets automatically flagged. Even so, it’s very expensive, not particularly useful to users, and to my knowledge YouTube is basically still subsidized by Google.
    
    Edit: By DDOS here I don’t mean deny service on a technical level. I just mean that bad actors will make so many spurious claims that you won’t be able to find the legitimate claims amongst all the noise.
    
    [2]
    bloup
    April 8, 2020
    
    At this point I just think you have absolutely no imagination if you seriously think I am proposing something fantastic here. craigslist.org must blow your mind.
    
    onyxleopard
    April 9, 2020
    
    craigslist.org must blow your mind.
    
    Craigslist is nice, but it’s small and all manually driven. The web is several orders of magnitude larger in scale than Craigslist. People joke about “web-scale” in the technosphere, but it’s actually a thing that few organizations are capable of dealing with.