Spiders
Is anyone here familiar with crawling the web? I’m interested in broad crawling, rather than focusing on particular sites. I’d appreciate pretty much any information about how this is usually done, and things to watch out for if attempting it.
First, do you need to do the crawl at all? Common Crawl has data from a large-scale crawl that you can just download.
Disclaimer
As a disclaimer, I don't have personal experience with broad crawling, but I've done some research on it in the past, as well as small-scale scraping, so here's what I remember.
Process
Generally, you start with a list of seed URLs and put them in a queue. The queue can be a simple in-memory data structure or held in a database. Larger deployments will shard this queue by domain or by worker. For example, you could use rendezvous hashing to assign URLs to available workers.
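To make the sharding concrete, here's a minimal sketch of rendezvous (highest-random-weight) hashing over a URL's domain. The worker ids are made up, and a real deployment would layer this on top of whatever queue store you actually use:

```python
import hashlib
from urllib.parse import urlparse

def assign_worker(url, workers):
    """Rendezvous (highest-random-weight) hashing: hash the URL's domain
    together with each worker id and pick the worker with the highest score,
    so every URL from a given domain lands on the same worker."""
    domain = urlparse(url).netloc

    def score(worker):
        digest = hashlib.sha256(f"{worker}|{domain}".encode()).digest()
        return int.from_bytes(digest, "big")

    return max(workers, key=score)

# Hypothetical worker ids; adding or removing one only reassigns
# the domains that happened to hash to it.
workers = ["crawler-1", "crawler-2", "crawler-3"]
print(assign_worker("https://example.com/page", workers))
```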
Then, in a loop: pop a URL from the queue, fetch it, parse the response, extract any links, and push the new URLs back onto the queue.
Parsing is the hard part. You'll probably want to remove irrelevant page content (such as ads); doing this consistently in a broad crawl is hard, though, because no two websites are exactly alike. Alternatively, if you know what you're looking for from the start, extract only that information and discard everything else.
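To make the loop concrete, here's a rough sketch of the fetch/extract cycle using only the Python standard library. A real broad crawler would add robots.txt checks, rate limiting, retries, and persistent storage; the page limit here is arbitrary:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)   # in-memory queue; a real crawler would shard this
    seen = set(seed_urls)      # avoid re-enqueueing the same URL
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue           # skip URLs that fail to fetch
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        yield url, html        # hand the page off for extraction/indexing
```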
robots.txt
You'll probably want to fetch and parse robots.txt, which specifies whether the site allows you to crawl it at all. Unfortunately, many sites only allow the big search engines and prevent anyone else from crawling; you can ignore this, but it's considered impolite.
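Python's standard library happens to ship a robots.txt parser, so a check could look roughly like this (the user-agent string is just a placeholder):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler/0.1"   # hypothetical user-agent string

def allowed(url):
    """Fetch the site's robots.txt and check whether our user agent may crawl `url`."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return True   # no readable robots.txt: assume crawling is allowed
    return rp.can_fetch(USER_AGENT, url)

print(allowed("https://example.com/some/page"))
```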
Rate Limiting
You need to make sure not to hit a given website too hard (robots.txt sometimes specifies the limits here) or they may just block you. Amazon is especially good at this -- if you're not taking special precautions (proxy servers, etc.), they will ban you within minutes.
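A crude way to enforce per-domain politeness is to remember when you last hit each domain and sleep until a delay has elapsed. The default delay below is arbitrary, and the crawl_delay argument could be fed from the robots.txt parser's crawl_delay() value:

```python
import time
from urllib.parse import urlparse

DEFAULT_DELAY = 5.0      # hypothetical default politeness delay, in seconds
last_fetch = {}          # domain -> monotonic time of the last request

def wait_for_slot(url, crawl_delay=None):
    """Sleep until at least `crawl_delay` (or the default) has passed since
    the previous request to this URL's domain."""
    domain = urlparse(url).netloc
    delay = crawl_delay if crawl_delay is not None else DEFAULT_DELAY
    elapsed = time.monotonic() - last_fetch.get(domain, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    last_fetch[domain] = time.monotonic()
```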
Ranking
If you're doing a broad crawl, you need some kind of search/ranking setup. In the simple case you can just chuck things into Elasticsearch, but there are also more sophisticated approaches, like Google's PageRank. The latter can be reasonably well approximated by AOPIC.
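For a rough idea of how AOPIC works, here's a simplified sketch of its bookkeeping. It leaves out the virtual page the real algorithm uses to handle dangling links and periodic re-crawls, so treat it as an illustration rather than a faithful implementation:

```python
from collections import defaultdict

# AOPIC-style bookkeeping: every known page holds some "cash"; crawling a
# page banks its cash into its history and redistributes the cash evenly
# to its outlinks. Pages that accumulate history fastest are (roughly)
# the most important ones.
cash = defaultdict(float)
history = defaultdict(float)

def seed(urls):
    for url in urls:
        cash[url] = 1.0 / len(urls)

def on_crawled(url, outlinks):
    """Call after fetching `url` and extracting its outlinks."""
    banked = cash[url]
    history[url] += banked
    cash[url] = 0.0
    if outlinks:
        share = banked / len(outlinks)
        for link in outlinks:
            cash[link] += share

def next_url():
    """Greedy policy: crawl the page currently holding the most cash."""
    return max(cash, key=cash.get) if cash else None
```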
Outro
I realize this is a bit scatterbrained; maybe I can go into more detail if you have specific questions.
In broad crawls, how do you deal with sites that are potentially infinite depth? I can imagine certain things like calendars would have links to ever increasing dates. Further, I think I could probably create a site to attack crawlers like this - but can’t think of a way to mitigate the attack if we’re just storing addresses in a database and later visiting them (always starting the crawl at the second level address would help some though). Do we need another spider dedicated to removing things like this from the database?
Yes, there are indeed sites that attack crawlers like that; they are sometimes known as HTTP tarpits or spider traps.
The process I described in my comment above is simpler than what you would use in practice. For example, you would usually not crawl a URL that has been crawled recently unless it is known to change frequently; in general, you'll want to estimate how useful a given URL is likely to be before actually crawling it.
This is a classic explore/exploit tradeoff that can be beautifully modeled as a multi-armed bandit problem. There was a great paper I read a few years ago that described such a system in detail, but my Google-Fu is failing me right now. I'll edit this post if one of the hits it did spit out turns out to be relevant.
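As a toy illustration of the bandit framing (not from that paper), you could treat each domain as an arm and pick the next domain to crawl with something as simple as epsilon-greedy; the reward is whatever "usefulness" measure you settle on after crawling a page:

```python
import random

EPSILON = 0.1   # hypothetical exploration rate
totals = {}     # domain -> (sum of rewards, number of pulls)

def pick_domain(domains):
    """Mostly crawl the domain that has paid off best so far,
    but occasionally explore an unknown or random one."""
    unexplored = [d for d in domains if d not in totals]
    if unexplored or random.random() < EPSILON:
        return random.choice(unexplored or domains)
    return max(domains, key=lambda d: totals[d][0] / totals[d][1])

def record(domain, reward):
    s, n = totals.get(domain, (0.0, 0))
    totals[domain] = (s + reward, n + 1)
```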
Judging whether to crawl a URL can include many different factors, such as how unique the content on that page is on average (too unique = random garbage, not unique = not worth indexing), how many pages in total you have already crawled on that domain, how many inbound links to that URL you already have in your database, whether the URL contains a valid date or a high-entropy string, and any other factors you can think of. If you're using AOPIC, you'll also have a natural estimate of page importance you can use.
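Just to illustrate, a priority score combining a few of those signals might look like the sketch below. The weights and the hex-string heuristic are entirely made up and would need tuning against real data:

```python
import math
import re
from urllib.parse import urlparse

def url_priority(url, pages_on_domain, inbound_links):
    """Hypothetical priority score: more inbound links raise it, while a
    crowded domain and high-entropy path segments (likely auto-generated
    IDs) lower it."""
    score = math.log1p(inbound_links)
    score -= 0.01 * pages_on_domain
    path = urlparse(url).path
    # Penalize long hex/ID-looking segments, a cheap stand-in for "high entropy".
    if re.search(r"[0-9a-f]{16,}", path, re.IGNORECASE):
        score -= 2.0
    return score
```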
Edit: I didn't find the multi-armed bandit paper, but I came across http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf, which has a nice in-depth introduction to web crawling.
Thanks, this helped a lot. I think I’ve settled on using dumb, polite crawlers and a dedicated content removal service - due to how I want to store my data, choosing what to remove should be easy enough.
I don't have experience with broad crawling, but I can imagine at least two strategies to deal with this:
One might consider deduplication to be something you perform as a post-process (e.g., the way this paper describes post-processing Common Crawl), but I think attempting some simple deduplication as you go might work better.
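As a sketch of dedup-as-you-go: hash a normalized version of each page's text and skip pages whose hash you've already seen. For near-duplicates you'd want something like SimHash or MinHash instead of an exact hash:

```python
import hashlib
import re

seen_hashes = set()

def is_duplicate(page_text):
    """Exact-duplicate check on whitespace-normalized, lowercased text.
    Returns True if an identical page has already been recorded."""
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```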