Spiders
Is anyone here familiar with crawling the web? I’m interested in broad crawling, rather than focusing on particular sites. I’d appreciate pretty much any information about how this is usually done, and things to watch out for if attempting it.
First, do you need to do the crawl at all? Common Crawl has data from a large-scale crawl that you can just download.
Disclaimer
As a disclaimer, I don't have personal experience with broad crawling, but I've done some research on it in the past, as well as small-scale scraping, so here's what I remember.
Process
Generally, you start with a list of seed URLs and put them in a queue. The queue can be a simple in-memory data structure or held in a database. Larger deployments will shard this queue by domain or by worker. For example, you could use rendezvous hashing to assign URLs to available workers.
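To make the sharding concrete, here's a minimal sketch of rendezvous (highest-random-weight) hashing over a URL's domain. The worker ids are made up, and a real deployment would layer this on top of whatever queue store you actually use:

```python
import hashlib
from urllib.parse import urlparse

def assign_worker(url, workers):
    """Rendezvous (highest-random-weight) hashing: hash the URL's domain
    together with each worker id and pick the worker with the highest score,
    so every URL from a given domain lands on the same worker."""
    domain = urlparse(url).netloc

    def score(worker):
        digest = hashlib.sha256(f"{worker}|{domain}".encode()).digest()
        return int.from_bytes(digest, "big")

    return max(workers, key=score)

# Hypothetical worker ids; adding or removing one only reassigns
# the domains that happened to hash to it.
workers = ["crawler-1", "crawler-2", "crawler-3"]
print(assign_worker("https://example.com/page", workers))
```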
Then, in a loop: pop a URL from the queue, fetch it, parse the response, extract any links, and push the new URLs back onto the queue.
Parsing is the hard part. You'll probably want to remove irrelevant page content (such as ads); doing this consistently in a broad crawl is hard, though, because no two websites are exactly alike. Alternatively, if you know what you're looking for from the start, extract only that information and discard everything else.
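To make the loop concrete, here's a rough sketch of the fetch/extract cycle using only the Python standard library. A real broad crawler would add robots.txt checks, rate limiting, retries, and persistent storage; the page limit here is arbitrary:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)   # in-memory queue; a real crawler would shard this
    seen = set(seed_urls)      # avoid re-enqueueing the same URL
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue           # skip URLs that fail to fetch
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        yield url, html        # hand the page off for extraction/indexing
```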
robots.txt
You'll probably want to fetch and parse robots.txt, which specifies whether the site allows you to crawl it at all. Unfortunately, many sites only allow the big search engines and prevent anyone else from crawling; you can ignore this, but it's considered impolite.
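Python's standard library happens to ship a robots.txt parser, so a check could look roughly like this (the user-agent string is just a placeholder):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler/0.1"   # hypothetical user-agent string

def allowed(url):
    """Fetch the site's robots.txt and check whether our user agent may crawl `url`."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return True   # no readable robots.txt: assume crawling is allowed
    return rp.can_fetch(USER_AGENT, url)

print(allowed("https://example.com/some/page"))
```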
Rate Limiting
You need to make sure not to hit a given website too hard (robots.txt sometimes specifies the limits here) or they may just block you. Amazon is especially good at this -- if you're not taking special precautions (proxy servers, etc.), they will ban you within minutes.
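A crude way to enforce per-domain politeness is to remember when you last hit each domain and sleep until a delay has elapsed. The default delay below is arbitrary, and the crawl_delay argument could be fed from the robots.txt parser's crawl_delay() value:

```python
import time
from urllib.parse import urlparse

DEFAULT_DELAY = 5.0      # hypothetical default politeness delay, in seconds
last_fetch = {}          # domain -> monotonic time of the last request

def wait_for_slot(url, crawl_delay=None):
    """Sleep until at least `crawl_delay` (or the default) has passed since
    the previous request to this URL's domain."""
    domain = urlparse(url).netloc
    delay = crawl_delay if crawl_delay is not None else DEFAULT_DELAY
    elapsed = time.monotonic() - last_fetch.get(domain, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    last_fetch[domain] = time.monotonic()
```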
Ranking
If you're doing a broad crawl, you need some kind of search/ranking setup. In the simple case you can just chuck things into Elasticsearch, but there are also more sophisticated approaches, like Google's PageRank. The latter can be reasonably well approximated by AOPIC.
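For a rough idea of how AOPIC works, here's a simplified sketch of its bookkeeping. It leaves out the virtual page the real algorithm uses to handle dangling links and periodic re-crawls, so treat it as an illustration rather than a faithful implementation:

```python
from collections import defaultdict

# AOPIC-style bookkeeping: every known page holds some "cash"; crawling a
# page banks its cash into its history and redistributes the cash evenly
# to its outlinks. Pages that accumulate history fastest are (roughly)
# the most important ones.
cash = defaultdict(float)
history = defaultdict(float)

def seed(urls):
    for url in urls:
        cash[url] = 1.0 / len(urls)

def on_crawled(url, outlinks):
    """Call after fetching `url` and extracting its outlinks."""
    banked = cash[url]
    history[url] += banked
    cash[url] = 0.0
    if outlinks:
        share = banked / len(outlinks)
        for link in outlinks:
            cash[link] += share

def next_url():
    """Greedy policy: crawl the page currently holding the most cash."""
    return max(cash, key=cash.get) if cash else None
```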
Outro
I realize this is a bit scatterbrained; maybe I can go into more detail if you have specific questions.
In broad crawls, how do you deal with sites that are potentially infinite depth? I can imagine certain things like calendars would have links to ever increasing dates. Further, I think I could probably create a site to attack crawlers like this - but can’t think of a way to mitigate the attack if we’re just storing addresses in a database and later visiting them (always starting the crawl at the second level address would help some though). Do we need another spider dedicated to removing things like this from the database?
Yes, there are indeed sites that attack crawlers like that; they are sometimes known as HTTP tarpits or spider traps.
The process I described in my comment above is simpler than what you would use in practice. For example, you would usually not crawl a URL that has been crawled recently unless it is known to change frequently; in general, you'll want to estimate how useful a given URL is likely to be before actually crawling it.
This is a classic explore/exploit tradeoff that can be beautifully modeled as a multi-armed bandit problem. There was a great paper I read a few years ago that described such a system in detail, but my Google-Fu is failing me right now. I'll edit this post if one of the hits it did spit out turns out to be relevant.
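As a toy illustration of the bandit framing (not from that paper), you could treat each domain as an arm and pick the next domain to crawl with something as simple as epsilon-greedy; the reward is whatever "usefulness" measure you settle on after crawling a page:

```python
import random

EPSILON = 0.1   # hypothetical exploration rate
totals = {}     # domain -> (sum of rewards, number of pulls)

def pick_domain(domains):
    """Mostly crawl the domain that has paid off best so far,
    but occasionally explore an unknown or random one."""
    unexplored = [d for d in domains if d not in totals]
    if unexplored or random.random() < EPSILON:
        return random.choice(unexplored or domains)
    return max(domains, key=lambda d: totals[d][0] / totals[d][1])

def record(domain, reward):
    s, n = totals.get(domain, (0.0, 0))
    totals[domain] = (s + reward, n + 1)
```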
Judging whether to crawl a URL can include many different factors, such as how unique the content on that page is on average (too unique = random garbage, not unique = not worth indexing), how many pages in total you have already crawled on that domain, how many inbound links to that URL you already have in your database, whether the URL contains a valid date or a high-entropy string, and any other factors you can think of. If you're using AOPIC, you'll also have a natural estimate of page importance you can use.
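Just to illustrate, a priority score combining a few of those signals might look like the sketch below. The weights and the hex-string heuristic are entirely made up and would need tuning against real data:

```python
import math
import re
from urllib.parse import urlparse

def url_priority(url, pages_on_domain, inbound_links):
    """Hypothetical priority score: more inbound links raise it, while a
    crowded domain and high-entropy path segments (likely auto-generated
    IDs) lower it."""
    score = math.log1p(inbound_links)
    score -= 0.01 * pages_on_domain
    path = urlparse(url).path
    # Penalize long hex/ID-looking segments, a cheap stand-in for "high entropy".
    if re.search(r"[0-9a-f]{16,}", path, re.IGNORECASE):
        score -= 2.0
    return score
```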
Edit: I didn't find the multi-armed bandit paper, but I came across http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf, which has a nice in-depth introduction to web crawling.
Thanks, this helped a lot. I think I’ve settled on using dumb, polite crawlers and a dedicated content removal service - due to how I want to store my data, choosing what to remove should be easy enough.
I don't have experience with broad crawling, but I can imagine at least two strategies to deal with this:
One might consider deduplication to be something you perform as a post-process (e.g., the way this paper describes post-processing Common Crawl), but I think attempting some simple deduplication as you go might work better.
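As a sketch of dedup-as-you-go: hash a normalized version of each page's text and skip pages whose hash you've already seen. For near-duplicates you'd want something like SimHash or MinHash instead of an exact hash:

```python
import hashlib
import re

seen_hashes = set()

def is_duplicate(page_text):
    """Exact-duplicate check on whitespace-normalized, lowercased text.
    Returns True if an identical page has already been recorded."""
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```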