Batch-saving websites for offline viewing
Anybody here have a good setup for batch-downloading articles/news from several sites you specify, similar to youtube-dl but for general websites? I'm sure it could be scripted without too much effort, but I'm interested in what polished solutions are out there.
The idea is that people with only occasional internet access could go to a hotspot weekly or so and sync that week's worth of content.
I'm not sure if it's useful for batch tasks, but I believe HTTrack is highly regarded.
I'll vouch for HTTrack.
Used it to download the entire Stardew Valley wiki onto my laptop for my last deployment in the Navy. Worked like a charm, much farming was done.
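For reference, a basic HTTrack mirror run looks roughly like the following; the wiki URL and output directory are placeholders, not the actual command used:

    httrack "https://wiki.example.org/" -O ./wiki-mirror "+*.example.org/*" -v

The -O flag sets the output directory and the "+*.example.org/*" filter keeps the crawl on that domain; HTTrack rewrites links as it goes, so the mirror can be browsed straight from disk.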
I prefer wget over curl for this.
I made a small script that you pass a URL to, and it downloads everything below that URL recursively without crawling up into the parent levels. The relevant part of that script is a single wget invocation.
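A minimal sketch of what that invocation could look like - the exact flags here are an assumption, with "$1" standing in for the URL passed to the script:

    # Mirror everything at or below the given URL; --no-parent stops wget
    # from climbing into parent directories.
    wget --recursive \
         --no-parent \
         --page-requisites \
         --convert-links \
         --adjust-extension \
         --wait=1 --random-wait \
         "$1"

--no-parent is what keeps the crawl from going above the starting URL; --page-requisites and --convert-links pull in the images/CSS each page needs and rewrite links so the saved copy browses cleanly offline, and the wait options are just there to be polite to the server.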
I wanted to poke through a website while on vacation some 8 or so years ago, and at the time there was a Windows application that would download everything.
That said, websites back then just referenced code files and linked to image resources - there wasn't all the CMS/database stuff that requires you to rewrite paths to suit the new device's folder setup.
I'm probably wrong, but that's my best guess as to why you can't download entire sites for offline viewing. They'd probably need to be packaged for you by the site admin.
I actually started working on this project just this week - I stumbled across the "How to build a solar-powered website" article on HackerNews a week ago, read a few other articles, saw a mention of RuralCafe, grabbed the paper, and I'm now trying to hash out the structure of the project.
TL;DR of the paper: India, 2008, a crappy intermittent internet connection - 256 kbps shared between hundreds of students at universities - and RuralCafe was developed to address this. It has two components, a local proxy and a remote proxy. The local proxy collects searches and batches them so they can be communicated - via sneakernet if necessary - to the remote proxy, which then runs the searches, prefetching the linked pages and possibly media as needed, and delivers the results back.
My idea, roughly, is to build a remote proxy that grabs search results out of SearX and fetches the linked pages through site-specific scrapers (using APIs when available) or a generic readability script. There's more I want to do, like extracting summaries - I spent this whole morning getting distracted by that - and applying lossy compression, but that's the gist of it.
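As a rough sketch of what that fetch step might look like in shell: the SearX instance URL and output directory below are placeholders, the instance has to have JSON output enabled, and the site-specific scraper / readability part is reduced to a plain wget fallback.

    #!/bin/sh
    # Sketch of the remote-proxy fetch step: ask a SearX instance for results,
    # then grab each hit so the whole batch can be handed back (or sneakernetted).
    SEARX="https://searx.example.org"   # placeholder instance
    QUERY="$1"
    OUT="batch-$(date +%F)"
    mkdir -p "$OUT"

    curl -s -G "$SEARX/search" \
         --data-urlencode "q=$QUERY" \
         --data-urlencode "format=json" \
      | jq -r '.results[].url' \
      | while read -r url; do
          # Generic fallback: the page plus its requisites; a site-specific
          # scraper, API call, or readability pass would slot in here instead.
          wget --page-requisites --convert-links --adjust-extension \
               --directory-prefix="$OUT" "$url"
        done

The batch directory is then what would get synced or carried back to the local proxy.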
My use cases:
See use cases for wget.