12 votes

Batch-saving websites for offline viewing

Anybody here have a good setup for batch-downloading articles/news from several sites you specify, similar to youtube-dl but for general websites? I'm sure it could be scripted without too much effort, but I'm interested in what polished solutions are out there.

The idea would be so people with rare internet access could go to a hotspot weekly or something and sync that week's worth of content.

6 comments

  1. stuntaneous
    I'm not sure if it's useful for batch tasks but I believe HTTrack is highly regarded.

    6 votes
    1. cain
      I'll vouch for HTTrack.

      Used it to download the entire Stardew Valley wiki onto my laptop for my last deployment in the Navy. Worked like a charm, much farming was done.
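
      For anyone curious, a mirror like that might look roughly like this on HTTrack's command line (the URL, output directory, and filter here are illustrative assumptions, not the exact invocation used):

      httrack "https://stardewvalleywiki.com/" -O ./stardew-wiki "+*.stardewvalleywiki.com/*" -r6
      # -O    where to put the mirror locally
      # +...  only follow links that stay on the wiki
      # -r6   limit the mirror depth to 6 levels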

      2 votes
  2. bhrgunatha
    I prefer wget over curl for this.

    I made a small script that you pass a URL, and it downloads everything below that URL recursively without crawling the parent levels. The relevant part of the script is:

    URL="$1"
    
    wget -r -l inf -np -nH -U "Mozilla/5.0 Gecko Firefox/60.0" \
    -k -c -N -w1 --no-check-certificate -e robots=off --random-wait "$URL"
    # -r                      recursive
    # -l inf                  all sub directories
    # -np                     do NOT crawl parent directories
    # -nH                     do NOT create host directories locally
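    # -U                      send this User-Agent string (pretend to be Firefox)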
    # -k                      convert links (to refer to local directories)
    # -c                      continue (script is restartable)
    # -N                      only retrieve files when newer than local file
    # -w1                     wait interval is 1 second
    # --random-wait           random wait between 0..2 seconds (2 * wait interval)
    # -e robots=off            ignore robots.txt (naughty)
    # --no-check-certificate   live dangerously
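
    To batch that over several sites, as the original post asks, a rough sketch (assuming the snippet above is saved as fetch.sh and urls.txt holds one start URL per line):

    #!/bin/sh
    # Mirror every start URL listed in urls.txt, one after another.
    while IFS= read -r url; do
        [ -n "$url" ] && ./fetch.sh "$url"
    done < urls.txt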
    
    3 votes
  3. Suppercutz
    I wanted to poke through a website while on vacation some 8 or so years ago, and there was a Windows application at the time that downloaded everything.

    That said, websites back then just referenced code files and linked to image resources - there wasn't all the CMS and database-driven stuff that requires rewriting paths to suit the new device's folder setup.

    I'm probably wrong, but that's my best guess on why you can't download entire sites for offline viewing. They'd probably need to be packaged for you by the site admin.

    1 vote
  4. mftrhu
    > The idea would be so people with rare internet access could go to a hotspot weekly or something and sync that week's worth of content.

    I actually started to work on this project just this week - I stumbled across the How to build a solar-powered website article on HackerNews a week ago, read a few other articles, saw a mention of RuralCafe, grabbed the paper and I'm now trying to hash out the structure of the project.

    TL;DR of the paper: India, 2008, a crappy intermittent internet connection, 256 kbps shared between hundreds of students at universities - RuralCafe was developed to address this. It has two components, a local proxy and a remote proxy. The local proxy takes the searches and batches them so they can be communicated - via sneakernet if necessary - to the remote proxy. The remote proxy then runs the searches, prefetching the linked pages and possibly media as needed, and delivers the results back.

    My idea, roughly, is to build a remote proxy that grabs search results out of SearX, fetching the linked pages through site-specific scrapers (using APIs when available) or a generic readability script. There's more I want to do, like extracting summaries - I spent this whole morning getting distracted by it - and applying lossy compression, but this is the gist of it.
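
    A very rough sketch of that remote-proxy fetch step in shell (the SearX instance URL is a placeholder, and it assumes the instance exposes the JSON API via format=json):

    #!/bin/sh
    # Search a SearX instance and mirror the top results into ./batch,
    # ready to be shipped back to the local proxy in one go.
    QUERY="$1"
    SEARX="https://searx.example.org"   # placeholder instance

    curl -sG "$SEARX/search" --data-urlencode "q=$QUERY" --data-urlencode "format=json" \
        | jq -r '.results[].url' \
        | head -n 10 \
        | while IFS= read -r url; do
              # -p page requisites, -k convert links, -E add .html extension
              wget -p -k -E -nv -P ./batch "$url"
          done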

    My use cases:

    • Fire-and-forget, low-priority queries, to hopefully reduce the number of open tabs and distractions on my desktop;
    • Internet access on mobile, where I sometimes get speeds as low as 3 kbps and huge packet loss (when the network is available, that is);
    • Internet access through VPNs - a friend of mine lives in a country that censors the Internet, where services like Telegram are available but only through VPNs at low (~25 kbps) speeds.
    1 vote