Does anyone have experience with tools for locally archiving the web, like ArchiveBox for example?
I found myself on the ArchiveBox website earlier today. After reading some of it, that's the kind of program I could use. The ephemeral nature of the web is bothersome; so much content is lost for one reason or another. ArchiveBox seems to be one of the most popular tools, and it can automatically mirror my locally downloaded websites to archive.org, which is great. It seems complex though, maybe more complex than I usually tolerate these days, which is why I'm asking whether anyone has personal experience with ArchiveBox or other similar programs. Do you find them useful and reliable? Have you ever found in your local storage a webpage you really liked that was gone from the web? How's your setup?
Thanks ;)
ArchiveBox is great, but I wouldn't rely on it, at least not in its current release state.
I've had issues with zombie Chromium instances stacking up during archive jobs, slowing my server down to an unresponsive halt and forcing me to do a hard restart.
The project's dev, Pirate, is pretty fast and loose with the codebase: he pushes broken and WIP code to the main repo branch instead of creating feature branches as he works on things. So don't jump the gun and run :dev to see if a given issue has been fixed. I learned this the hard way. :\
With all of that said, it's a neat system and it's easy to spin up using their Docker container. For now, though, I would just follow the project on GitHub so you get emails as Pirate pushes RC releases; that way you can see how the project is progressing and catch the eventual 0.8.4 release when it goes live. He's been swapping out a lot of the inner workings, so a lot of the user grievances should be remedied.
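For reference, the Docker route is roughly a compose file along these lines. This is a sketch from memory — the service name, port, and volume path here are assumptions, so check the project's README for the current official compose file:

```yaml
# Minimal docker-compose sketch for ArchiveBox (layout assumed, verify upstream)
services:
  archivebox:
    image: archivebox/archivebox
    ports:
      - "8000:8000"    # web UI
    volumes:
      - ./data:/data   # the archive collection lives here
```

Then something like `docker compose up -d` should bring up the web UI on port 8000.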
I can second this comment. The tool (or collection of tools, really) is great but still a work in progress. There are quite a few "current" releases, none of them truly stable.
I've spent countless hours in the past month or two experimenting with different release versions through Docker, and I haven't yet had it up and running doing everything I want (pulling my pinboard.in feed and others daily, not having Chrome fail in the background, etc.). I've had fun doing it, to be sure, but it's not quite there yet. I really love the full-text search (Sonic) for drilling into an archive, and the ability to take subsequent snapshots over time.
My goal is to have it run silently and dependably on a Synology or old Mac Mini, but that's still a work in progress due mostly to weird bugs.
https://github.com/gildas-lormeau/SingleFile
I use SingleFile to occasionally create local copies of web pages I want to save. It's designed to work well even on JavaScript-heavy websites, and captures a copy of the page as you are currently viewing it, from inside the browser. I'm pretty sure it has a setting to automatically archive every page you visit, with additional options for excluding specific domains from automatic capture, and more. I don't use any of that though.
It also has the ability to capture multiple tabs at the same time, which is incredibly useful.
And when I say SingleFile "captures a copy of the page from inside the browser", what I mean is this:
If I go to https://www.reddit.com, I see the same old interface as if I went to https://old.reddit.com, because in my account's settings I have disabled the new UI.
If I were to use ArchiveBox, cURL, or Wget to get a copy of reddit.com, it would by default return a copy of the frontpage with the new UI and me logged out, because those tools don't pass login information or cookies by default. SingleFile does. If I save a tab using SingleFile, it saves it exactly as I'm looking at it. No futzing around with cookies.
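To illustrate the difference: with curl you'd have to export your browser cookies and pass them along yourself before the saved page comes close to what you see logged in (cookies.txt here is a hypothetical export in curl's cookie-file format, and reddit may block non-browser clients regardless):

```shell
# Without cookies: curl fetches the logged-out frontpage (new UI)
curl -sL -o frontpage-anon.html https://www.reddit.com/

# With cookies exported from the browser (hypothetical cookies.txt):
# closer to what you actually see while logged in
curl -sL -b cookies.txt -o frontpage-me.html https://www.reddit.com/
```

SingleFile skips all of this because it serializes the DOM your browser has already rendered, session and all.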
It also has a command-line version.
Cool!
What does that mean?
Not sure. Something to do with ensuring the copy of a web page you saved wasn't modified, maybe? The help page for the extension says this:
I don't use it heavily (and probably should use it more), but I run an install of Linkwarden and it's pretty slick.
It's more of a bookmarking app that also saves offline versions of the pages you add. So if you're intending to pull multiple pages from one site, ArchiveBox may be a better fit.
But if this meets your needs, it has an iOS app now, which is useful.
There are some similar self-hosted apps in this space, like Wallabag. Linkwarden is my favorite to date, though.
I'm hosting my own ArchiveBox instance, but it's not really the Archive.org clone I wish it was. It's nice to take snapshots of things I don't wanna lose, but Archive.org is way more advanced in what they do. I still dream of a proper self-hosted Internet Archive.
Sometimes I also crawl and copy whole websites with wget; you can try that and see how it goes. Something like this:
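For example, with wget's standard mirroring flags (the target URL is just a placeholder — swap in the site you want):

```shell
# Common wget mirroring flags:
#   --mirror            recursion + timestamping (shorthand for -r -N -l inf)
#   --convert-links     rewrite links so the local copy browses offline
#   --adjust-extension  append .html etc. so pages open in a browser
#   --page-requisites   also grab the CSS/images/scripts each page needs
#   --no-parent         don't wander above the starting directory
wget --mirror --convert-links --adjust-extension \
     --page-requisites --no-parent \
     https://example.com/
```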
More info here: https://simpleit.rocks/linux/how-to-download-a-website-with-wget-the-right-way/
I can tell you that I tried a whole bunch of things recently after a few years out of the game, lol, and it turns out that index.html will not get you very far in 2024; so much of any site's functionality happens on Someone Else's Computer (thanks, The Cloud), so you'll typically end up with some giant .js blobs that don't do much on their own... if there's another way, pls someone tell me what it is!
httrack is really good. It's used for making an "offline web."
Last updated in .... 2017. Yikes 😓
the last commit was 7 months ago
Also, not really something that needs to be bleeding edge, in my opinion. It makes HTTP requests and formats the results. It's like curl/wget with crawlers for images, backlinks, etc.