Does anyone have experience with tools for locally archiving the web, like ArchiveBox for example?
I found myself on the ArchiveBox website earlier today. After reading some of it, that's the kind of program I could use. The ephemeral nature of the web is bothersome; so much content is lost for one reason or another. ArchiveBox seems to be one of the most popular tools, and it can automatically mirror my locally downloaded websites to archive.org, which is great. It seems complex though, maybe more complex than I usually tolerate these days, which is why I'm asking whether anyone has personal experience with ArchiveBox or other similar programs. Do you find them useful and reliable? Have you ever found in your local storage a webpage you really liked that was gone from the web? How's your setup?
Thanks ;)
ArchiveBox is great, but I wouldn't rely on it, at least not in its current release state.
I've had issues with zombie Chromium instances stacking up during archive jobs, slowing my server down to an unresponsive halt and forcing me to do a hard restart.
The project's dev, Pirate, is pretty fast and loose with the codebase: he pushes broken and WIP code to the main repo branch instead of creating feature branches as he works on things. So don't jump the gun and run :dev to see if a given issue has been fixed. I learned this the hard way. :\
With all of that said, it's a neat system and it's easy to spin up using their Docker container. For now, though, I would just follow the project on GitHub so you get emails as Pirate pushes RC releases; that way you can see how the project is progressing and catch the eventual 0.8.4 release when it goes live. He's been swapping out a lot of the inner workings, so a lot of the user grievances should be remedied.
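For reference, the Docker route is roughly a compose file along these lines. This is a sketch from memory — the service name, port, and volume path here are assumptions, so check the project's README for the current official compose file:

```yaml
# Minimal docker-compose sketch for ArchiveBox (layout assumed, verify upstream)
services:
  archivebox:
    image: archivebox/archivebox
    ports:
      - "8000:8000"    # web UI
    volumes:
      - ./data:/data   # the archive collection lives here
```

Then something like `docker compose up -d` should bring up the web UI on port 8000.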
I can second this comment. The tool (or collection of tools, really) is great but still a work in progress. There are quite a few "current" releases, none of them truly stable.
I've spent countless hours in the past month or two experimenting with different release versions through Docker, and I haven't yet had it up and running doing everything I want (pulling my pinboard.in feed and others daily, not having Chrome fail in the background, etc.). I've had fun doing it, to be sure, but it's not quite there yet. I really love the full-text search (Sonic) for drilling into an archive, and the ability to take subsequent snapshots over time.
My goal is to have it run silently and dependably on a Synology or old Mac Mini, but that's still a work in progress due mostly to weird bugs.
https://github.com/gildas-lormeau/SingleFile
I use SingleFile to occasionally create local copies of web pages I want to save. It's designed to work well even on JavaScript-heavy websites, and captures a copy of the page as you are currently viewing it, from inside the browser. I'm pretty sure it has a setting to automatically archive every page you visit, with additional options for excluding specific domains from automatic capture, and more. I don't use any of that though.
It also has the ability to capture multiple tabs at the same time, which is incredibly useful.
And when I say SingleFile "captures a copy of the page from inside the browser", what I mean is this:
If I go to https://www.reddit.com, I see the same old interface as if I went to https://old.reddit.com, because in my account's settings I have disabled the new UI.
If I were to use ArchiveBox, cURL, or Wget to get a copy of reddit.com, it would by default return a copy of the frontpage with the new UI and me logged out, because those tools don't pass login information or cookies by default. SingleFile does. If I save a tab using SingleFile, it saves it exactly as I'm looking at it. No futzing around with cookies.
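To illustrate the difference: with curl you'd have to export your browser cookies and pass them along yourself before the saved page comes close to what you see logged in (cookies.txt here is a hypothetical export in curl's cookie-file format, and reddit may block non-browser clients regardless):

```shell
# Without cookies: curl fetches the logged-out frontpage (new UI)
curl -sL -o frontpage-anon.html https://www.reddit.com/

# With cookies exported from the browser (hypothetical cookies.txt):
# closer to what you actually see while logged in
curl -sL -b cookies.txt -o frontpage-me.html https://www.reddit.com/
```

SingleFile skips all of this because it serializes the DOM your browser has already rendered, session and all.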
It also has a command-line version.
Cool!
What does that mean?
Not sure. Something to do with ensuring the copy of a web page you saved wasn't modified, maybe? The help page for the extension says this:
I don't use it heavily (and probably should use it more), but I run an install of Linkwarden and it's pretty slick.
It's more of a bookmarking app that also saves offline versions of the pages you add. So if you're intending to pull multiple pages from one site, ArchiveBox may be a better fit.
But if this meets your needs, it has an iOS app now, which is useful.
There are some similar self-hosted apps in this space, like Wallabag. Linkwarden is my favorite to date, though.
I'm hosting my own ArchiveBox instance, but it's not really the Archive.org clone I wish it was. It's nice to take snapshots of things I don't wanna lose, but Archive.org is way more advanced in what they do. I still dream of a proper self-hosted Internet Archive.
Sometimes I also crawl and copy whole websites with wget; you can try that and see how it goes. Something like this:
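For example, with wget's standard mirroring flags (the target URL is just a placeholder — swap in the site you want):

```shell
# Common wget mirroring flags:
#   --mirror            recursion + timestamping (shorthand for -r -N -l inf)
#   --convert-links     rewrite links so the local copy browses offline
#   --adjust-extension  append .html etc. so pages open in a browser
#   --page-requisites   also grab the CSS/images/scripts each page needs
#   --no-parent         don't wander above the starting directory
wget --mirror --convert-links --adjust-extension \
     --page-requisites --no-parent \
     https://example.com/
```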
More info here: https://simpleit.rocks/linux/how-to-download-a-website-with-wget-the-right-way/
I can tell you that I tried a whole bunch of things recently after a few years out of the game, lol, and it turns out that index.html will not get you very far in 2024; so much of any site's functionality happens on Someone Else's Computer (thanks, The Cloud), so you'll typically end up with some giant .js blobs that don't do much on their own... if there's another way, pls someone tell me what it is!
httrack is really good. It's used for making an "offline web."
Last updated in .... 2017. Yikes 😓
the last commit was 7 months ago
Also, not really something that needs to be bleeding edge, in my opinion. It makes HTTP requests and formats the results. It's like curl/wget with crawlers for images, backlinks, etc.