Is there a good S3-compatible datastore for a hobbyist?
I've read nice things about Amazon's S3. There are some compatible implementations from other major vendors like Google and Cloudflare. There are projects that automatically back up and replicate a sqlite database using S3. Some people have backed up Google Photos to S3.
But I've never used any of them. What would be a good way to get started? Amazon or another vendor? (And does this make sense at all?)
My team has been using MinIO as an S3 mock. It's quite simple to install and use, even in CI.
I use minio for my home network. Easy to get up and running, but I haven't done anything advanced.
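If it helps anyone get started: the nice part of the S3-compatible API is that the standard AWS SDK works against MinIO unchanged, apart from pointing it at your own endpoint. Here's a rough sketch with the AWS SDK for JavaScript v3 - the port and the minioadmin credentials are just MinIO's out-of-the-box defaults, and the bucket name is made up (it assumes you've already created the bucket, e.g. in the MinIO console):

```typescript
import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";

// Point the standard AWS SDK at a local MinIO instead of AWS itself.
// Endpoint and credentials below are MinIO's stock defaults.
const s3 = new S3Client({
  region: "us-east-1", // MinIO ignores this, but the SDK requires it
  endpoint: "http://localhost:9000",
  forcePathStyle: true, // MinIO serves buckets as paths, not subdomains
  credentials: {
    accessKeyId: "minioadmin",
    secretAccessKey: "minioadmin",
  },
});

// Write a test object, then read it back.
await s3.send(new PutObjectCommand({
  Bucket: "test-bucket",
  Key: "hello.txt",
  Body: "hello from MinIO",
}));
const res = await s3.send(
  new GetObjectCommand({ Bucket: "test-bucket", Key: "hello.txt" }),
);
console.log(await res.Body?.transformToString());
```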
Tl;dr: Backblaze or Cloudflare are good options assuming you want a hosted service.
Whatever you end up using, make sure to check what it costs to get your data back out again - that’s where they often get you. S3 itself is $0.01-0.02/GB/month for actual storage, for example, but then egress for that GB is $0.09 every time anyone needs to access it. Large amounts of data, or moderate amounts to large numbers of people, add up fast.
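To put rough numbers on that (using the list prices above, which vary by region and tier):

```typescript
// Back-of-envelope S3 cost, using the rough list prices quoted above.
const gbStored = 100;
const storagePerGbMonth = 0.02; // ~$0.01-0.02/GB/month for storage
const egressPerGb = 0.09;       // first-tier internet egress

console.log(`storage: $${gbStored * storagePerGbMonth}/month`); // $2/month
console.log(`one full read: $${gbStored * egressPerGb}`);       // $9 per read
// Fifty full reads of that bucket is $450 - a couple of hundred
// times the monthly storage bill for the same data.
```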
There are a ton of S3 compatible stores out there, so I won’t claim to know the details on every single one, but I’m with Backblaze for personal use (NAS offsite backups) and they’ve been consistently great - cheap, tech focused, and the reliability of being a massive player who only really does storage. Egress is pretty good at $0.01, but since they’re backup focused (i.e. mostly writes, very occasional big reads) there’s also the “fedex a hard drive” option to avoid that entirely, which I appreciate.
At work we’ve historically used S3 as there’s a lot of data we create and process with AWS without it needing to leave, but the lock-in implications are getting worrying - nobody wants a five figure exit bill just to shift to another cloud. We’re trialling Cloudflare R2 with moderate success so far: the major advantage is totally free egress, and knowing what their bandwidth capacity is like I do actually trust them to stick with that, but it only implements the basic S3 API (no versioning, no lifecycle policies, worse Terraform support, newer and less proven reliability). For hobbyist use that stuff likely matters a lot less, so it may well be a good shout.
Thanks, I knew Backblaze was a backup service but didn’t realize that they also provide S3-compatible storage.
Since it’s much cheaper than Amazon, I was wondering what the catch is. Apparently your data is stored in one region. There are only a few regions. Replication is possible but quite limited. That does seem very reasonable for most purposes, but maybe not a great fit for edge servers like Deno Deploy?
(Caring about multi-region replication is admittedly very weird for a hobbyist. Also, litestream is single-writer so maybe one region is fine?)
I see that Litestream has replica guides for a list of S3-compatible services, and that seems like a good place to look for popular services to use with sqlite.
I don’t really know what I want, but I’m thinking of using this for building websites, so maybe integrating with a CDN is good.
Definitely smart to be wary of underpricing, although I will say that S3 isn’t necessarily the best benchmark: their storage charges are nothing to write home about, and then the egress is just a naked cash grab - it’s literally hundreds if not thousands of times more than cost, depending on which colo provider you ask.
In terms of replication, I’m the one with offsite backups for a personal NAS here so I’m certainly not going to judge you for thinking about it on a hobby project! What I will say is that it sounds like you’re probably more in the market for a CDN to point at your storage bucket rather than replicating that bucket across regions.
Assuming you’re thinking about it for performance, Cloudflare are arguably the kings/queens/gender neutral heads of state in the space. They’re relative newbies to storage, but networking is their bread and butter and they’ve got strong edge compute options to go with it. Whichever provider you go for, a CDN is tuned for performance (vs object stores that are tuned for durability) and the optimisations they can do on the fly with real-time knowledge of the traffic, the state of the network, and the state of the edge storage will generally outperform any simple preset replication strategy you can think of. It also means you pay for durability on a single copy and then just for what you use at the edge, rather than full duplication multiple times.
If you did want it for disaster recovery I’d use a zero-egress provider as your primary store and then sync it to a different cloud provider entirely - it’s literally cheaper to sync another provider to S3 than it is to sync between AWS regions, and you get much better redundancy too. If you could get away with a short downtime in the 0.00…% chance of a total loss of your primary provider, you can use the archive tier on your secondary provider for fractions of a cent per GB.
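A naive version of that cross-provider sync is only a page of code with any S3 SDK. This is just a sketch, assuming an R2-style zero-egress primary and plain AWS as the secondary - the endpoints, bucket names, and credentials are all placeholders, and in practice a tool like rclone handles retries and incremental syncs far better:

```typescript
import {
  S3Client, ListObjectsV2Command, GetObjectCommand, PutObjectCommand,
} from "@aws-sdk/client-s3";

// Zero-egress primary (R2-style endpoint; account ID and keys are placeholders).
const primary = new S3Client({
  region: "auto",
  endpoint: "https://<account-id>.r2.cloudflarestorage.com",
  credentials: { accessKeyId: "PRIMARY_KEY_ID", secretAccessKey: "PRIMARY_SECRET" },
});
// A different provider entirely as the disaster-recovery copy.
const secondary = new S3Client({ region: "us-east-1" });

// Copy every object in the primary bucket into the secondary's archive tier.
let token: string | undefined;
do {
  const page = await primary.send(new ListObjectsV2Command({
    Bucket: "my-data", ContinuationToken: token,
  }));
  for (const obj of page.Contents ?? []) {
    const got = await primary.send(
      new GetObjectCommand({ Bucket: "my-data", Key: obj.Key! }),
    );
    await secondary.send(new PutObjectCommand({
      Bucket: "my-data-dr",
      Key: obj.Key!,
      Body: got.Body,                   // stream straight through
      ContentLength: got.ContentLength, // required when the body is a stream
      StorageClass: "DEEP_ARCHIVE",     // fractions of a cent/GB, slow restores
    }));
  }
  token = page.NextContinuationToken;
} while (token);
```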
For Deno Deploy, the kind of performance I'm thinking about is cold-start latency. Someone visits the website, the server starts up and loads some kind of smallish database into memory, then it decides what to do and maybe serves up a larger amount of data, possibly modifying as it sends it. On updates, the smallish database needs to get replicated. Deno Deploy has no local state currently, so it needs to happen somewhere else.
I have a decent setup using Neon for a Postgres database that starts up on demand, so maybe I should stick with that. And Deno has a KV store in closed beta.
The part where I'm over-thinking it is that Deno Deploy has edge nodes in different regions around the world and I wonder what latency would be like for them. I could make it fast for me by putting the database in the western US. :-)
Cloudflare certainly has impressive offerings but the free tier doesn't include the more interesting stuff and I haven't tried them yet.
The other thing I should really do someday is back up Google Photos somewhere. There is an rclone tool but it has lots of limitations.
That makes total sense! If I'm understanding correctly, you could use a CDN URL for the `litestream restore` call that runs when a Deno Deploy instance boots? That way their servers would get the same speed and locality advantages that end users would, without having to guess in advance which of their regions you'll end up using most.

Of course, if you wanted your storage to be on the same physical network as the servers it'd be best to have them both from a single provider, and Deno Deploy's region list definitely looks familiar. Litestream supports GCS natively too, so it might well fit together quite nicely for what you're doing - although I will say that Neon looks very cool too, and it's one I hadn't come across, so cheers for another bit of interesting tech to read up on on the train!
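(For concreteness, the boot-time step I mean might look something like the sketch below - assuming plain Deno rather than Deploy itself, since Deploy doesn't let you spawn subprocesses, and with a made-up bucket path. Whether a CDN URL can stand in for the replica URL depends on which schemes litestream accepts, so this just uses a plain s3:// URL.)

```typescript
// Cold start: pull the latest SQLite snapshot down before serving traffic.
// The bucket path and output location are hypothetical.
const restore = new Deno.Command("litestream", {
  args: [
    "restore",
    "-if-replica-exists", // no-op on a fresh replica instead of an error
    "-o", "/tmp/app.db",
    "s3://my-bucket/app.db",
  ],
});
const { code } = await restore.output();
if (code !== 0) throw new Error("litestream restore failed");

// From here the app opens /tmp/app.db and starts handling requests.
```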
Scaleway offers 75 GB of storage for free and includes 75 GB of egress per month.
If your data payload is under 25GB with 25GB monthly egress, you could try the free tier of https://storj.io
They have their own durable storage backend and provide a free S3 compatible API so you can use whatever S3 tooling you want.
I've been using their free tier for testing restic backup and it's been solid for over a year.
If on-site is your thing, then MinIO, as said by others, is the de facto choice. I believe you can use it with one physical disk, but you won't get any of the storage benefits that S3 and similar provide.
I chose backblaze storage for a side project. Seems nice and it has a free tier.
@skybrian I have also heard of Wasabi storage: https://wasabi.com/cloud-storage-pricing/#cost-estimates
NOTE: I have never used them myself, so I have no experience with them. I only heard about them some time ago from technology podcasters, and the above link shows comparison pricing vs other providers. At a quick glance, it appears that they do not charge for egress, which sounds great. But, again, I have no idea how reliable or performant they are (or not).
I looked into using Wasabi a few years ago for storing images on my Mastodon instance, and I did a lot of searching through their documentation to figure out the specifics, because "no egress fees" for one monthly price seemed too good to be true - and it turned out it was.
Egress is limited to however much data you have stored for the month.
So any use case where files are read more than they are written wouldn't be allowed, and you'd need to use the alternative pricing model.
There's also a 90 day minimum for any data stored.
Ah-ha, so that's how they get you! Thanks for sharing!
+1 to Minio, using it for a side project. Easy to spin up a docker container.
I'd also mention that both AWS and GCS provide a free tier. For hobbyist use this could be more than you need. For S3 it's 5GB free for the first year; for GCS it's 5GB free indefinitely.
If you're interested in self-hosting, Garage looks quite good.
There are lots of options for self-hosting. I think Ceph gets a lot of use by businesses, and IIRC you can even set up Apache to work that way if you're a sadist who likes to deep-dive into manual configurations.
I think storj is what you're looking for.
I use Cloudflare, and I do like it.