Tildes will be down for most of this weekend
I'm going to be taking the site down this weekend to do some upgrades and changes to various systems it runs on. I'm planning to start the downtime somewhere around noon on Saturday, and have it running again on Sunday evening (vague times, in a vague North American timezone).
If you're interested in the details, the main reason is to switch operating systems from Ubuntu to Debian. The easiest and safest way to do this is by just setting up a new server and moving the site over, so I'll also be taking the opportunity to switch to a different physical server. Tildes has been running for 3 years now, so I'll be able to rent a new server that's some combination of faster and cheaper (not that we're getting anywhere close to the limits on this server, but I might as well).
Since I was having to review and adjust the whole server setup as part of switching OSes, I also decided to switch configuration-management systems from SaltStack to Ansible. Salt had a major vulnerability last year that compromised thousands of servers, and Tildes really only avoided being affected through the luck of using it in a non-vulnerable way. I've been intending to switch away from it ever since.
And since the site will be down anyway, I'll also be doing some upgrades and changes that are difficult to do without downtime, including upgrading PostgreSQL, Redis, and Python to their newest versions and making some changes to their setups.
So overall, this turned into a large project, and unfortunately it's one of those wonderful development projects where you do a ton of work and if everything goes well, nobody can even notice any difference from the way it was before. It also would have been possible to do all of this with only minimal downtime, but it would have required a lot more prep work and would be more stressful, so I'm just going to do it the easy way and take my time. If you're interested in doing a Screenless Day, maybe this could be a good time to do it!
Thanks for Tildes.
All great changes, lousy to have to spend a weekend on it. Good luck and godspeed.
I've spent more than a few Saturdays doing this very thing. Given your tech stack I doubt you'll run into any trouble - the benefits of simple, reliable, boring technology. ;)
I know/love the micro-downtime when you push updates that hardly lasts longer than a page refresh. Is this the first real downtime we're having? I can't seem to remember the last time the site was offline post-launch.
It is! There was a short one (10 minutes, maybe?) when I did a PostgreSQL version update in October 2019, but I think other than that we only ever had one or two for a couple of minutes when I did something dumb directly on the server.
The site has never gone down "on its own" though. I set up a bunch of different monitoring to alert me if various things crashed or were going wrong, and none of it's ever gone off. I'm definitely not complaining about that, it's been nice to never need to worry about the site crashing when I'm unavailable.
I'd wager you just casually beat most if not all social network uptime records for the first years of operation - not that anyone's really keeping score. :P
To be fair, most well-known social media platforms came onto the scene when demand (and thus the traffic and upkeep costs needed to keep up with it) was increasing exponentially, unlike Tildes.
Also to be equally fair, Tildes' design is better than those networks'. Most of their web pages are larger than the original Doom game was at this point. All Tildes has to do is pump a little text around. It should scale well enough on that alone; even the current server could likely handle hundreds of thousands of active users without breaking a sweat.
Simplicity and efficiency will carry you a long way when it's time to scale.
I'd be worried that it just doesn't work!
The first thing to do after setting up monitoring is to forkbomb your server just to make sure it works. ;)
Funny to hear forkbomb used in a useful context, takes me back to messing around with command prompt. Oh the days of hiding %0|%0 in a bat file and hoping someone would click... Now I want to set up better monitoring on my stuff so I can stress test it!
Thanks for the heads-up and the details! I love me some devops chat :)
Can you elaborate on the why of this change? You described why you'll move servers and ditch Salt, and we know it's generally good to update to newer versions of anything like Python when given the chance, but you didn't explain the OS change.
There isn't really a single overwhelming reason, but the version of Ubuntu I'm using is at its end-of-support date, so I would have had to do some kind of OS switch regardless. It could have just been an upgrade to a newer Ubuntu, but those don't always go smoothly either.
Ubuntu's based on Debian, and I overall have more trust in the work the Debian maintainers do to make sure everything's stable and secure on it. I was more familiar with Ubuntu, but I probably should have gone with Debian in the first place. I'm also a little uncomfortable with some of the directions I've seen Ubuntu moving in lately, including the push towards Snap packages.
I've read a little about Nix, but I don't really have a solid understanding of it at all. It has some great concepts, but it also seems really complex and unique.
I'd probably like to tinker around with it on my own PC or maybe a VPS that hosts some unimportant stuff, but I'd be too scared of using it for something like Tildes. I'd be worried that I'd run into some kind of obscure problem that I have no idea how to fix myself and can't find any info online about. It's nice to do things in a boring way where it's always easy to find solutions.
Debian seems like a better choice if you just want something that works and stays out of the way. Using Ansible or Docker for your configuration management is going to be more straightforward than wrestling with a somewhat alien package management system. And you will wrestle with it at first.
Don't get me wrong, I think NixOS is pretty cool, but unless you want to use it or you have a good reason to use it, there are more straightforward options available.
I vastly prefer just paying Heroku to take care of it.
Copy-pasta aside, I think it's reasonable to define an operating system as any set of programs bundled together that alone are meant to provide an end user with some base functionality. So the definition of an OS is really very muddy.
Aye aye, captain! Thanks for the heads up and good luck.
I didn't actually get to do mine on the designated weekend, so this is a good push for me to actually fit it in during this upcoming one.
Good luck with the upgrade! Thanks for everything you do for this place!
From my end, it looks like nothing happened at all. Mission accomplished.
Thank you so much for the work you do!
Good luck!
See y’all on the other side.
Make sure you have backups first!
Make sure your backups aren't write-only first. I've had that happen a few times and it's never fun. :P
How'd you manage that?
I inherited a datacenter full of about 60 machines that all had tape drive backups, running locally, to 60 tape drives, attached locally, and all using whatever rando backup software/system the developers of that particular project decided to use. Literally none of them were backing up what was required for a proper bare metal restore when I got them. :/
Virtual machines are a blessing. I don't miss those shenanigans.
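For anyone wondering what a check for "write-only" backups could look like, here's a minimal Python sketch. It assumes a plain tar.gz backup of a single directory, which has nothing to do with how Tildes itself (or that tape setup) is actually backed up. The function names `make_backup`, `file_digests`, and `backup_restores_cleanly` are made up for illustration. The idea is just to restore the archive into a scratch directory and compare file hashes against the source, rather than trusting that the backup job exited cleanly.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def make_backup(source_dir: Path, archive: Path) -> None:
    """Write source_dir into a gzip-compressed tar archive (hypothetical helper).

    The archive should live outside source_dir so it doesn't back up itself.
    """
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to source_dir.
        tar.add(source_dir, arcname=".")


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup_restores_cleanly(source_dir: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to the source.

    Assumes the archive is one we created ourselves; don't extract
    untrusted archives this way.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source_dir)
```

This only compares file contents, not permissions or ownership, so it wouldn't have caught everything needed for a bare metal restore. It does at least prove the archive can be read back.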
Good luck, see you on the other side!
Can you give us a warning when you're about to shut the site down, like, say, 30 minutes before you do it?
About an hour from now.