What is the "jesus nut" of your field?
Recently read a Reddit post about the jesus nut: the nut that holds the main rotor onto the rest of a helicopter. More generally, it refers to any single point whose failure would be catastrophic for the whole system.
What is the "jesus nut" (single point of failure) of your field?
Software guy here. Nothing works if there's no electricity. Doesn't matter how good your failover systems are, doesn't matter how robust you make everything, if there's a blackout shit isn't going to work the moment your UPSs run out of juice.
Oh but just have multiple systems / the cloud? That's an improvement, but at some point the human needs a working screen in front of them.
Reminds me of when Telegram went down across Europe because of a small power outage in Amsterdam
Another software guy here. Power outages are nasty and all, but generally you can still recover from them. Work may grind to a halt, but that's still not too big a deal.
You know what terrifies me? Databases. It's a single point of failure that could spell the end of your business if you're not careful, and the sheer variety of possible causes for that failure is enough to leave you paranoid. It's a single point of failure that can be caused to fail by a wide variety of other points of failure. It's like a string of dominoes where the last domino will press the launch button for a nuclear warhead, and all you have to do is knock down one of the other dominoes nearby to make that happen.
Backups not running? Had it happen. Backups unknowingly corrupted and seemingly fine at first glance despite the corruption? Had it happen. Web host's physical server's hard drive containing critical data failing? Had it happen. Unexpected web host outage? Had it happen. Bad database replication? Had it happen. This list is smaller than the list of all things I've dealt with concerning databases, and that list is still far smaller than the number of things other people have dealt with concerning databases.
If you weren't smart and you put all of your eggs in one basket, so to speak, and your web host's data center (the one holding all of your database nodes and backups) goes up in literal flames, then you just spent a whole hell of a lot of money on a failed business because, guess what, storing, managing, and delivering that data was basically the very core of your business. The application itself that you built is just a fancy portal to that data with user-friendly ways to manipulate and visualize it.
If you want to make absolutely certain that you're not going to wake up at 3am one morning to your boss calling and saying "the application is down and I was just informed that all of our data is lost and can't be recovered, so you and everyone else at the company are effectively unemployed now", then you have to cover all of those possible points of failure. Sure, quite a few of them aren't particularly likely, but they're still possible. If you can't, e.g. you're working at a startup or another small business that can't afford that kind of budget, then your business is basically sitting beneath a guillotine and you're just hoping that the (hopefully) unlikely but very real possibility of it coming crashing down on you never materializes. Truth be told, if you have basic replication and backups in place, then the guillotine will probably never come down, but it still can.
Sure, you could argue that having all of those additional points of failure and covering all of your bases makes this not really a Jesus nut, but I would argue otherwise because databases aren't designed with that sort of resiliency out of the box. You have to deliberately go through a complex process to set up a convoluted architectural mess of points of data redundancy and backups if you want to achieve that level of resiliency. That kind of setup is the equivalent of redesigning the Jesus nut or its surroundings to make it not a single point of failure. It's a single point of failure until you've explicitly engineered it not to be.
tl;dr - Databases terrify me more than a power outage ever will.
Database failures are literally always an avoidable mistake though. You should always have multiple backups, and if they aren't working and the original goes down, that isn't a single point of failure, it's 2 points.
Alright, now that I'm up and about, I think I've come up with a satisfactory explanation of my intended argument:
If you don't deliberately set up your database architecture correctly, you're only trading one single point of failure for another. For instance, multiple backups don't matter if they're located on the same physical server as your database. One hard drive failure later and your data is completely lost. Less risky is to store them across multiple physical servers within the same data center, but a single data center fire could wipe out everything as well. Even if you distribute across multiple data centers, if the hosting company itself suddenly goes dark and closes up for good without warning or they do something incredibly stupid, then your data is still gone. There are lots of solutions that end up being put into practice that are just changing what the Jesus nut actually is, but it all centers around the database itself.
Now, clearly you can take responsible measures like including offsite backups, having a database node or two that are hosted elsewhere, etc., but one of the things I'd noted is that smaller businesses often cannot afford the price tag on this kind of architecture. They have to cut corners on this. If they have database replication, there will only be a few nodes at most and they're unlikely to be distributed across data centers and especially not across services. If they don't have replication, then if their database goes down they could find themselves spending potentially hours restoring from a backup (shorter if they have a good snapshotting process in place rather than doing a dump) with a potential loss of substantial amounts of critical data. If they have backups, they're probably stored with the same service as their database is (see @SaucedButLeaking's comment) and an architectural failure on the host's part could result in a complete loss.
Lack of resources means you put additional barriers in place and allow yourself to partially recover in most cases, but you still end up having a Jesus nut in the end. You end up reducing your risk, but not fully eliminating it.
You're absolutely correct that these failures are completely avoidable if resources aren't a problem. It's just that if you lack those resources, then those failures are only almost always avoidable, and that "almost" is our Jesus nut, just phrased differently.
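For what it's worth, even on a shoestring you can blunt a couple of those failure modes with a dumb nightly script: dump the database, verify the dump actually opens, and push a checksummed copy somewhere that isn't your primary host. Here's a rough sketch of the idea in Python; it assumes PostgreSQL with pg_dump/pg_restore on the PATH and a second, independently hosted location mounted at /mnt/offsite, so treat the paths and names as placeholders for whatever stack you actually run.
```python
#!/usr/bin/env python3
"""Nightly backup sketch: dump, verify, copy off-box.

Assumptions (adjust to taste): PostgreSQL, pg_dump/pg_restore on the PATH,
and an independently hosted destination mounted at /mnt/offsite.
"""
import datetime
import hashlib
import shutil
import subprocess
from pathlib import Path

DB_NAME = "appdb"                      # hypothetical database name
LOCAL_DIR = Path("/var/backups/db")    # first copy, same box
OFFSITE_DIR = Path("/mnt/offsite/db")  # second copy, different provider

def main() -> None:
    stamp = datetime.datetime.now().strftime("%Y%m%d")
    dump_path = LOCAL_DIR / f"{DB_NAME}-{stamp}.dump"
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    OFFSITE_DIR.mkdir(parents=True, exist_ok=True)

    # 1. Dump in pg_dump's custom format so pg_restore can inspect it later.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", str(dump_path), DB_NAME],
        check=True,
    )

    # 2. Sanity-check the archive: pg_restore --list fails on a truncated or
    #    corrupted dump, which catches the "backup looked fine" trap.
    subprocess.run(
        ["pg_restore", "--list", str(dump_path)],
        check=True,
        stdout=subprocess.DEVNULL,
    )

    # 3. Copy off-box and compare checksums so a bad transfer gets noticed.
    #    (Reading the whole file into memory is fine for a sketch.)
    offsite_path = OFFSITE_DIR / dump_path.name
    shutil.copy2(dump_path, offsite_path)
    local_sum = hashlib.sha256(dump_path.read_bytes()).hexdigest()
    remote_sum = hashlib.sha256(offsite_path.read_bytes()).hexdigest()
    if local_sum != remote_sum:
        raise RuntimeError("offsite copy does not match local dump")

if __name__ == "__main__":
    main()
```
It won't save you from every domino in that chain, but something this simple would have caught the "backups silently corrupted" and "everything lived on one box" failures mentioned above.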
Our software is sold to small groups of people, usually without dedicated IT. It's also mission critical and possibly life-saving (or life-threatening). We tell them to adhere to the 3-2-1 principle: keep at least 3 copies of their data, on at least 2 different kinds of storage, with at least 1 copy offsite, refreshed at least weekly.
Because their resources are limited, and due to the nature of the software, if they're unable to comply with that principle, we refer them to our services team at whatever rate they can afford.
Databases can be and often are scary, as a point of failure, but they don't have to be.
There's more to my argument that I omitted initially for brevity, but would have addressed additional concerns here. Unfortunately it's late and I need to sleep, but I'll try to provide a more coherent and complete answer at another time :)
The worst day I had in 4.5 years of working support for a web host was the day 2 drives in our RAID 5 array failed. I was scheduled 10-6 that day.
11:00 - The physical machine hosting two virtualized shared hosts (with a couple hundred accounts apiece) crashes. Attempting a reboot shows all sorts of scary error messages that amount to "escalate this; shit's broke"
11:15 - The sysadmin asks the CTO to give him a hand in the server room.
11:30 - Calls start coming in from people wondering why their sites aren't coming up. Standard procedure is to open a ticket and get them off the phone. So far, nothing seems amiss. Either the sysadmin or the CTO will have some magic to work, and it'll be fine by end of shift.
12:30 - Our sysadmin and CTO have been in the server room for about an hour now, trying everything they can to get the drive booted so we can start recovering whatever data there is
13:00 - The sysadmin looks at me and says "it isn't looking good" with that Oh Shit kind of look
15:30 - It's already been One of Those Days, and we still don't have anything promising from the server room. Several of the more anal customers have called 4 or 5 times by now.
16:00 - The CTO explains to the CEO that there is not enough of the array left uncorrupted to get any of the data off of the drives. Ticket count for this issue is roughly 50. Manager and I start drafting a notice stating that the servers died and there's no backup. I sign out of the phone queue and we start notifying everyone who was affected.
17:00 - Open tickets about the issue are over 100. Manager and I sign back into phone queue after we order pizza
19:00 - I've taken over 50 calls (normal day usually topped out around 25) and the call queue is full. Ticket queue is the highest I've ever seen it. Audible sighs and groans are heard after every call ends, while the next one is ringing.
22:00 - Manager, myself, and the other tech who was working the mid shift call it a night. Night shift came in around 16:00, so she's been in the shit with us. Manager briefs the overnight guy, who gets in at midnight.
I don't have a number for how many cancellations we processed, or refunds we issued. I do know that there were still people just finding out about it about 6 months later. One or two people threatened to sue us over it, but our ToS clearly stated that backups are the responsibility of the customer. One of them expected us to pay for forensic recovery on the drives. The CTO laughed at him.
The moral of the story is: if you're going to host a website on shared hosting, keep your own backups. Don't assume that your host is going to be able or willing to restore your data after a catastrophic failure.
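Seconding this. For anyone on shared hosting wondering what "keep your own backups" looks like in practice, a cron-able pull from your own machine covers most of it. The sketch below is just an illustration: the host name, paths, and database name are placeholders, and it assumes you have SSH access to the host with rsync and mysqldump available there.
```python
#!/usr/bin/env python3
"""Pull-style backup of a shared-hosting site: files via rsync, DB via mysqldump over ssh.
Host, paths, and names below are placeholders.
"""
import datetime
import subprocess
from pathlib import Path

HOST = "user@shared-host.example.com"   # hypothetical SSH login
REMOTE_DOCROOT = "~/public_html/"       # trailing slash: copy contents, not the dir itself
BACKUP_ROOT = Path.home() / "site-backups"

def main() -> None:
    dest = BACKUP_ROOT / datetime.date.today().isoformat()
    dest.mkdir(parents=True, exist_ok=True)

    # Mirror the site files; -z compresses in transit, --delete keeps the copy honest.
    subprocess.run(
        ["rsync", "-az", "--delete", f"{HOST}:{REMOTE_DOCROOT}", str(dest / "files")],
        check=True,
    )

    # Dump the database on the host and stream it back over ssh.
    # Assumes DB credentials live in ~/.my.cnf on the host.
    with open(dest / "db.sql", "wb") as out:
        subprocess.run(
            ["ssh", HOST, "mysqldump --single-transaction mydatabase"],
            check=True,
            stdout=out,
        )

if __name__ == "__main__":
    main()
```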
Man, that would be a nightmare. I'd need some drinks after a night like that.
Rolled a fatty that night.
What sucked worse was that I got calls escalated to me for the next 2 weeks because I had the longest fuse and a calming demeanor.
Phew! 24 hours is a lot better than 2 weeks!
But yeah, sometimes something has to blow up before proper procedure gets put in place. For example, one of my tasks over the months following The Incident was rolling out backup software onto the shared hosts. I'm pretty sure the hardware to do so was ordered the week after, once the numbers were in.
There is a reason "RAID is redundancy, not backup" is one of the mantras in the Data Recovery & Computer Forensics world. When you put together an array where all the drives are the same make/model and come from the same production batch, you would be astonished just how often multiple of those drives fail at the exact same time. Way too many sysadmins also think the redundancy offered by arrays means they don't need a proper backup policy, so when their critical array inevitably fails, the companies they work for become desperate and willing to pay insane amounts of money for even the slimmest chance of recovery. Failed array recovery was one of the biggest money makers for the Recovery & Forensics company I worked for in the UK for just that reason.
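Even a toy calculation shows how thin that margin is before you account for correlated failures. This is a back-of-envelope sketch with made-up numbers, and it assumes independent failures, which is exactly the assumption that same-batch drives break:
```python
# Back-of-envelope: chance that a second drive dies while a RAID 5 array
# rebuilds, assuming (unrealistically) independent failures. The numbers
# below are made up for illustration.
import math

N_DRIVES = 8            # drives in the array
MTBF_HOURS = 1_000_000  # optimistic spec-sheet MTBF per drive
REBUILD_HOURS = 24      # time to rebuild onto the replacement drive

remaining = N_DRIVES - 1               # drives that must survive the rebuild
failure_rate = 1 / MTBF_HOURS          # per-drive failures per hour
p_one_survives = math.exp(-failure_rate * REBUILD_HOURS)
p_second_failure = 1 - p_one_survives ** remaining

print(f"P(second drive fails during rebuild) ~ {p_second_failure:.4%}")
# ~0.017% with these inputs: tiny, until you remember that same-batch drives
# do not fail independently, and that unrecoverable read errors during the
# rebuild are a far bigger practical risk on large arrays.
```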
In the theatre, everyone else can turn up on performance day, but if the performer isn't there, then nothing is going to happen. Everyone is far more replaceable on that day than the performers.
Don't most roles have understudies for just that reason, or is that only for big productions?
Only really big productions; usually a show can't afford to pay more than one actor per role.
One of my facilities works with large furnaces of molten aluminum. If you splash water on the aluminum, you're probably fine, but if water gets trapped beneath the aluminum, you're fucked.
There was one accident years ago at some other company where a disposable water bottle got mixed in with scrap aluminum to be remelted. The whole load was dumped into the furnace, and it leveled the entire building. The trapped water gets superheated, expands violently, and throws molten aluminum everywhere.
Here is a short, very small explosion (NSFL).
Edit: The worst part about it is that we are a very high-turnover company and we have vendors coming in and out all the time. We have a "no water bottles policy," but all it takes is one person being an idiot, and we all die.
The most dangerous part isn't the water bottles, though, it's water/coolant getting stuck in the nooks and crannies of the scrap metal and getting dunked below the surface of the melt. And the scrap hoppers are out there for everyone to put anything in. Even though they say "NO WET SCRAP", it happens all the time. Usually it just evaporates before it hits the aluminum, but sometimes there are small "pops" where aluminum gets splashed out and hits the floor/forklift/hopefully not the operator.
Metal, hot enough to be liquid, at an industrial scale, is terrifying. I never knew about water in aluminum, but it makes sense that rapid expansion of water into vapor is going to have a bunch of energy, and once it gets to the point of breaking through the molten aluminum, the pressure is such that a lot of aluminum goes with it.
But, damn, a water bottle leveled a whole building?
That's the most terrifying thing in this topic!
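Out of curiosity, here's a rough ideal-gas estimate of what that bottle turns into. The temperature and bottle size are assumptions, and this is nowhere near a proper safety analysis:
```python
# Back-of-envelope for the water-under-aluminum explosion, using the ideal
# gas law. Numbers are rough assumptions, not a safety analysis.
R = 8.314            # J/(mol*K)
P = 101_325          # Pa, atmospheric pressure
MELT_TEMP_K = 973    # ~700 C, an assumed aluminum melt temperature

water_volume_l = 0.5                  # one disposable bottle
water_mass_g = water_volume_l * 1000  # ~1 g/mL
moles = water_mass_g / 18.02          # molar mass of water

steam_volume_m3 = moles * R * MELT_TEMP_K / P
expansion = steam_volume_m3 / (water_volume_l / 1000)

print(f"{water_volume_l} L of water -> ~{steam_volume_m3:.1f} m^3 of steam "
      f"(~{expansion:,.0f}x expansion)")
# ~2.2 m^3, roughly a 4,400x expansion, and it all happens in an instant,
# underneath a pool of liquid metal with nowhere to go but up.
```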
Ennui. Inertia. Whatever you call it, the moment you stop believing in yourself. When you fail to put paint on the surface or pixel on the layer, whether you want to or not.
The one-way valve in a bagpipe blowstock is the one thing keeping air in your bag; when it fails, you stop playing. Luckily, I haven't had mine fail while I was playing in a long time!
I would suggest that the sole point of failure for a bagpipe player is the bagpipe player. You can always just swap out the valve, but once your lungs go...
You've definitely got a good point. Hopefully, in a few years failing lungs won't be an issue.
Residential electrician here (Canadian at that, so please don't shout NEC code at me), but where the cables come into your house panel it's almost always aluminum, and the panel connections are always copper. Now the problem is that two different metals don't like touching, so you need a little special goop, but so many times I've seen people not put it on, or not tighten the bolt that holds the wires. Without that connection tight and without that goop, those 100 amp wires will slip out and potentially burn down your home.
I work in software also, for business management and security. Weak or nonexistent passwords. Doesn't matter how well everything is set up and what processes you put in place: if you're using the default
Admin/password
stored in plain text, you will be "hacked". I also commented on the databases thread, but this one got us recently.
The default password for the admin account our software uses turned up in a list of hacked accounts online. Shortly thereafter, we started receiving calls about our software going down. One call turned into two, turned into four, etc.
Turns out that, because our target market is small groups without dedicated IT, few of our customers bothered changing the password, even after having been instructed to. Because that password leaked, many of our customers faced malware attacks, mostly ransomware.
Our support team was completely overwhelmed. Every team tangentially related to the software halted everything (deployment, implementation, development) for two weeks to call every customer we have and help them deal with everything.
It was a goddamn nightmare. Change your passwords, folks.
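The boring fix, for anyone shipping software into environments like that, is to never ship a shared default at all: generate a random per-install admin password at deployment time and store only a salted hash. Here's a minimal sketch using just the Python standard library; the function names and parameters are illustrative, not how any particular product does it.
```python
# Minimal sketch: per-install random admin password, stored only as a salted
# scrypt hash. Standard library only; names here are illustrative.
import hashlib
import hmac
import os
import secrets

def provision_admin() -> tuple[str, bytes, bytes]:
    """Generate a one-time admin password; return it plus the salt and hash to store."""
    password = secrets.token_urlsafe(16)   # shown to the installer once, never stored
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return password, salt, digest          # persist only salt + digest

def verify(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)

if __name__ == "__main__":
    pw, salt, digest = provision_admin()
    print("one-time admin password:", pw)
    assert verify(pw, salt, digest)
    assert not verify("Admin/password", salt, digest)
```
Forcing a password change on first login gets you most of the same benefit if you can't change how credentials are provisioned.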
There are so many Jesus Nuts in software.
Passwords, Databases, Electricity. Any one can fail, and the whole system can get suuuuuuuper messed up.
Exciting!
Sysadmin here. I moved from a small business to a decent-sized research university, and one of the biggest changes / reliefs was the fact that we have very few "jesus nuts". Our DR policies go up to the point of "a meteor falls on the datacenter."
Of course, a backhoe in the wrong spot could still cut our fiber and ruin all our days, but that's true of pretty much everyone (looking at you, Level 3)
I don't know if they still provide it, but Level 3 used to blog about their quarterly/annual outage causes by category, and for the longest time, backhoes were always second on the list. Their leading network-wide cause of outages was invariably... squirrels.
I can only assume the third was solar flares.