My most painful Linux experience
Yesterday I hit the most painful problem I have personally encountered in my few years of using Linux. With the current prevalence of topics related to Linux, and especially ones related to new users, I figured it would be good to share and leave a place for others to share any similar stories (and ideally how to avoid them).
The problem I encountered was effectively that my machine crashed and I was locked out on reboot. I'll describe how I crashed it later and for now just focus on the locked out bit.
During startup something was failing and as a result it would dump me into emergency mode. Emergency mode is basically just a root terminal where your ultimate goal is usually to read your logs and fix whatever was logged as failing. Annoying, but not a real issue. The real issue was that I was also locked out of emergency mode! This meant that literally the only thing I could do was get into a boot cycle telling me everything is locked.
So I head off to forums on my phone looking for what cryptic wizardry I'm going to need to perform. I need a live boot OS because it is impossible to fix from my current install. I have to live boot another image, mount my original primary partition (after decrypting it), chroot into the new mount point, and then use passwd to set a new root password. If I'm smart I'll come back to this thread later, when I'm not on my phone, and edit in or reply with the actual commands needed, since in reality I found myself piecing them together from across the Internet and maybe I'll need them again some day.
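As a rough sketch of those steps (not the exact commands used here), the recovery from a live USB might look like the following. It assumes the disk is encrypted with LUKS and that the root filesystem sits directly on the decrypted device (no LVM or btrfs subvolumes). The function name, the mapper name cryptroot, and the device path in the usage line are all placeholders.

```shell
# Sketch only: run as root from a live USB. Assumes a LUKS-encrypted root
# partition with a plain filesystem directly on it (no LVM, no btrfs subvolumes).
reset_root_password() {
  luks_dev="$1"   # encrypted partition (placeholder; find yours with lsblk)
  if [ -z "$luks_dev" ]; then
    echo "usage: reset_root_password <encrypted-partition>" >&2
    return 1
  fi
  cryptsetup open "$luks_dev" cryptroot || return 1   # asks for the disk passphrase
  mount /dev/mapper/cryptroot /mnt || return 1        # mount the installed system
  chroot /mnt passwd root                             # set a new root password inside it
  umount /mnt                                         # clean up before rebooting
  cryptsetup close cryptroot
}

# Usage (hypothetical device name):
#   reset_root_password /dev/nvme0n1p2
```

On an Arch live image, arch-chroot /mnt passwd root should work in place of the plain chroot line.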
For avoiding this: check you have a root password. You may think you have one but might not. Set it to anything. Do it now, not after you're already locked out. The reason for being locked out of emergency mode was that passwordless root is locked, but there's no way to unlock it in emergency mode. I personally encountered this on Arch, but my search for error text was also taking me to Fedora forums so I don't think it is related to distro beyond the distro supporting no root password.
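One way to check this ahead of time is to look at root's password field in /etc/shadow. Per shadow(5), a field starting with ! or * means password login is locked, and an empty field means no password is required. The helper below is only an illustrative sketch, and its name is made up.

```shell
# Classify the password field (2nd colon-separated field) of a shadow(5) entry.
shadow_field_state() {
  case "$1" in
    '')        echo "empty"  ;;  # no password required at all
    '!'*|'*'*) echo "locked" ;;  # password login is not possible
    *)         echo "set"    ;;  # a password hash is present
  esac
}

# Usage on a real system (reading /etc/shadow needs root):
#   shadow_field_state "$(sudo awk -F: '$1 == "root" { print $2 }' /etc/shadow)"
# If it does not print "set", give root a password with: sudo passwd root
```

Running sudo passwd -S root should report the same information in a single status letter.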
The bit down here is a bit less relevant as it is specific to my case.
Ultimately, I had an invalid /etc/fstab entry for a secondary drive (NTFS extra storage, not boot-critical). The thing is that entry has been there through months of daily boots and had worked, even though it may have been giving warnings or something. It's still lost on me as to why that suddenly became a boot blocker.
I'm pretty sure the original crash was my fault, although it seems pretty insane that what I was doing can break everything to the level it does. I was working on some Vulkan code and I definitely had some bugs in it that made my shader capable of reading out-of-bounds memory, but one would think this would stop at crashing the application. Instead it was causing graphical issues across the entire machine as if I'd simultaneously broken the logical drivers for every application, desktop environment included, at once. If I was lucky Plasma would reboot the whole desktop, if I was unlucky everything was completely frozen with mouse and keyboard doing nothing at all. It was me using the power button to escape the locked machine that triggered my chain of events.
For whatever reason on reboot it behaved differently than before. I'm still not sure why. I hadn't applied any updates or anything during that boot cycle. I shut this particular machine down every night and the issue was on a reboot, not my first boot of the day.
One issue that can come up with a custom fstab is using device names instead of UUID. The system can 'randomly' assign different drives to different device names, so it could work 5 times in a row and then suddenly the wrong drive has that name. You can use blkid to list them and set it correctly.
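A small sketch of what that looks like in practice: blkid can print just the UUID of a partition, which then goes into the fstab line in place of the device name. The helper name, device, mount point and filesystem type below are placeholders.

```shell
# Print an fstab line that identifies the filesystem by UUID instead of a
# device name such as /dev/sdb1, which can change between boots.
fstab_entry() {
  uuid="$1"; mountpoint="$2"; fstype="$3"; options="${4:-defaults}"
  printf 'UUID=%s  %s  %s  %s  0  0\n' "$uuid" "$mountpoint" "$fstype" "$options"
}

# Usage (device, mount point and type are placeholders):
#   fstab_entry "$(blkid -s UUID -o value /dev/sdb1)" /mnt/storage ext4
```

The printed line can be reviewed and then pasted into /etc/fstab by hand.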
There are other problems with secure boot and updating nvidia drivers when dualbooting but that's all magic to me. I use AMD and disable secure boot anyway.
I'll double check, but I think it's always been mounted by uuid. Admittedly I'm pretty sure it's had warnings about not liking this NTFS drive, but since it was succeeding anyway I chose to just not care. Maybe I would've been fine if I'd dealt with those months ago.
I should probably also reformat it at some point. I have a strong suspicion it would've been perfectly fine with it if it was ext4. The only reason it's formatted the way it is is that it used to be a Windows drive that I just shoved in this machine as extra space.
Yeah, the fstab can do that. systemd mounts make things a bit less chaotic. But you can also add nofail,x-systemd.automount,x-systemd.device-timeout=30 to an fstab mount's options to essentially do the same thing that systemd mounts do.
I've been saved by ssh in a lot of these cases. It might take a few hours to figure out how to ssh from your phone but it's definitely worth it! Tailscale makes the networking easy so you can ssh from across the room or across the world (if you don't want to be beholden to the company, there is also the OSS Headscale but it requires quite a bit more setup).
That quoted bit about what I was doing was actually in reference to my application. Like why it's possible to make Vulkan calls that cause so much breakage outside the application. I would've been happy with my application segfaulting or really anything contained to the application, but I was losing the entire desktop. I've assumed that the terminal would still work if I put in enough effort to access it, but just rebooting seems like it would be faster than getting into terminal trying to fix it from the state it's in.
At some point I'll get around to mounting that drive again and was thinking I'll remove it from fstab, although maybe I'll use the options you suggested. My first instinct was basically that since it's just extra storage that isn't boot critical I should just mount it later. Figured there would probably be an alternative, such as systemd mounts, but if I couldn't find anything I could just make my own service that mounts it.
Here's an example. Let's pretend we have this line in /etc/fstab
The one tricky part is that the name of the mount unit has to be the same as the mount point--eg. /net/backup becomes net-backup.mount
Then you "install" it by doing this--same syntax as normal systemd service units:
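A minimal sketch of what such a mount unit could look like, written as a small generator so the naming rule is explicit. Everything except the /net/backup to net-backup.mount mapping is an assumption made for illustration: the helper name, the NFS source server:/export/backup, the filesystem type and the options. The name conversion only handles simple paths. For anything containing dashes or other special characters, systemd-escape -p --suffix=mount gives the correct unit name.

```shell
# Sketch: write a mount unit equivalent to an fstab line.
# Arguments: mount point, source, filesystem type, options, output directory.
write_mount_unit() {
  where="$1"; what="$2"; fstype="$3"; opts="$4"; outdir="$5"
  # Simple paths only: /net/backup -> net-backup.mount
  name="$(printf '%s' "${where#/}" | tr '/' '-').mount"
  cat > "$outdir/$name" <<EOF
[Unit]
Description=Mount $where

[Mount]
What=$what
Where=$where
Type=$fstype
Options=$opts

[Install]
WantedBy=multi-user.target
EOF
  printf '%s\n' "$name"   # report the unit name that was written
}

# Usage (as root; all values hypothetical):
#   write_mount_unit /net/backup server:/export/backup nfs nofail /etc/systemd/system
#   systemctl daemon-reload
#   systemctl enable --now net-backup.mount
```

The last two commands are the usual way to make systemd pick up a new unit file and start it at boot.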
I've read folks refer to Tailscale before, is it a sort of all in one solution where if I've got it on the host and client they can connect even when I don't have a DDNS/explicit IP endpoint? Or do I still need to do the standard setup and Tailscale "just" wraps and organizes the VPN? I couldn't really tell from their webpage.
Yes--basically it helps you create a private overlay network which acts like a VPN but only between your devices. It's pretty lightweight and uses wireguard under the hood. Tailscale routes traffic over the shortest path possible. In most cases, this is a direct, peer-to-peer connection. But if you're behind CGNAT or other "hard NAT" scenarios relaying through servers might be necessary:
https://tailscale.com/blog/how-nat-traversal-works
Thank you kindly, this was exactly the info I was seeking!
Tailscale automatically handles WireGuard to make a mesh network. When you connect to another device, the connection is direct, not through a VPN server. It handles NAT punchthrough and CGNAT automatically. In rare cases, it will have to use a relay server run by Tailscale, but that is only for extremely difficult network setups.
When you install it, you basically make a private lan on 100.x.x.x, where all of your devices are trusted. You don’t have to mess around with user accounts or oauth setup. If someone connects to your server via 100.42.42.42, it is guaranteed to be an authorized user, at least within the security built into Tailscale (oauth, 2fa, etc). You also get a ts.net subdomain exclusively for your usage. So you can connect to a server with server-name.something-cow.ts.net. The actual subdomain is assigned randomly.
That is the essential part of the service, but there are a ton more features and customization you can add. The access control list is very customizable, so you can limit people to specific machines. You can have Tailscale serve as auth for ssh. You can relatively easily share devices with other people. You get AirDrop-like file sharing between any of your devices. You can set a device as an exit node to act as a commercial VPN. It can even integrate with Mullvad so you can use Mullvad servers as Tailscale exit nodes.
Their free tier is incredibly generous and may look like a red flag « too good to be true », but it is true. If your connections don’t use their relay servers, they only handle an extremely small amount of data for the coordination between clients. So it’s very cheap to run those generous free tiers. And they are financially incentivized to make sure all networks possible can use direct connections, which saves them money and makes their service better for everyone.
Thank you for the explanation, this is quite helpful!
I realise this comment probably has mega "this is good for bitcoin" energy but for me it's a massive win that you didn't just declare the install hosed and recovered it. Everything about Linux is public, including the entire boot setup and using install media to remount and actually fix installs via inspection is something a competent admin can do. I swear in windows land the answer is almost always "recover the data, torch the install". Plain text files ftw!
I am glad that it was fairly easy to solve. Part of why I wanted to share this is there have been so many recent threads regarding switching from Windows to Linux and so it seemed like it would be good to present what the scope of the worst problem I've personally encountered on Linux has actually been in practice. In this case it was really the footgun of getting locked out of emergency mode as the underlying problem was very simple by comparison (I spent far longer obtaining and flashing an ISO than I did actually dealing with the root cause). But it's obviously not all sunshine and rainbows either since needing to use a live OS and mount the drive still kind of sucked compared to just getting into a kind of recovery mode gracefully.
Although, if there had been an Arch wiki entry about the emergency lockout issue I doubt this even would've felt significant enough to complain about. Sure I still would've needed to do the steps, but a lot of the real hassle was trying to figure out what the steps were supposed to be. The most helpful thing I found turned out to be https://www.reddit.com/r/linuxquestions/comments/t6795f/comment/kq4pzlb, the lowest voted comment on a multi-year old thread. Stuff like that is, at least to me, the real rough edge of Linux.
Oh yeah totally. You want new users with open eyes and realistic expectations otherwise no one wins.
re: the iso I always tell myself on each new install that I will make a recovery partition but I never do. I do however always keep a usb with arch/nix on my keys :)
serial distro hopper reporting in... I always have a stick around with at least a dozen isos on it so I am always close to a bootable live image, and i never bother with a recovery partition.
That said, it does frustrate me that several of the distros I've installed over the past couple of years have not pushed you to set a root password. I get that sudo is a thing, but there is no reason I can think of why you wouldn't set one anyway. It almost feels like it's a trap for new(ish) users
So true. I've absolutely screwed up my Linux system (or had really bad bugs or whatever) but in the majority of cases (Zorin being a weird one-off exception to the rule), I was able to recover from them without pretty much any lasting "damage". Not that I'm saying everyday users are going to be able to mess with that stuff, but for those with some level of tech knowledge, the amount of power at your fingertips in Linux is awesome. My short time as a sysadmin dealt with plenty of messing with systems in ways that I can't even fathom being able to do on other OSs
And totally agree on the plain text file aspect of things too. It's so awesome that you can literally create a system service by just writing a few lines of text in a file, for instance. There's so many things that are actually WAY more powerful and much more simple than they seem- which are things we take for granted as these esoteric/blackbox kind of things in Windows
What happened with zorin?
It ultimately might have been fixable somehow, but I got myself into some really bad dependency hell with versions of wine (all stemmed from 64 vs 32 bit versions and needing different ones for different programs). So, it got to the point where I couldn't install or uninstall certain things because they needed a dependency, but eventually got into this weird "loop" where to do anything with Y, it needed Z, and to do anything with Z, it needed Y, it was like this weird thing where two dependencies depended on each other and it seemed like it locked me out of solving the problem.
I had done plenty of online research about the problem and something led me to the Zorin forums at one point, where the end recommendation for someone else with a similar problem ended up being to reinstall because everything tried up to that point had not worked. It's been so long that I don't remember what about the issue was Zorin-specific necessarily, but I feel like I recall there being some choices Zorin made that compounded the problem, or made it impossible to resolve. Possibly its built-in Wine handling, IIRC (I think by default the install already has some Wine stuff set up without needing to install it separately, and because of it being tied in with the OS, that made it difficult to solve my way out of)
I went far down some dependency trees trying to solve my way up the tree (and have fixed problems this way before, both personally and as a sysadmin) eventually but could never get it to behave and it was to the point it was breaking other things trying to fix the original thing, and just got in such a corrupted state that I decided to just completely blow the install away. It was a laptop and didn't have any significant data or anything important on it, so it was better to start from scratch since it was painless, mostly, to do so
As great as Linux is, there are some situations like this that it just doesn’t handle all that gracefully.
On the surface it makes sense - for server use it’s no big deal if failure during boot just kicks you to a root shell for the admin to sort out. On the desktop though, it’s much more likely not the correct call.
With the recent increased focus on Linux as a desktop OS, I think a great deal of time and effort needs to be put into making it try harder in its failure mode. Both commercial OSes are more graceful in this regard and can automatically boot into a minimal-yet-graphical safe mode that logs into the normal user account, and there’s no reason why this shouldn’t be possible with Linux too. Will it take trial and error to get it working across the hardware gamut? Sure, but that’s no excuse to not do it.
Also, I’m not sure it’s a sane default to allow bad fstab entries to torpedo the system. That should probably be fixed.
In the case of a bad fstab entry I don’t think it’d actually cause a flashing drive icon on a Mac, and I’m not even sure it’d prevent boot. At worst, if the system is rendered unbootable a modern Mac will automatically reboot into the immutable recovery partition, or if that was removed for some reason will even download a system image from Apple (given a network connection) and produce a recovery partition to boot from without deleting user data.
You can try Command+Option+R to make it go into internet recovery mode, which will let it download a new system image and reimage the computer. You can even add Shift to that key combo to download the version the Mac came with instead.
I had the exact same crash happen two weeks ago when I added an old hard drive to my NAS, put the UUID into my fstab, formatted the drive to btrfs (which I believe caused the UUID to change) and then on mount the system crashed.
Fortunately for me I'm running NixOS! So the fix was simply to boot into the previous generation, update the UUID to the new correct one and rebuild my configuration. First time I ever saw the emergency mode screen in Linux though, so that was fun! :D
Luckily the actual fix wasn't too bad. I basically just looked for the word "failure" in the journal. The real headache was being locked out of emergency mode. That part sucked. As nice as it is that you can boot a live image and fix stuff by manually mounting your primary partition, it isn't fun to need to.
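For anyone wanting to do the same search, a tiny sketch: journalctl -xb prints the log for the current boot, and a case-insensitive grep narrows it to lines mentioning a failure. The helper name is made up. Matching on "fail" also catches "Failed" and "failure".

```shell
# Filter log text on stdin down to lines that mention a failure,
# prefixing each hit with its line number.
find_failures() {
  grep -i -n 'fail'
}

# Usage from the emergency shell or a chroot:
#   journalctl -xb --no-pager | find_failures
```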
I'm glad you're sorted out now. I find it strange that any experienced Linux user would not have a password set for root. How did you perform root-only commands? Passwordless su or sudo? Such a setup means the non-root user is essentially equivalent to root. It's just generally better not to run day-to-day user apps as root. What distro is this, that allows installation without setting a root password?
This machine was set up with the built in Arch install script. I no longer remember what exactly I did there though. I mostly just went through and set the stuff that was required and I'm guessing root password wasn't marked as required? I'm not really that sure it's that though as it's not like I had passwordless sudo or anything since sudo still required a password.
According to my searches the results I remember seeing for this issue were Arch and Fedora, so it seems both of them can easily accidentally end up in this state.
Update after further digging: a warning about leaving the root password blank is present on https://wiki.archlinux.org/title/Archinstall. I must've left it blank. That indicates that it doesn't enable passwordless sudo, instead just disabling root entirely so that you have to make other users sudoers. I don't really remember what I did years ago, especially because I don't think I was intending to like this distro when I did a random "let me install it to see it", but it must've been me just clicking through the defaults, including leaving the root password blank, based on the warnings on the page. Or maybe I even thought it was a good thing that it was disabled, banking on me not locking myself out. Either way, past me made a mistake.
It actually seems that's a security feature since attackers looking for SSH holes need to guess both a username and a password rather than attacking root. I think it would probably be good if it was harder to do by accident though. Like if there was a separate setting you had to specifically change to be allowed to disable root as it seems like a somewhat dangerous default.
Alternatively, it would be nice if emergency mode would let you select any sudoer rather than explicitly requiring root. It is probably true that disabled root is marginally better for security, but then it causes problems like a broken emergency mode. Maybe at some point we'll no longer assume that "root" is always enabled.
Yeah I've seen a number of distro installers that even specifically let you choose whether or not to set a root password at install time, likely for that security reason
Sensible sshd config (IMHO) would include disabling root login.
Not having a password for root doesn't preclude you from requiring a user's password for sudo. This is the default configuration for several Linux distros, and the configuration I use most often personally.
If I am in a situation where my system doesn't boot, I prefer to use a bootable USB rather than an emergency shell, since those situations are so rare and one-off, and keeping the root account fully disabled is a decent security layer.