[SOLVED] Tech support request: Recovering from hard crashes in Linux
EDIT: Latest update
This is something so rudimentary that I'm a little embarrassed to ask, but I've also tried looking around online to no avail. One of the hard parts about being a Linux newbie is that the amount of support material out there seems to differ based on distro, DE, and also time, so posts from even a year or two ago can be outdated or inapplicable.
Here's my situation: I'm a newbie Linux user running Pop!_OS 19.10 with the GNOME desktop environment. Occasionally, games I'm playing will hard crash and lock up my system completely, leaving a still image of the game frozen on the screen indefinitely. The system stays there, completely unresponsive to seemingly any inputs. It doesn't happen often, but when it does it's almost always when I'm running a Windows game through Steam's Proton layer. I suspect it also might have something to do with graphics drivers, as I'll at times notice an uptick in frequency after certain updates, though that might just be me finding a suspicious pattern where none exists.
Anyway, what I don't know how to do is gracefully exit or recover from these crashes. No keyboard shortcut seems to work, and I end up having to hold the power button on my computer until it abruptly shuts off. This seems to be the "worst-case scenario" for handling it, so if there is a better way I should go about this, I'd love to know about it.
EDIT: I really want to thank everyone for their help so far. My initial question has been answered, and for posterity's sake I'd like to post the solution here, to anyone who is searching around for this same issue and ends up in this thread:
- Use CTRL+ALT+F3/F4/F5/F6 to access a terminal, where you can try to kill any offending processes and reboot if needed.
- If that fails, use ALT+SYSRQ+R-E-I-S-U-B.
With that out of the way, I've added more information about the crashes specifically to the thread, primarily here, and some people are helping me out with diagnosing the issue. This thread is now less about the proper way to deal with the crash than it is about trying to identify the cause of the crash and prevent it in the first place.
Usually—but not always—when the graphics freeze like that, the cause is a kernel panic. (Proprietary graphics drivers are a prime source of panics, too.) It's not possible to recover from a panic; it's a "things have gone too badly to know how to recover" response and the kernel intentionally halts the scheduler, prints a message to the console (which unfortunately you can't see in graphics mode) and enters an infinite HLT loop, forcing the system to be hard-rebooted.
You can confirm that the crashes are kernel panics by enabling kernel crash dumps (here are instructions for Ubuntu which I assume would also apply to Pop!) and seeing if a crash dump is generated and if it indicates a panic. Unfortunately, panics caused by graphics drivers aren't really actionable (unless you're in the extremely enviable position of being able to compel Nvidia or AMD to fix or open source their drivers…).
If the crashes aren't panics, look into enabling and using the Magic SysRq key to recover or at least cleanly shut down.
I think that AMD does have open source versions of their drivers:
https://wiki.archlinux.org/index.php/AMDGPU
Haven't used them personally though, so no idea how well they work (or not).
Response to both you and @pseudolobster:
Thank you! I appreciate your helpful instructions, not just for what to do but for the surrounding framework for understanding them as well.
The instructions for enabling kernel crash dumps look pretty intensive, so I'm going to start with the console and Magic SysRq key and try those out the next time it happens. Interestingly, my SysRq default value is "176" instead of "1" which, according to this, limits a lot of what I can do with it? I tried REISUB and nothing visibly happened until the B, at which point my computer restarted. I don't know if REISU actually accomplished anything in the background when I did it, or if it's disabled by my default value. Let me know if I should change that to 1.
I also tried the CTRL+ALT+F3 key command, which did get me to a terminal, so hopefully I can use that for recovery as well. I'll post back here the next time it happens with the results, but that might not be for a couple of days/weeks (there's no telling when it will happen again -- sometimes it's multiple times a week, others it's smooth for weeks at a time).
You can change that at runtime without recompiling the kernel: it's the kernel.sysrq sysctl, also exposed as /proc/sys/kernel/sysrq. You may not need to, though. According to that thread, the S, U, and B features all work with the default of 176, which ought to be enough to finish writing things to disk if it's in the middle of something, and will at least prevent major disk corruption (which already isn't as big a concern as it used to be before we used journaling filesystems).
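For what it's worth, that 176 is a bitmask, so you can work out which keys it allows. Here's a minimal sketch that decodes it, assuming the bit meanings listed in the kernel's SysRq documentation. The function name decode_sysrq and the short labels are mine, not part of any tool.

```shell
# decode_sysrq VALUE: print the SysRq function groups a kernel.sysrq value enables.
# Bit meanings are taken from the kernel's SysRq documentation; the labels
# (and the letters in parentheses) are a simplified, illustrative summary.
decode_sysrq() {
    value=$1
    if [ "$value" -eq 0 ]; then echo "disabled"; return 0; fi
    if [ "$value" -eq 1 ]; then echo "all functions enabled"; return 0; fi
    while read -r bit name; do
        # Print the group only if its bit is set in the value.
        if [ $(( $value & $bit )) -ne 0 ]; then
            echo "$name"
        fi
    done <<'EOF'
2 console log level
4 keyboard control (R)
8 debugging dumps
16 sync (S)
32 remount read-only (U)
64 signal processes (E, I)
128 reboot/poweroff (B)
256 renice real-time tasks
EOF
}

# The value reported in this thread: 176 = 128 + 32 + 16.
decode_sysrq 176
```

For 176 this prints the sync, remount read-only and reboot/poweroff groups, which lines up with S, U and B working while R, E and I do nothing. To decode your own setting, pass it the contents of /proc/sys/kernel/sysrq.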
Ah, good to know. At the very least it should be better than holding down the power button!
There's a couple of emergency keys you can use in Linux to recover from crashes. For lockups where your input is frozen you can try CTRL+ALT+F# to switch to another console. On Pop!_OS, console 1 is your login screen and 2 is your desktop, but 3 through 6 are terminals. If you hit CTRL+ALT+F3 you'll get a login prompt; from there you can use a command like ps aux or top to find the offending process and kill it. You might also be able to just run killall steam or killall proton or something like that.
If it's more severe than that, like your kernel has actually locked up due to a driver issue, the only tool you may have at your disposal is the "Magic SysRq" key, which won't let you kill just the offending process or save your work, but it'll at least cleanly shut down the system. Basically the PrintScreen key is also known as the SysRq key, and it has magic powers to talk directly to the kernel. By pressing Alt+PrintScreen and a letter, you can tell the kernel to do things like terminate all processes, sync the disks, and restart. One mnemonic people generally use for the shutdown sequence is "Raising Skinny Elephants Is Utterly Boring", so you hold down Alt+PrintScreen and type REISUB. More info on that here.
Edit: I think I just realized now, after 20 years, that "raising skinny elephants..." doesn't actually work as a mnemonic. The "S" is in the wrong place. The other mnemonic I learned was "Reboot Even If System Utterly Broken", which matches "REISUB".
The "S" though, is to "sync the disks", ie: flush the disk buffers, which some people recommend you do multiple times to ensure it actually worked. So, maybe RSEISUB isn't bad practice. Maybe throw some extra S's in there for good measure. Can't hurt.
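If you do get a login on one of the text consoles, the "find it and kill it" step above could look something like this. It's only a sketch: terminate_gracefully is a made-up helper name and the PID in the usage comment is a placeholder. It asks the process to exit with SIGTERM first and only falls back to SIGKILL if it's still around after a grace period.

```shell
# terminate_gracefully PID [GRACE_SECONDS]
# Send SIGTERM, give the process a few seconds to exit, then send SIGKILL.
# (Hypothetical helper for illustration; plain kill/killall work too.)
terminate_gracefully() {
    target=$1
    grace=${2:-5}
    kill -TERM "$target" 2>/dev/null || return 0    # nothing to do: already gone
    while [ "$grace" -gt 0 ]; do
        kill -0 "$target" 2>/dev/null || return 0   # no longer exists after SIGTERM
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$target" 2>/dev/null                # still there: force it
    return 0
}

# Typical use from the text console (CTRL+ALT+F3):
#   ps aux --sort=-%cpu | head -n 5    # list the busiest processes and their PIDs
#   terminate_gracefully 12345         # 12345 is a placeholder PID
```

Trying SIGTERM first gives the game (or Steam) a chance to clean up after itself; SIGKILL can't be caught, so it's the last resort.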
In addition to @whbboyd 's excellent answer, can I just say in reference to this:
Nah, don't be embarrassed. Nothing wrong with asking. Also you searched first. And you asked a really well formed question, lots of detail and a clear request at the end.
Nobody is asking why it's crashing, and this sort of hard lock is unlikely to be a software issue in modern operating systems and modern hardware - this isn't the 90s when all it took to crash anything that wasn't a VAX was a dirty look at the chassis. It's possible you've got a hardware problem of some kind. Gaming puts more stress on a desktop than any other common activity, so if there are issues, it'll bring them out in a system that's otherwise sleeping while doing web browsing or other simple tasks.
First order of business, leave a RAM checker running overnight and see if it teases out any memory issues. I don't think it's likely since memory problems rarely manifest the exact same error every time, they are more ephemeral and tend to cause unpredictable hard to reproduce errors - but it's worth checking. If there's a similar health checkup tool available for your graphics card maybe run that too, it has its own memory and processor. Depending on the quality of your BIOS you may be able to enable logging of some information in there that will survive the OS crash.
Check your fans, make sure they are all clean and functional, make sure the heatsinks are clear of dusty buildup, that your power supply isn't harboring a dust bunny - a simple can of air will do wonders in no time. Heat is not your friend.
Since it only happens when you are gaming it smells like a temperature or voltage regulation issue to me. There are plenty of programs that can query your mainboard and/or GPU to monitor that information. Wonky voltages can cause exactly this sort of hang in a graphics card. Those can come from the power supply or the mainboard or the card - anywhere a bad capacitor is lurking. The easiest way to rule out the GPU is to pop in any other GPU and stress the system again. If you've got two GPUs (say the built in intel 'gpu' and an added card) it's possible they are tripping over each other, I've run into that issue a couple times. Best to disable the built in graphics, probably with a BIOS setting.
It's a laptop with pretty terrible heat dispersion, but I've got it living on a pretty powerful fan. I don't think it's heat-caused because, when it crashes, it usually happens early on, shortly after loading the game (usually within the first minute or so). I'm using an NVIDIA card with the proprietary drivers, which I suspect is the cause since I've read many complaints of crashes with them.
A crash happened shortly before I made this post, so I dug through journalctl -r for a bit (thanks for that tip, by the way!) and think I found where it happened, though I could be wrong (I polluted the journal with reboots trying out the different things suggested in this thread, as well as other things online, so it's hard to know if this particular one is the one where it crashed mid-game).
Shortly before the lockup, it gave several "GPU has fallen off the bus" messages, followed by:
I'm assuming I can't do that since it was presumably unloaded when I restarted. There are also two lines in red, which I'm assuming means they're important?
PCI post-resume error -19!
HC died; cleaning up
Anyway, I throw these out here not because you're on the hook for diagnosing anything but just because it might be worthwhile to someone who knows what's going on.
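For anyone else digging through a journal full of reboots, here's a rough sketch of how the messages quoted above could be pulled out of the previous boot's kernel log. The gpu_errors function name is made up, the patterns are just the strings from this post, and it assumes the journal is persistent so the previous boot is still available.

```shell
# gpu_errors: read log lines on stdin and print only those containing
# the messages quoted in this thread (case-insensitive match).
gpu_errors() {
    grep -i -E 'fallen off the bus|post-resume error|HC died'
}

# Usage: kernel messages (-k) from the previous boot (-b -1), without the pager:
#   journalctl -k -b -1 --no-pager | gpu_errors
```

Use -b -2, -b -3 and so on to step further back if there were several reboots since the crash.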
Did any of the crashes happen while your laptop was charging rather than running on battery? Also, can you reproduce the crashes, or does it crash randomly? Do some games work?
My laptop is essentially a desktop replacement, so it's almost always on the charger, so I'd say all of them (to the best of my memory) have happened while charging rather than on battery. Also, whenever it's off the charger I'm usually not gaming on it, and I don't think it's ever triggered when I'm not gaming (though again, this is to the best of my memory).
I'm not able to reproduce the crashes, as they seem to occur randomly. The closest I've come to reproducing it was that I could get it to reliably fail within the first minute of loading up The Witness. That's actually the reason I think it's related to drivers. I put 20+ hours into The Witness on this computer with no issues, and with multiple play sessions lasting multiple hours without a crash. After a graphics card driver update, I started getting these hard crashes almost every time I booted into the game. I would get a few seconds of playability and then it would hard lock.
I initially blamed the crashes on Proton, thinking that it was a regression in one of their updates, but I tried rolling back to different versions of that and the crashes persisted. The issue wasn't resolved until after I updated my graphics card drivers again. The crashes then stopped and the game was fully playable again.
Update: to keep those in the loop that might be interested, I've been in communication with System76's support team. They had me do some tests and confirmed that it's not a driver issue but a hardware issue. Given that the laptop is still under warranty, they're going to replace the graphics card for me.
I want to thank everyone here that helped me, and especially those of you that clued me in to the idea that it was a more serious hardware issue than I thought. Without your input it's unlikely that I'd be getting this replaced under warranty, as I would have just ignored the issue until it got worse, likely after the warranty expired.
Alright, so it happened again, this time in a native Linux game. This is now twice in two days, and I think I recently updated my graphics driver in the past week, though I'm not sure on that point.
Anyway, I first tried CTRL+ALT+F3/F4/F5/F6 to get a terminal, but none of those worked, so I then did ALT+SYSRQ+REISUB, which successfully rebooted the computer.
In case anyone's interested, here's what journalctl -r looks like just before and during the crash (also I'm not entirely sure what I've shared here, so if any of the information here is sensitive, identifying, or compromising, please let me know so I can edit it out):
Crash Journal
There doesn't seem to be anything particularly sensitive in that.
So the "GPU has fallen off the bus" error means the video card has completely stopped talking to the system. This could be that it's faulty, it's overheated, or possibly some other reason. But it got me curious. It seems like the rest of the system is still functioning, but you obviously can't see anything on the screen. I wonder if at that point the system is still responsive.
If this happens again, what I'd be inclined to try is hitting the caps lock key and seeing if it lights up. If the system is still responsive, one thing you could try is hitting CTRL+ALT+F3, typing your username, enter, your password, enter, then typing sudo nvidia-bug-report.sh, enter, your password, enter, then waiting ten or fifteen seconds. After you reboot (btw at this point if your system is still responsive you could just type reboot to reboot), you may find a file called nvidia-bug-report.log.gz in your home directory, which might possibly tell us more about the root cause of this.
Googling that error, I came across this thread, which has a couple suggestions you could try. There's a suggestion to try enabling "Prefer Maximum Performance" under the nvidia-settings PowerMizer section, but honestly I'd try the opposite (I think it's called "Prefer Consistent Performance"). There's also a few other things you can change with nvidia-settings including cranking the fan speed to max or underclocking it to try and get it to be less hot. There's a bunch of info about unlocking hidden settings in nvidia-settings from the crypto mining community. Typically they're trying to overclock things as much as possible and push their cards as hard as they can, but you can use the same settings to try and slow down / cool off a card that's running too hot. Here's one such thread talking about these settings.
On last crash, the CTRL+ALT+F3 key command didn't work to get me a terminal, though I couldn't tell you whether or not caps lock triggered, as I didn't try that. I'll try again next time though. I'd love to be able to get that bug report.
Built into Pop!_OS is a priority switcher for my graphics card, where I can toggle it between "Battery Life", "Balanced", and "High Performance" mode. It defaults to "Balanced" every boot and I usually just leave it there, though occasionally, when I think to, I'll switch it into "High Performance" before gaming. Given what you linked, I wonder if that has something to do with it? I'm going to play around with the game that crashed it (Tesla vs. Lovecraft, in case anyone is interested) in Balanced mode and again in High Performance mode and see if that triggers anything.
I'm actually using Pop!_OS right now too. I still haven't been able to discern what those power options do other than dimming the screen of my laptop. Whereas the nvidia-settings application definitely does let you set power options for your GPU. That might be your salvation if it's an overheating issue.
PS: What model of laptop are you using?
Yeah, I have no idea what they do either? I primarily use that tool not to switch the profile for the graphics card but to switch between my NVIDIA and Intel card, as I get way better battery life through the Intel one (as is expected).
I'm on a System76 Oryx Pro (oryp5).
Oh, that's very nice. And much newer than I expected. Since Pop!_OS is fully supported on it and it's likely under warranty, you may want to reach out to System76 for support.
I was sorta under the assumption it was an older laptop, and maybe thermal issues have been slowly cooking the graphics card. It still does seem like a hardware issue to me, but it's pretty hard to diagnose. One thing you could do, which would be a huge hassle, is to install Windows on it. If you get the same crashes on Windows, it's definitely a physical problem with your GPU.
If you're convinced it's a driver issue, and it gets worse after certain updates, one thing you can do is when it's booting up, as soon as you see the System76 logo, start tapping the space bar (or pretty much any other key I think.) You'll get a Pop!_OS startup menu, from which you can choose "Pop_OS-oldkern" to boot into your previous kernel, with the previous nvidia driver.
I won't let Windows touch my precious machine! I bought this as a final and definitive step away from that OS!
Switching the kernel is also something I'm going to try, on top of some of the temperature/power regulation suggestions. I should have time over the coming days to do some more thorough investigation of this.
Thanks for all your help and insight. Your responses have been invaluable. When I opened the thread, I originally just wanted the answer to the question "What's CTRL+ALT+DEL for Linux?" The crashes didn't really bother me, other than as a minor annoyance, but given how much concern there is for a hardware issue, I think I will reach out to System76. I'd rather get any potential issues solved now, while it's still under warranty, than later when it's not.
Listen to @pseudolobster. You're in the rare position of being a Linux desktop user with official support. Take advantage of that!
I was able to trigger the crash again, which means I'm able to pretty reliably replicate it at this point simply by running a particular game (Tesla vs. Lovecraft).
This time I ran the game in windowed mode and kept up Psensor so that I could see the temperatures of things while the game was going. I was able to play for maybe 5 minutes before crashing.
CTRL+ALT+F3 didn't get me to a terminal, so I went with REISUB.
At the point of the crash, my CPU was running at 51°C and my GPU was running at 53°C. Interestingly enough, in the journal there's a message about CPU temperatures being above a threshold, followed immediately by messages about them being normal. I found messages similar to these during the startup following the crash as well.
Also, something I forgot to mention: about a minute after the crash happens, the fans on my computer go full blast (sounds like it's trying to lift off). Doing the REISUB sequence doesn't reset this and they'll continue going full blast. This past crash, I did REISUB before the fans went full blast thinking that would prevent it, but they engaged anyway after I'd logged in after the reboot. To get them to shut off I simply shut down the computer and then do a cold boot.
Here's the journal of the crash for anyone interested:
Crash Journal
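Since a GUI monitor's readings are gone once the screen freezes, one option is to log temperatures to a file while the game runs. This is only a sketch: it assumes the kernel exposes sensors under /sys/class/thermal (readings there are in millidegrees Celsius), and the function names millideg_to_c and log_temps are made up for illustration.

```shell
# millideg_to_c: sysfs reports temperatures in millidegrees Celsius
# (for example 51000); convert to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# log_temps FILE [INTERVAL_SECONDS] [SAMPLES]
# Append one timestamped line per thermal zone, then sync so the newest
# readings reach the disk even if the machine locks up a moment later.
log_temps() {
    out=$1
    interval=${2:-1}
    samples=${3:-600}
    while [ "$samples" -gt 0 ]; do
        for zone in /sys/class/thermal/thermal_zone*/temp; do
            raw=$(cat "$zone" 2>/dev/null) || continue
            case $raw in
                ''|*[!0-9]*) continue ;;   # skip unreadable or non-numeric zones
            esac
            printf '%s %s %s C\n' "$(date +%T)" "$zone" "$(millideg_to_c "$raw")" >> "$out"
        done
        sync
        samples=$((samples - 1))
        if [ "$samples" -gt 0 ]; then sleep "$interval"; fi
    done
    return 0
}

# Usage: start logging in the background, then launch the game:
#   log_temps "$HOME/temps.log" 1 600 &
```

The thermal zones may not include the GPU; with the proprietary driver, something like nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader could be appended to the same file in the loop.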
Next Steps: As pointed out by @pseudolobster and @mrbig, my laptop is still under warranty, so I'll be contacting System76 about this issue. If it's a hardware issue, I'd like to get it taken care of before my warranty goes out.
Because I suspect it might be a driver issue, I'm going to try the kernel fallback that @pseudolobster mentioned here. Before I do that, however, is there any way to check the date of my last driver update? I previously put 7 hours into this particular game with no crashes whatsoever about a week ago, and when I picked it back up this week it has now crashed in three out of three tries, so I'd like to know if my driver updated during that time. Is there a log I can check that will tell me this?
My theory there was that there's a chance you are at a terminal, you just can't see it because the video card isn't putting out video. If your capslock key is responsive it's an indication the system is still responsive, and you should be able to log in blindly and run the nvidia bug report script. I can't guarantee that report will contain anything useful, but it'd at least give us something to work with. The logs you have don't say much, since from the kernel's perspective your video card has simply disappeared.
Drivers in Linux are typically in the form of kernel modules, so you'd be looking for whatever the last update to your kernel was. From my experience Pop!_OS is pretty bad about removing old kernels, so you should still have all your old ones lying around in /boot/. By looking at the timestamps, you can see when they were installed.
In my case, I've got three old kernels in my /boot/ directory, so when I type ls -l /boot in a terminal I get the following:
I imagine yours will look similar, and from there you can see my kernel was updated on Oct 18, Oct 28, and Nov 13.
I forgot to check the caps lock key! Sorry! I'll do that my next go around, as well as identify exactly what I'll need to type blindly into the terminal in order to produce that log.
Running ls -l /boot shows the same series of kernels as you, but are the times listed actually installation times? I ask only because our timestamps match to the minute for each one (the hours are different, but they're all offset by the same amount, so I assume that's because of timezone differences?).
Hmm, I always sorta assumed that would be the installation time, but I guess it's the date the kernel was built.
Your update history is in /var/log/apt/history.log. If you were to do sudo gedit /var/log/apt/history.log you should get a list of the exact dates and times you updated, and what was updated each time. Packages that could be affecting this are things like xserver-xorg-video-nvidia and linux-image.
To produce the log, you'd want to try the following:
CTRL+ALT+F3
yourusername <enter>
yourpassword <enter>
sudo nvidia-bug-report.sh <enter>
yourpassword <enter>
(wait 15 seconds)
reboot <enter>
You continue to be amazingly helpful! Thank you so much!
So I checked my history.log using your instructions and it looks like linux-image and xserver-xorg-video-nvidia last got updated on November 16th. I then checked Steam, and that was the day that I started playing the game, so I think my theory about it being a driver issue might be out, as I was able to play it just fine since the 16th multiple times. I did another big update on the 24th, which was the update I was thinking of when I suspected it might have been a driver, but that was mostly lib* packages and has seemingly nothing related to the kernel or my GPU (though, admittedly, I don't know most of what I'm looking at).
Also, just so you know, don't feel obligated to continue to help me diagnose this unless you want to! While I welcome your help and am incredibly grateful for it, I don't want you to feel like you're on the hook for anything. I've already messaged System76 with a lot of the information here, though it might take a while for them to get back to me on account of the holiday and weekend.
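For anyone repeating the history.log check above without scrolling through the whole file, here's a minimal sketch that lists only the transactions touching the kernel or the NVIDIA packages. It assumes the usual apt layout, where each transaction starts with a Start-Date: line followed by Install:/Upgrade:/Remove:/Purge: lines. The driver_updates name is made up, and the file is typically world-readable, so sudo shouldn't be needed.

```shell
# driver_updates [FILE]
# Print the start time and action of every apt transaction whose package
# list mentions linux-image or nvidia. Defaults to the live history log.
driver_updates() {
    file=${1:-/var/log/apt/history.log}
    awk '
        /^Start-Date:/ { date = $2 " " $3 }            # remember when this transaction began
        /^(Install|Upgrade|Remove|Purge):/ && /linux-image|nvidia/ {
            sub(/:$/, "", $1)                          # "Upgrade:" -> "Upgrade"
            print date, $1
        }
    ' "$file"
}

# Usage:
#   driver_updates
```

Older entries may have been rotated into compressed files next to history.log, so a date that doesn't show up here isn't proof that nothing was updated.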
No worries! I enjoy helping people, especially those who are so eager to learn.
Let me know if you need any more help!
Final Update:
System76 ended up replacing the motherboard (under warranty). I received my laptop back today and will test it out in the coming days to make sure everything is running as intended. Thanks to everyone here who helped me! Without your assistance I would have just put up with the issues and likely ended up with a very expensive lemon of a laptop.