12 votes

[SOLVED] Tech support request: Recovering from hard crashes in Linux

EDIT: Latest update


This is something so rudimentary that I'm a little embarrassed to ask, but I've also tried looking around online to no avail. One of the hard parts about being a Linux newbie is that the amount of support material out there seems to differ based on distro, DE, and also time, so posts from even a year or two ago can be outdated or inapplicable.

Here's my situation: I'm a newbie Linux user running Pop!_OS 19.10 with the GNOME desktop environment. Occasionally, games I'm playing will hard crash and lock up my system completely, leaving a still image of the game frozen on the screen indefinitely. The system stays there, completely unresponsive to seemingly any inputs. It doesn't happen often, but when it does it's almost always when I'm running a Windows game through Steam's Proton layer. I suspect it also might have something to do with graphics drivers, as I'll at times notice an uptick in frequency after certain updates, though that might just be me finding a suspicious pattern where none exists.

Anyway, what I don't know how to do is gracefully exit or recover from these crashes. No keyboard shortcut seems to work, and I end up having to hold the power button on my computer until it abruptly shuts off. This seems to be the "worst case scenario" for handling it, so if there is a better way I should go about this, I'd love to know about it.


EDIT: I really want to thank everyone for their help so far. My initial question has been answered, and for posterity's sake I'd like to post the solution here, for anyone who is searching around for this same issue and ends up in this thread:

  • Use CTRL+ALT+F3/F4/F5/F6 keys to access a terminal, where you can try to kill any offending processes and reboot if needed.
  • If that fails, use ALT+SYSRQ+R-E-I-S-U-B.

With that out of the way, I've added more information about the crashes specifically to the thread, primarily here, and some people are helping me out with diagnosing the issue. This thread is now less about the proper way to deal with the crash than it is about trying to identify the cause of the crash and prevent it in the first place.

28 comments

  1. [5]
    whbboyd

    Usually—but not always—when the graphics freeze like that, the cause is a kernel panic. (Proprietary graphics drivers are a prime source of panics, too.) It's not possible to recover from a panic; it's a "things have gone too badly to know how to recover" response and the kernel intentionally halts the scheduler, prints a message to the console (which unfortunately you can't see in graphics mode) and enters an infinite HLT loop, forcing the system to be hard-rebooted.

    You can confirm that the crashes are kernel panics by enabling kernel crash dumps (here are instructions for Ubuntu which I assume would also apply to Pop!) and seeing if a crash dump is generated and if it indicates a panic. Unfortunately, graphics-driver driven panics aren't really actionable (unless you're in the extremely enviable position of being able to compel Nvidia or AMD to fix or open source their drivers…).

    If the crashes aren't panics, look into enabling and using the Magic SysRq key to recover or at least cleanly shut down.

    14 votes
    1. arghdos

      unless you're in the extremely enviable position of being able to compel Nvidia or AMD to fix or open source their drivers…

      I think that AMD does have open source versions of their drivers:

      https://wiki.archlinux.org/index.php/AMDGPU

      Haven't used them personally though, so no idea how well they work (or not).

      8 votes
    2. [3]
      kfwyre

      Response to both you and @pseudolobster:

      Thank you! I appreciate your helpful instructions, not just for what to do but for the surrounding framework for understanding them as well.

      The instructions for enabling kernel crash dumps look pretty intensive, so I'm going to start with the console and Magic SysRq key and try those out the next time it happens. Interestingly, my SysRq default value is "176" instead of "1", which, according to this, limits a lot of what I can do with it? I tried REISUB and nothing visibly happened until the B, at which point my computer restarted. I don't know if REISU actually accomplished anything in the background when I did it, or if it's disabled by my default value. Let me know if I should change that to 1.
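For anyone puzzling over the same value: when kernel.sysrq is greater than 1 it is treated as a bitmask, and 176 works out to 128 + 32 + 16. The sketch below decodes a value into the REISUB keys it permits, using the bit meanings listed in the kernel's SysRq documentation. The names SYSRQ_BITS, REISUB_BITS, allowed_reisub_keys and read_sysrq_value are made up for illustration.

```python
# Bit values documented for kernel.sysrq when the setting is a bitmask (> 1).
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all real-time tasks",
}

# The bit each REISUB key depends on (illustrative helper table).
REISUB_BITS = {"R": 4, "E": 64, "I": 64, "S": 16, "U": 32, "B": 128}


def allowed_reisub_keys(value):
    """Return the REISUB letters that a given kernel.sysrq value permits."""
    if value == 0:
        return []                    # SysRq disabled completely
    if value == 1:
        return list(REISUB_BITS)     # every function enabled
    return [key for key, bit in REISUB_BITS.items() if value & bit]


def read_sysrq_value(path="/proc/sys/kernel/sysrq"):
    """Read the running value from the standard procfs location."""
    with open(path) as handle:
        return int(handle.read().strip())


print(allowed_reisub_keys(176))  # ['S', 'U', 'B']
```

With 176, only S, U and B are honoured from the keyboard, which would explain nothing visibly happening until the B.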

      I also tried the CTRL+ALT+F3 key command, which did get me to a terminal, so hopefully I can use that for recovery as well. I'll post back here the next time it happens with the results, but that might not be for a couple of days/weeks (there's no telling when it will happen again -- sometimes it's multiple times a week, others it's smooth for weeks at a time).

      6 votes
      1. [2]
        pseudolobster

        Let me know if I should change that to 1.

        You shouldn't need to recompile the kernel to change that: the value is a runtime setting, exposed as the kernel.sysrq sysctl (/proc/sys/kernel/sysrq), so it can be set to 1 if you want every function. That said, according to that thread, the S, U, and B features all work with your current value, which ought to be enough to finish writing things to disk if it's in the middle of something, and will at least prevent major disk corruption (which already isn't as big a concern as it used to be before we used journaling filesystems).

        8 votes
        1. kfwyre

          Ah, good to know. At the very least it should be better than holding down the power button!

          5 votes
  2. pseudolobster

    There are a couple of emergency keys you can use in Linux to recover from crashes. For lockups where your input is frozen you can try CTRL+ALT+F# to switch to another console. On Pop!_OS, console 1 is your login screen and 2 is your desktop, but 3 through 6 are terminals. If you hit CTRL+ALT+F3 you'll get a login; from there you can use a command like ps aux or top to find the offending process and kill it. You might also be able to just killall steam or killall proton or something like that.
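The find-and-kill step can also be scripted. Below is a minimal sketch, assuming a Linux /proc filesystem and that the process name you want matches the contents of /proc/<pid>/comm (which the kernel truncates to 15 characters). The function names find_pids and terminate are hypothetical, not part of any tool mentioned in the thread.

```python
import os
import signal


def find_pids(name, proc_root="/proc"):
    """Return PIDs whose command name (from /proc/<pid>/comm) equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                if handle.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited or is not readable
    return sorted(pids)


def terminate(pids, sig=signal.SIGTERM):
    """Ask each process to exit; pass signal.SIGKILL if SIGTERM is ignored."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # already gone
```

From a console login this would be used as terminate(find_pids("steam")), which is roughly what killall steam does.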

    If it's more severe than that, like your kernel has actually locked up due to a driver issue, the only tool you may have at your disposal is the "Magic SysRq" key, which won't let you kill the process or save your work, but it'll at least cleanly shut down the system. Basically the Print Screen key is also known as the SysRq key, and it has magic powers to talk directly to the kernel. By pressing Alt+PrintScreen and a letter, you can tell the kernel to do things like close all open files, sync the disks, and restart. One mnemonic people generally use for the shutdown sequence is "Raising Skinny Elephants Is Utterly Boring", so you hold down Alt+PrintScreen and type REISUB. More info on that here.


    Edit: I think I just realized now, after 20 years, that "raising skinny elephants..." doesn't actually work as a mnemonic. The "S" is in the wrong place. The other mnemonic I learned was "Reboot Even If System Utterly Broken", which matches "REISUB".

    The "S" though, is to "sync the disks", ie: flush the disk buffers, which some people recommend you do multiple times to ensure it actually worked. So, maybe RSEISUB isn't bad practice. Maybe throw some extra S's in there for good measure. Can't hurt.
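If a text console is still reachable, the sync and remount steps can also be requested without the keyboard chord by writing the command letters to /proc/sysrq-trigger as root. The kernel documentation notes that this route is not restricted by the kernel.sysrq value. The sketch below is only an illustration: send_sysrq and its parameters are made-up names, and the default deliberately leaves out "b" so that it does not reboot anything.

```python
import time


def send_sysrq(keys="su", trigger_path="/proc/sysrq-trigger", pause=2.0):
    """Write SysRq command letters one at a time (needs root on a real system).

    The default "su" only syncs and remounts read-only; add "b" to reboot.
    """
    for key in keys:
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
        time.sleep(pause)  # give the kernel a moment to finish each step
```

Calling send_sysrq("ssub") would mirror the "extra S's for good measure" idea and then reboot.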

    10 votes
  3. mat

    In addition to @whbboyd 's excellent answer, can I just say in reference to this:

    This is something so rudimentary that I'm a little embarrassed to ask

    Nah, don't be embarrassed. Nothing wrong with asking. Also you searched first. And you asked a really well formed question, lots of detail and a clear request at the end.

    10 votes
  4. Amarok

    Nobody is asking why it's crashing, and this sort of hard lock is unlikely to be a software issue in modern operating systems and modern hardware - this isn't the 90s when all it took to crash anything that wasn't a VAX was a dirty look at the chassis. It's possible you've got a hardware problem of some kind. Gaming puts more stress on a desktop than any other common activity, so if there are issues, it'll bring them out in a system that's otherwise sleeping while doing web browsing or other simple tasks.

    First order of business, leave a RAM checker running overnight and see if it teases out any memory issues. I don't think it's likely since memory problems rarely manifest the exact same error every time, they are more ephemeral and tend to cause unpredictable hard to reproduce errors - but it's worth checking. If there's a similar health checkup tool available for your graphics card maybe run that too, it has its own memory and processor. Depending on the quality of your BIOS you may be able to enable logging of some information in there that will survive the OS crash.

    Check your fans, make sure they are all clean and functional, make sure the heatsinks are clear of dusty buildup, that your power supply isn't harboring a dust bunny - a simple can of air will do wonders in no time. Heat is not your friend.

    Since it only happens when you are gaming it smells like a temperature or voltage regulation issue to me. There are plenty of programs that can query your mainboard and/or GPU to monitor that information. Wonky voltages can cause exactly this sort of hang in a graphics card. Those can come from the power supply or the mainboard or the card - anywhere a bad capacitor is lurking. The easiest way to rule out the GPU is to pop in any other GPU and stress the system again. If you've got two GPUs (say the built-in Intel 'gpu' and an added card) it's possible they are tripping over each other; I've run into that issue a couple of times. Best to disable the built-in graphics, probably with a BIOS setting.

    8 votes
  5. [4]
    spit-evil-olive-tips

    Something to check / rule out as a cause is any sort of overheating - especially since the issue occurs when your system and particularly the GPU is under heavy load. Does your case have adequate ventilation and airflow?

    There's several programs that will monitor your CPU & GPU temperatures in real time. If you have an Nvidia GPU, for example, nvidia-settings will show you the temperature of the GPU. You can try that both while the system is otherwise idle, as well as when playing a game (easier if the game is windowed vs full-screen, obviously).
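For a quick look without a GUI tool, the kernel also exposes temperatures under /sys/class/thermal, reported in thousandths of a degree Celsius. The sketch below reads every thermal zone it finds; read_thermal_zones is a hypothetical helper name, and which zones exist depends on the hardware. A GPU on the proprietary NVIDIA driver may not appear there at all, in which case nvidia-settings remains the place to look.

```python
import glob
import os


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone name (type): temperature in degrees C} for each zone found."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as handle:
                kind = handle.read().strip()
            with open(os.path.join(zone, "temp")) as handle:
                millidegrees = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # zone is missing a file or reporting garbage
        readings[f"{os.path.basename(zone)} ({kind})"] = millidegrees / 1000.0
    return readings
```

Printing read_thermal_zones() once while idle and once with a game running in a window gives a rough before/after comparison.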

    The next time the crash happens, note the time, then after you reboot open a terminal window and run journalctl -r. That gives you system logs in reverse-chronological order, so you can scroll to the time of the crash. Depending on the exact nature of the crash there might be nothing in the logs at that time, or there might be a smoking gun that points to the exact cause (such as a message from your GPU driver saying "shutting down due to overheating").
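To narrow the log down to the minutes around a crash, journalctl's own filters can do most of the work. The sketch below only assembles the arguments: -r, -b, -k, --since, --until and --no-pager are standard journalctl options, while journal_command and read_journal are hypothetical helper names. Looking at the previous boot assumes the journal is stored persistently.

```python
import subprocess


def journal_command(since=None, until=None, previous_boot=False, kernel_only=False):
    """Build a journalctl argument list for browsing logs around a crash."""
    cmd = ["journalctl", "-r", "--no-pager"]  # -r: newest entries first
    if previous_boot:
        cmd += ["-b", "-1"]                   # the boot before the current one
    if kernel_only:
        cmd.append("-k")                      # kernel messages only
    if since:
        cmd += ["--since", since]
    if until:
        cmd += ["--until", until]
    return cmd


def read_journal(**options):
    """Run journalctl with the chosen filters and return its output as text."""
    result = subprocess.run(journal_command(**options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, read_journal(previous_boot=True, kernel_only=True) should return the kernel messages from the boot that ended in the crash, newest first.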

    Another factor, if you have an Nvidia card, is whether you're using the official, closed-source Nvidia drivers, or the open-source Nouveau drivers. Switching between them might help rule out a driver bug vs. something else.

    7 votes
    1. [3]
      kfwyre

      It's a laptop with pretty terrible heat dispersion, but I've got it living on a pretty powerful fan. I don't think it's heat-caused because, when it crashes, it usually happens early on, shortly after loading the game (usually within the first minute or so). I'm using an NVIDIA card with the proprietary drivers, which I suspect is the cause since I've read many complaints of crashes with them.

      A crash happened shortly before I made this post, so I dug through journalctl -r for a bit (thanks for that tip, by the way!) and think I found where it happened, though I could be wrong (I polluted the journal with reboots trying out the different things suggested in this thread, as well as other things online, so it's hard to know if this particular one is the one where it crashed mid-game).

      Shortly before the lockup, it gave several GPU has fallen off the bus messages, followed by:

      A GPU crash dump has been created. If possible, please run nvidia-bug-report.sh as root to collect this data before the NVIDIA kernel module is unloaded.

      I'm assuming I can't do that since it was presumably unloaded when I restarted. There are also two lines in red, which I'm assuming means they're important?

      PCI post-resume error -19!
      HC died; cleaning up

      Anyway, I throw these out here not because you're on the hook for diagnosing anything but just because it might be worthwhile to someone who knows what's going on.

      6 votes
      1. [2]
        Soptik

        when it crashes, it usually happens early on, shortly after loading the game

        Did any of the crashes happen when your laptop was charging and wasn't running off battery? Also, can you reproduce the crashes, or does it crash randomly? Do some games work?

        5 votes
        1. kfwyre

          My laptop is essentially a desktop replacement, so it's almost always on the charger, so I'd say all of them (to the best of my memory) have happened while charging rather than on battery. Also, whenever it's off the charger I'm usually not gaming on it, and I don't think it's ever triggered when I'm not gaming (though again, this is to the best of my memory).

          I'm not able to reproduce the crashes, as they seem to occur randomly. The closest I've come to reproducing it was that I could get it to reliably fail within the first minute of loading up The Witness. That's actually the reason I think it's related to drivers. I put 20+ hours into The Witness on this computer with no issues, and with multiple play sessions lasting multiple hours without a crash. After a graphics card driver update, I started getting these hard crashes almost every time I booted into the game. I would get a few seconds of playability and then it would hard lock.

          I initially blamed the crashes on Proton, thinking that it was a regression in one of their updates, but I tried rolling back to different versions of that and the crashes persisted. The issue wasn't resolved until after I updated my graphics card drivers again. The crashes then stopped and the game was fully playable again.

          6 votes
  6. kfwyre

    Update: to keep those in the loop that might be interested, I've been in communication with System76's support team. They had me do some tests and confirmed that it's not a driver issue but a hardware issue. Given that the laptop is still under warranty, they're going to replace the graphics card for me.

    I want to thank everyone here that helped me, and especially those of you that clued me in to the idea that it was a more serious hardware issue than I thought. Without your input it's unlikely that I'd be getting this replaced under warranty, as I would have just ignored the issue until it got worse, likely after the warranty expired.

    7 votes
  7. [14]
    kfwyre

    Alright, so it happened again, this time in a native Linux game. This is now twice in two days, and I think I recently updated my graphics driver in the past week, though I'm not sure on that point.

    Anyway, I first tried CTRL+ALT+F3/F4/F5/F6 to get a terminal, but none of those worked, so I then did ALT+SYSRQ+REISUB which successfully rebooted the computer.

    In case anyone's interested, here's what journalctl -r looks like just before and during the crash (also I'm not entirely sure what I've shared here, so if any of the information here is sensitive, identifying, or compromising, please let me know so I can edit it out):

    Crash Journal
    -- Reboot --
    Nov 28 XX:15:09 zen kernel: sysrq: Emergency Remount R/O
    Nov 28 XX:15:08 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00001e34, 0x00002c54)
    Nov 28 XX:15:05 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00001e34, 0x00002b0c)
    Nov 28 XX:15:02 zen kernel: Emergency Sync complete
    Nov 28 XX:15:02 zen kernel: sysrq: Emergency Sync
    Nov 28 XX:14:59 zen kernel: sysrq: This sysrq operation is disabled.
    Nov 28 XX:14:58 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00001e34, 0x00002b0c)
    Nov 28 XX:14:56 zen kernel: sysrq: This sysrq operation is disabled.
    Nov 28 XX:14:55 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00001e34, 0x000029c4)
    Nov 28 XX:14:54 zen kernel: sysrq: This sysrq operation is disabled.
    Nov 28 XX:14:48 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00001e34, 0x000029c4)
    Nov 28 XX:14:45 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00001e34, 0x0000287c)
    Nov 28 XX:14:38 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00001e34, 0x0000287c)
    Nov 28 XX:14:35 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00001e34, 0x00002734)
    Nov 28 XX:14:28 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00001e34, 0x00002734)
    Nov 28 XX:14:25 zen /usr/lib/gdm3/gdm-x-session[1991]: (WW) NVIDIA: Wait for channel idle timed out.
    Nov 28 XX:14:20 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00001e34, 0x000024ac)
    Nov 28 XX:14:19 zen kernel: Asynchronous wait on fence NVIDIA:nvidia.prime:1e9a0f timed out (hint:intel_atomic_commit_ready+0x0/0x58 [i915])
    Nov 28 XX:14:13 zen /usr/lib/gdm3/gdm-x-session[1991]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00001e34, 0x000024ac)
    Nov 28 XX:14:08 zen kernel: Asynchronous wait on fence NVIDIA:nvidia.prime:1e9a0e timed out (hint:intel_atomic_commit_ready+0x0/0x58 [i915])
    Nov 28 XX:13:58 zen kernel: xhci_hcd 0000:01:00.2: HC died; cleaning up
    Nov 28 XX:13:58 zen kernel: xhci_hcd 0000:01:00.2: PCI post-resume error -19!
    Nov 28 XX:13:58 zen kernel: xhci_hcd 0000:01:00.2: Controller not ready at resume -19
    Nov 28 XX:13:58 zen kernel: xhci_hcd 0000:01:00.2: Refused to change power state, currently in D3
    Nov 28 XX:13:58 zen kernel: xhci_hcd 0000:01:00.2: Refused to change power state, currently in D3
    Nov 28 XX:13:57 zen kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                NVRM: nvidia-bug-report.sh as root to collect this data before
                                NVRM: the NVIDIA kernel module is unloaded.
    Nov 28 XX:13:57 zen kernel: [128B blob data]
    Nov 28 XX:13:57 zen kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
    Nov 28 XX:13:57 zen kernel: NVRM: Xid (PCI:0000:01:00): 79, pid=298, GPU has fallen off the bus.
    Nov 28 XX:13:57 zen kernel: [118B blob data]
    Nov 28 XX:13:57 zen kernel: NVRM: GPU at PCI:0000:01:00: GPU-535163a0-ee59-8ba7-a5ac-8df087621777
    
    5 votes
    1. [7]
      pseudolobster

      There doesn't seem to be anything particularly sensitive in that.

      So the "GPU has fallen off the bus" error means the video card has completely stopped talking to the system. This could be that it's faulty, it's overheated, or possibly some other reason. But it got me curious. It seems like the rest of the system is still functioning, but you obviously can't see anything on the screen. I wonder if at that point the system is still responsive.
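The line in that journal carrying the most information is the NVRM "Xid" one, since the number after the device (79 here) is the driver's error code. A small sketch for pulling those lines out of saved journal text follows. XID_PATTERN and find_xid_events are hypothetical names, and the regular expression is modelled only on the line format seen in this thread, so other driver versions may word it differently.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 79, pid=298, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+), (?P<detail>.*)"
)


def find_xid_events(journal_text):
    """Return (device, code, detail) tuples for every NVIDIA Xid line found."""
    events = []
    for line in journal_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append(
                (match["device"], int(match["code"]), match["detail"].strip())
            )
    return events


sample = (
    "Nov 28 XX:13:57 zen kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.\n"
    "Nov 28 XX:13:57 zen kernel: NVRM: Xid (PCI:0000:01:00): 79, pid=298, "
    "GPU has fallen off the bus.\n"
)
print(find_xid_events(sample))
```

Collecting the codes across several crashes makes it easier to tell whether it is always the same failure.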

      If this happens again, what I'd be inclined to try is hitting the caps lock key and seeing if it lights up. If the system is still responsive, one thing you could try is hitting CTRL+ALT+F3, type your username, enter, your password, enter, then type sudo nvidia-bug-report.sh, enter, your password, enter, then wait ten or fifteen seconds. After you reboot (btw at this point if your system is still responsive you could just type reboot to reboot), you may find a file called nvidia-bug-report.log.gz in your home directory, which might possibly tell us more about the root cause of this.

      Googling that error, I came across this thread, which has a couple suggestions you could try. There's a suggestion to try enabling "Prefer Maximum Performance" under the nvidia-settings PowerMizer section, but honestly I'd try the opposite (I think it's called "Prefer Consistent Performance"). There's also a few other things you can change with nvidia-settings including cranking the fan speed to max or underclocking it to try and get it to be less hot. There's a bunch of info about unlocking hidden settings in nvidia-settings from the crypto mining community. Typically they're trying to overclock things as much as possible and push their cards as hard as they can, but you can use the same settings to try and slow down / cool off a card that's running too hot. Here's one such thread talking about these settings.

      7 votes
      1. [6]
        kfwyre

        On last crash, the CTRL+ALT+F3 key command didn't work to get me a terminal, though I couldn't tell you whether or not caps lock triggered, as I didn't try that. I'll try again next time though. I'd love to be able to get that bug report.

        Built into Pop!_OS is a priority switcher for my graphics card, where I can toggle it between "Battery Life", "Balanced", and "High Performance" mode. It defaults to "Balanced" every boot and I usually just leave it there, though occasionally, when I think to, I'll switch it into "High Performance" before gaming. Given what you linked, I wonder if that has something to do with it? I'm going to play around with the game that crashed it (Tesla vs. Lovecraft, in case anyone is interested) in Balanced mode and again in High Performance mode and see if that triggers anything.

        4 votes
        1. [5]
          pseudolobster

          I'm actually using Pop!_OS right now too. I still haven't been able to discern what those power options do other than dimming the screen of my laptop. Whereas the nvidia-settings application definitely does let you set power options for your GPU. That might be your salvation if it's an overheating issue.

          PS: What model of laptop are you using?

          4 votes
          1. [4]
            kfwyre

            Yeah, I have no idea what they do either? I primarily use that tool not to switch the profile for the graphics card but to switch between my NVIDIA and Intel card, as I get way better battery life through the Intel one (as is expected).

            I'm on a System76 Oryx Pro (oryp5).

            5 votes
            1. [2]
              pseudolobster

              Oh, that's very nice. And much newer than I expected. Since Pop!_OS is fully supported on it and it's likely under warranty, you may want to reach out to System76 for support.

              I was sorta under the assumption it was an older laptop, and maybe thermal issues have been slowly cooking the graphics card. It still does seem like a hardware issue to me, but it's pretty hard to diagnose. One thing you could do, which would be a huge hassle, is to install Windows on it. If you get the same crashes on Windows, it's definitely a physical problem with your GPU.

              If you're convinced it's a driver issue, and it gets worse after certain updates, one thing you can do is when it's booting up, as soon as you see the System76 logo, start tapping the space bar (or pretty much any other key I think.) You'll get a Pop!_OS startup menu, from which you can choose "Pop_OS-oldkern" to boot into your previous kernel, with the previous nvidia driver.

              9 votes
              1. kfwyre

                I won't let Windows touch my precious machine! I bought this as a final and definitive step away from that OS!

                Switching the kernel is something I'm also going to try, on top of some of the temperature/power regulation suggestions. I should have time over the coming days to do some more thorough investigation of this.

                Thanks for all your help and insight. Your responses have been invaluable. When I opened the thread, I originally just wanted the answer to the question "What's CTRL+ALT+DEL for Linux?" The crashes didn't really bother me, other than as a minor annoyance, but given how much concern there is for a hardware issue, I think I will reach out to System76. I'd rather get any potential issues solved now, while it's still under warranty, than later when it's not.

                7 votes
            2. mrbig

              you may want to reach out to System76 for support

              Listen to @pseudolobster. You're in the rare position of being a Linux desktop user with official support. Take advantage of that!

              9 votes
    2. [6]
      kfwyre
      Link Parent
      I was able to trigger the crash again, which means I'm able to pretty reliably replicate it at this point simply by running a particular game (Tesla vs. Lovecraft). This time I ran the game in...

      I was able to trigger the crash again, which means I'm able to pretty reliably replicate it at this point simply by running a particular game (Tesla vs. Lovecraft).

      This time I ran the game in windowed mode and kept up Psensor so that I could see the temperatures of things while the game was going. I was able to play for maybe 5 minutes before crashing. CTRL+ALT+F3 didn't get me to a terminal, so I went with REISUB.

      At the point of the crash, my CPU was running at 51°C and my GPU was running at 53°C. Interestingly enough, in the journal there's a message about CPU temperatures being above a threshold, followed immediately by messages about them being normal. I found messages similar to these during the startup following the crash as well.

      Also, something I forgot to mention: about a minute after the crash happens, the fans on my computer go full blast (sounds like it's trying to lift off). Doing the REISUB sequence doesn't reset this and they'll continue going full blast. This past crash, I did REISUB before the fans went full blast thinking that would prevent it, but they engaged anyway after I'd logged in after the reboot. To get them to shut off I simply shut down the computer and then do a cold boot.

      Here's the journal of the crash for anyone interested:

      Crash Journal
      -- Reboot --
      Nov 29 XX:06:01 zen kernel: sysrq: Emergency Remount R/O
      Nov 29 XX:05:57 zen kernel: Emergency Sync complete
      Nov 29 XX:05:57 zen kernel: sysrq: Emergency Sync
      Nov 29 XX:05:54 zen kernel: sysrq: This sysrq operation is disabled.
      Nov 29 XX:05:52 zen kernel: sysrq: This sysrq operation is disabled.
      Nov 29 XX:05:49 zen kernel: sysrq: This sysrq operation is disabled.
      Nov 29 XX:05:26 zen kernel: mce: CPU9: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU3: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU10: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU4: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU5: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU11: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU7: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU1: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU0: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU6: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU8: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU2: Package temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU8: Core temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU2: Core temperature/speed normal
      Nov 29 XX:05:26 zen kernel: mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU11: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU8: Core temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:26 zen kernel: mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
      Nov 29 XX:05:25 zen io.elementary.appcenter-daemon.desktop[3015]: [2019-11-29T19:05:25] [ERR] nvctrl: Failed to retrieve measure of type 110204 for NVIDIA GPU 0
      Nov 29 XX:05:25 zen io.elementary.appcenter-daemon.desktop[3015]: [2019-11-29T19:05:25] [ERR] nvctrl: Failed to retrieve measure of type 210204 for NVIDIA GPU 0
      Nov 29 XX:05:25 zen io.elementary.appcenter-daemon.desktop[3015]: [2019-11-29T19:05:25] [ERR] nvctrl: Failed to retrieve measure of type 90204 for NVIDIA GPU 0
      Nov 29 XX:05:25 zen kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                  NVRM: nvidia-bug-report.sh as root to collect this data before
                                  NVRM: the NVIDIA kernel module is unloaded.
      Nov 29 XX:05:25 zen kernel: [128B blob data]
      Nov 29 XX:05:25 zen kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
      Nov 29 XX:05:25 zen kernel: NVRM: Xid (PCI:0000:01:00): 79, pid=283, GPU has fallen off the bus.
      Nov 29 XX:05:25 zen kernel: [118B blob data]
      Nov 29 XX:05:25 zen kernel: NVRM: GPU at PCI:0000:01:00: GPU-535163a0-ee59-8ba7-a5ac-8df087621777
      Nov 29 XX:05:25 zen io.elementary.appcenter-daemon.desktop[3015]: [2019-11-29T19:05:25] [ERR] nvctrl: Failed to retrieve measure of type 50204 for NVIDIA GPU 0
      Nov 29 XX:05:25 zen kernel: xhci_hcd 0000:01:00.2: HC died; cleaning up
      Nov 29 XX:05:25 zen kernel: xhci_hcd 0000:01:00.2: PCI post-resume error -19!
      Nov 29 XX:05:25 zen kernel: xhci_hcd 0000:01:00.2: Controller not ready at resume -19
      Nov 29 XX:05:25 zen kernel: xhci_hcd 0000:01:00.2: Refused to change power state, currently in D3
      Nov 29 XX:05:24 zen kernel: xhci_hcd 0000:01:00.2: Refused to change power state, currently in D3
      Nov 29 XX:04:31 zen wpa_supplicant[1211]: wlp0s20f3: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-59 noise=9999 txrate=144400
      

      Next Steps: As pointed out by @pseudolobster and @mrbig, my laptop is still under warranty, so I'll be contacting System76 about this issue. If it's a hardware issue, I'd like to get it taken care of before my warranty goes out.

      Because I suspect it might be a driver issue, I'm going to try the kernel fallback that @pseudolobster mentioned here. Before I do that, however, is there any way to check the date of my last driver update? I put 7 hours into this particular game about a week ago with no crashes whatsoever, and when I picked it back up this week it crashed in three out of three tries, so I'd like to know if my driver updated during that time. Is there a log I can check that will tell me this?

      6 votes
      1. [5]
        pseudolobster

        CTRL+ALT+F3 didn't get me to a terminal, so I went with REISUB.

        My theory there was that there's a chance you are at a terminal, you just can't see it because the video card isn't putting out video. If your capslock key is responsive it's an indication the system is still responsive, and you should be able to log in blindly and run the nvidia bug report script. I can't guarantee that report will contain anything useful, but it'd at least give us something to work with. The logs you have don't say much, since from the kernel's perspective your video card has simply disappeared.
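For anyone wanting to pull just the driver lines out of a long journal, here is a rough sketch. The function name `gpu_loss_events` is made up for illustration, and the patterns are simply the NVRM messages that appear in the journal posted above; other driver versions may word them differently.

```shell
# Hypothetical helper: filter kernel log text on stdin down to the NVIDIA
# driver messages that mark the moment the GPU stopped responding.
# The patterns are taken from the journal in this thread, not from any
# official list of messages.
gpu_loss_events() {
  grep -E 'NVRM: Xid|fallen off the bus|GPU crash dump'
}

# Assumed usage on a systemd machine with a persistent journal
# (kernel messages from the previous boot):
#   journalctl -k -b -1 --no-pager | gpu_loss_events
```

If the journal isn't kept across reboots on your machine, `-b -1` won't find anything, so treat the usage line as an assumption about your setup.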

        is there any way to check the date of my last driver update?

        Drivers in Linux are typically in the form of kernel modules, so you'd be looking for whatever the last update to your kernel was. From my experience Pop!_OS is pretty bad about removing old kernels, so you should still have all your old ones lying around in /boot/. By looking at the timestamps, you can see when they were installed.

        In my case, I've got three old kernels in my /boot/ directory, so when I type ls -l /boot in a terminal I get the following:

        -rw------- 1 root root 11391736 Oct 18 02:18 vmlinuz-5.3.0-19-generic
        -rw------- 1 root root 11393920 Oct 28 17:20 vmlinuz-5.3.0-20-generic
        -rw------- 1 root root 11398016 Nov 13 08:37 vmlinuz-5.3.0-22-generic
        

        I imagine yours will look similar, and from there you can see my kernel was updated on Oct 18, Oct 28, and Nov 13.

        6 votes
        1. [4]
          kfwyre

          I forgot to check the caps lock key! Sorry! I'll do that my next go around, as well as identify exactly what I'll need to type blindly into the terminal in order to produce that log.

          Running ls -l /boot shows the same series of kernels as you, but are the times listed actually installation times? I ask only because our timestamps match to the minute for each one (the hours are different, but they're all offset by the same amount, so I assume that's because of timezone differences?).

          4 votes
          1. [3]
            pseudolobster

            Hmm, I always sorta assumed that would be the installation time, but I guess it's the date the kernel was built.
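One way to sanity-check this is to print both timestamps side by side. As far as I understand it, the modification time is carried over from the package, while the inode change time is set when the file is written on your machine, so the latter should be closer to the install date. This is a sketch under that assumption; `kernel_times` is a made-up name and `stat -c` is the GNU coreutils form.

```shell
# Hypothetical helper: for each kernel image in a directory (default /boot),
# print its modification time (what ls -l shows) next to its inode change
# time (what ls -lc shows).
kernel_times() {
  dir="${1:-/boot}"
  for f in "$dir"/vmlinuz-*; do
    [ -e "$f" ] || continue            # glob matched nothing
    modified=$(stat -c %y "$f" | cut -d. -f1)
    changed=$(stat -c %z "$f" | cut -d. -f1)
    printf '%s  modified=%s  changed=%s\n' "${f##*/}" "$modified" "$changed"
  done
}

# Usage: kernel_times          (looks in /boot)
#        kernel_times /some/dir
```

If the two columns differ for every kernel, that supports the build-date reading of the `ls -l` output.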

            Your update history is in /var/log/apt/history.log. If you were to do sudo gedit /var/log/apt/history.log you should get a list of the exact dates and times you updated, and what was updated each time. Packages that could be affecting this are things like xserver-xorg-video-nvidia and linux-image.
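If scrolling through the whole file is tedious, something like the following can narrow it down. It's only a sketch: `driver_updates` is a made-up name, the package prefixes are the two mentioned above, and it assumes the usual history.log layout of a `Start-Date:` line followed by `Install:` or `Upgrade:` lines.

```shell
# Hypothetical helper: print "date time package" for every kernel or NVIDIA
# X driver package that an apt transaction installed or upgraded.
driver_updates() {
  awk '
    /^Start-Date:/ { date = $2 " " $3 }
    /^(Install|Upgrade):/ {
      line = $0
      sub(/^[A-Za-z]+: /, "", line)          # drop the "Upgrade: " prefix
      n = split(line, pkgs, "[)], ")         # one entry per package
      for (i = 1; i <= n; i++) {
        if (pkgs[i] ~ /^(linux-image|xserver-xorg-video-nvidia)/) {
          split(pkgs[i], parts, "[: ]")      # name ends at ":" or " "
          print date, parts[1]
        }
      }
    }
  ' "$1"
}

# Usage: driver_updates /var/log/apt/history.log
```

Older entries get rotated into compressed files next to history.log, so this only covers the current log unless you decompress those first.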

            To produce the log, you'd want to try the following:

            CTRL+ALT+F3
            yourusername <enter>
            yourpassword <enter>
            sudo nvidia-bug-report.sh <enter>
            yourpassword <enter>
            (wait 15 seconds)
            reboot <enter>
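Since typing blind is error-prone, it may help to set up a one-word command ahead of time. The sketch below is an assumption-heavy example: the function name `gpu_crash_report` and the `~/crash-logs` folder are invented for illustration, and it relies on nvidia-bug-report.sh writing its output into the current directory.

```shell
# Hypothetical helper: collect what we can into a timestamped folder so only
# one short command has to be typed blind. Prints the folder path at the end.
gpu_crash_report() {
  out_dir="${1:-$HOME/crash-logs}/$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$out_dir" || return 1

  # Kernel messages from the current boot, if journalctl is available.
  if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b --no-pager > "$out_dir/kernel.log" 2>&1 || true
  fi

  # The NVIDIA script needs root; sudo will still ask for the password.
  if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
    (cd "$out_dir" && sudo nvidia-bug-report.sh) || true
  fi

  printf '%s\n' "$out_dir"
}

# Assumed setup: put the function in ~/.bashrc, then after logging in blind:
#   gpu_crash_report <enter>
#   yourpassword <enter>
```

The kernel log gets saved before the password prompt, so even a mistyped password should leave something behind to look at.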

            6 votes
            1. [2]
              kfwyre

              You continue to be amazingly helpful! Thank you so much!

              So I checked my history.log using your instructions and it looks like linux-image and xserver-xorg-video-nvidia last got updated on November 16th. I then checked Steam, and that was the day that I started playing the game, so I think my theory about it being a driver issue might be out, as I was able to play it just fine multiple times since the 16th. I did another big update on the 24th, which was the update I was thinking of when I suspected it might have been a driver, but that was mostly lib* packages and seemingly had nothing related to the kernel or my GPU (though, admittedly, I don't know most of what I'm looking at).

              Also, just so you know, don't feel obligated to keep helping me diagnose this unless you want to! While I welcome your help and am incredibly grateful for it, I don't want you to feel like you're on the hook for anything. I've already messaged System76 with a lot of the information here, though it might take a while for them to get back to me on account of the holiday and weekend.

              6 votes
              1. pseudolobster

                No worries! I enjoy helping people, especially those who are so eager to learn.

                Let me know if you need any more help!

                6 votes
  8. kfwyre

    Final Update:

    System76 ended up replacing the motherboard (under warranty). I received my laptop back today and will test it out in the coming days to make sure everything is running as intended. Thanks to everyone here who helped me! Without your assistance I would have just put up with the issues and likely ended up with a very expensive lemon of a laptop.

    1 vote