6 votes

Linux Question: I think my sys m.2 is failing and want to copy my / data for backup via cli

So, I'm using Arch i3wm. I have multiple copies of my /home/username (I am the sole user), and I have a "Spare" drive with media, games, and other goodies, some of which are also stored on partitions on the m.2 in question, but they have backups.

The reason I ask is that I've already had an m.2 fail, at the end of '21 (I didn't even know that was a thing; it barely lasted a year, and it failed without a warning). Things are acting shoddy again now, and I just bought a second m.2 for my games. I guess I could swap most of the whole thing over, but I know the boot partition is easier to just rebuild from scratch... which I had to do last week.

Ultimately, what's making me suspicious is this: when I upgraded to the new drive and unplugged all my non-m.2 satas, I also added some memory and a new power supply. After the upgrade (Monday of last week, so the 5th), the system wouldn't boot. I used a usb to troubleshoot and my /boot partition was apparently no gouda. I redid that, and everything was fine... until this week. Then my new Games partition (basically the new drive) failed fsck and the system got stuck in a boot loop on Tuesday. I could boot to an emergency root shell, but I couldn't skip the fsck and keep the Games disk auto-mounted (I know I changed something to randomize fsck on bootup, but I'm still kinda looking into how I managed that...), so I just removed it from my fstab and it booted fine. Twice. I just manually mounted the drive and all was great. Then my SO sent me a screenshot today while I was at work showing that the machine had apparently rebooted (it had bypassed my screen lock) and that my / partition (on the older m.2) was reporting an EXT4-fs error, reading directory lblock:0 and whatnot.

So, that's the history of what's going on. If anyone can offer advice, [mostly] on the backup stuff, I'd appreciate it, though as I said, /home and the important tangible stuff is saved. If you also have any input beyond my suspicion that the drive is failing (since the /boot partition and now the / partition are crapping out), please feel free to share.

(Also, thanks for letting me in. This is what I'd typically post on reddit and probably have to repost 10 times depending on the sub to get the right keywords and tags and yes, I already searched the internet but my search will not match yours... sigh)

10 comments

  1. vord
    ddrescue might be the tool you are looking for.
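
    For anyone curious what a rescue copy boils down to, here is a rough Python sketch of the core idea: copy in fixed-size blocks, and when a block can't be read, zero-fill it, note the offset, and keep going instead of aborting. This is not ddrescue or its algorithm (the real tool retries, narrows down bad areas, and keeps a resumable map file). `rescue_copy`, the 64 KiB block size, and the `reader` hook are made up for illustration.

```python
import os


def rescue_copy(src_path, dst_path, block_size=64 * 1024, reader=None):
    """Copy src to dst block by block, zero-filling blocks that fail to read.

    Returns a list of (offset, length) ranges that could not be read.
    `reader` is a hypothetical hook (fd, offset, length) -> bytes that exists
    only so read failures can be simulated without an actual dying disk.
    """
    if reader is None:
        def reader(fd, offset, length):
            return os.pread(fd, length, offset)

    bad_ranges = []
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or block device.
        total = os.lseek(src_fd, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < total:
                length = min(block_size, total - offset)
                try:
                    data = reader(src_fd, offset, length)
                except OSError:
                    # Unreadable block: remember it and write zeros instead.
                    bad_ranges.append((offset, length))
                    data = b""
                # Pad so every offset in the image matches the source.
                dst.write(data.ljust(length, b"\0"))
                offset += length
    finally:
        os.close(src_fd)
    return bad_ranges
```

    Pointing something like this at a real partition needs root, and the image has to land on a different, healthy disk. For the actual job, use ddrescue itself, which takes an input, an output, and a map file so an interrupted run can be resumed.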

    6 votes
  2. [5]
    CuriosityGobble
    My first and immediate thought was "partition magic". Copy the entire volume while it's still working, then do a low level diag on your disk.

    My second thought is that m.2's really don't fail too often and what you're seeing is very atypical. You gotta look at root cause here.

    I would look to rule out BIOS, heat, and MoBo issues before continuing much further. It may also be worthwhile to see if there's anything about your distro that could be bad for an m.2/ssd storage, if you don't find any other culprit. It's time to do some digging, I think.

    Best of luck internet friend.

    4 votes
    1. [4]
      Asinine
      I'm already on the lookout for oddities... temps appear to be good (using corectrl but it only shows gpu temps). The last issue was when I was still at work, so the computer was not being used (and thus, not under excess stress either for cpu or gpu).

      I've put sensors on watch for now, but things don't look bad - though it looks like the PCI adapter is fluctuating between about 35 and 50 degC (I'd say average around ~42 though). Still, while that looks high to me just for idling, that's not anything I'd worry about...
      Edit: I realized that's the pci the new drive is on. The sys drive is pretty solid at 37.9 degC.
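
      In case it helps anyone else keep an eye on temps without a GUI: the kernel exposes sensor readings under /sys/class/hwmon, with each `temp*_input` file holding millidegrees Celsius. Below is a small Python sketch that walks that tree. `read_hwmon_temps` is just an illustrative name, and whether an NVMe drive shows up there (usually with the chip name `nvme`) depends on the kernel and drive, so treat that part as an assumption.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"<hwmonN>:<chip>/<label>": degrees Celsius} for each sensor."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                millideg = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but not readable right now
            base = temp_file.name[: -len("_input")]
            label_file = temp_file.with_name(base + "_label")
            label = label_file.read_text().strip() if label_file.exists() else base
            temps[f"{chip_dir.name}:{chip}/{label}"] = millideg / 1000.0
    return temps
```

      Calling `read_hwmon_temps()` and printing the dict every few seconds gives roughly what `sensors` shows, in a form that is easy to log to a file and compare against the time of the next failure.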

      1. [3]
        CuriosityGobble
        A million years later....

        Did you find a root cause? Was it any of the things that I mentioned? I hope my feedback was helpful. I'm also hoping you didn't need a new controller or system board or something.

        If I recall correctly there's some low-level hardware test that the manufacturer built into the hardware that you might be able to run to determine if the issue is located within the drive itself.

        Consider the possibility that you're getting a bunch of bits flipped by a focused shower of neutrinos from outer space, and an intergalactic conspiracy.

        1 vote
        1. Asinine
          Not quite a million.. but hehe.
          I currently have the m2 on manual mount. I had it on that when the weirdness that prompted me to post was going on, but it hasn't occurred since. Unfortunately, I tend to stop pursuing problems when they "fix" themselves (see: my car history. See also: it never ends up well...)

          Since the boot partition on the original m2 was the issue right after the upgrade, and then the second partition (/) started acting up, I would suspect the mboard. I'm trying to balance finances, so I'm just coping at the moment and opting for your neutrinos theory.

          That being said, I'm curious as to what I should search for with this comment of yours, if you have a good idea?

          "If I recall correctly there's some low-level hardware test that the manufacturer built into the hardware that you might be able to run to determine if the issue is located within the drive itself."

        2. Asinine
          Actually, I am going to chime in again. I can't speak for the original failure drive, but everything is still running well, and temps are notably higher in general, as summer is attempting to be a thing.
          I had put this secondary m.2 into the second slot, which had a piece of metal covering it. The metal had a sticky bit on the bottom that did not look or feel like any kind of thermal-transfer material, so I pulled the sticky bit off, then put the metal back on thinking it could act as a sort of heat sink. That's when things went sideways and I originally posted this. The metal is now gone and nothing else has changed, so I suspect that heat may have been the issue and I just didn't catch it at the time.

          Honestly, I guess I'll see in the next year. This thread has made me realize I have a 2+ year old mboard though, so, that might be something I'll be looking to upgrade in the future. Granted, the system plays all my old video games just fine, so we'll see.

  3. [2]
    imperator
    Have you run smartctl against the drive to see if there are drive errors?

    If it's not temps, and not user error you may need a new mb.

    Recently on Arch, I've had my NAS drive fail to mount at boot, but then mounting it from a terminal works right away.

    I also had boot issues where the initramfs wasn't generating correctly after installing a new kernel. Reinstalling the kernel fixed it each time, and I haven't had those issues in the later 6.3x kernels.
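
    If you'd rather check this from a script than read the report by eye, here is a hedged Python sketch that pulls a few health fields out of smartctl's JSON output for an NVMe drive. The field names are what I remember from smartmontools 7.x (`smart_status.passed` and `nvme_smart_health_information_log`), so verify them against your own output. The function names and the default `/dev/nvme0` are placeholders.

```python
import json
import subprocess


def summarize_nvme_health(report):
    """Pick the handful of fields that usually matter from a parsed report."""
    log = report.get("nvme_smart_health_information_log", {})
    return {
        "passed": report.get("smart_status", {}).get("passed"),
        "critical_warning": log.get("critical_warning"),
        "media_errors": log.get("media_errors"),
        "percentage_used": log.get("percentage_used"),
        "available_spare": log.get("available_spare"),
        "temperature_c": log.get("temperature"),
    }


def looks_suspicious(summary):
    """True if the drive itself reports a failed check, a warning, or media errors."""
    return (
        summary["passed"] is False
        or bool(summary["critical_warning"])
        or bool(summary["media_errors"])
    )


def read_smart_report(device="/dev/nvme0"):
    """Run smartctl (needs root) and return its JSON report as a dict."""
    # smartctl's exit status is a bitmask, so a non-zero code is not
    # necessarily a failed run; parse stdout instead of using check=True.
    result = subprocess.run(
        ["smartctl", "--json", "--all", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

    Usage would be something like `looks_suspicious(summarize_nvme_health(read_smart_report()))`. A clean result only says the drive thinks it is healthy; it doesn't clear the motherboard, the slot, or the power supply.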

    2 votes
    1. Asinine
      I just ran smartctl, everything looks good. I'm afraid it is a mboard issue... I'm just trying to check everything else, since I'll likely have to upgrade things if it is.

  4. [2]
    romeoblade
    Since you had a drive fail before the power supply swap and now another after it, I lean towards it being a motherboard issue. If the machine is custom-built, I would check that the motherboard is on the proper standoffs, with the appropriate screws. Years ago I dealt with a motherboard (a friend's machine, which a local mom and pop shop "built" for him) that had issues because it wasn't screwed in correctly and was missing standoffs, which was causing shorts. I was able to prove the issue by switching the case out for a cardboard box, though I ended up replacing the motherboard anyway. The other thing you could try, if the first m2 failure was a coincidence/fluke, is running the machine for a while with the new drives and the old power supply. I doubt that's the issue, but if the first failure was unrelated, I've seen bad power supplies cause similar issues.

    As for backups, it might be prudent to get a Backblaze B2 subscription and use something like rclone on a cron schedule to ship your backups offsite. Backblaze is reasonable on price, and they have been around for a while.

    Another option is Borg backup or rsync over ssh on a schedule to another machine, if you have one available. Either way, with the way the machine is failing at the moment, I wouldn't trust any physical drive hooked directly to it for backups. For backups, the 3-2-1 rule comes into play: 3 copies of your data (your production data and 2 backup copies) on two different media (e.g. disk and tape), with one copy off-site for disaster recovery.
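
    To make the 3-2-1 rule concrete, here is a tiny Python sketch that checks a list of copies against it. The `Copy` class, `check_3_2_1`, and the example entries are invented for illustration; it only encodes the rule as stated above and doesn't back anything up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str      # free-form description of where the copy lives
    medium: str    # e.g. "nvme", "sata-hdd", "cloud", "tape"
    offsite: bool  # True if it is stored away from the machine's location


def check_3_2_1(copies):
    """Return a list of ways the given copies fall short of the 3-2-1 rule."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need at least 3")
    media = {c.medium for c in copies}
    if len(media) < 2:
        problems.append("all copies are on the same kind of media")
    if not any(c.offsite for c in copies):
        problems.append("no off-site copy")
    return problems


# Example: live data plus one unplugged SATA backup, nothing off-site.
current = [
    Copy("system m.2", "nvme", offsite=False),
    Copy("spare SATA drive in a drawer", "sata-hdd", offsite=False),
]
print(check_3_2_1(current))
```

    Adding an off-site copy (a B2 bucket, say) to that example clears both complaints, since it brings the count to three and puts one copy somewhere else.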

    1. Asinine
      I built it myself, and having worked (albeit briefly) in a computer store like that mom and pop one a long while ago, I'm very careful about the standoff screws. That shouldn't be the issue, though I might go through and re-check it.

      I typically only used Antec TruePower PS's, but my SO picked up an extra EVGA G5 at the same wattage. I don't see any power issues, but I'm not saying it's not the PS either.

      Honestly, I don't have any data that is so important that I would be destitute if I lost it all. That already happened in 2018... lost all my pictures from living abroad and about 7 years prior, going back to when digital cameras became affordable, and all my napster mp3s and others, videos, docs (some from old BBSs), etc etc etc. Legal stuff is on paper, the stuff I use most is on the gaming clouds, or I can start the game over. I have a few satas that I have backed up to in the past (the ones I referred to in the post), and they are not plugged in. So overall, I'm not worried about that. Que sera I suppose.