9 votes

Does reformatting an ext4 partition fix bad sectors, and what are they anyway?

Posted January 12, 2021 by h3x

My Linux desktop is having a bit of difficulty with bad sectors. Lately I've had to boot into recovery and run an fsck a few times to try to fix a problem where the OS drops into read-only mode at the drop of a hat. Today I tried copying some files from one directory to another and got the following error message:
cp: error reading "foo/bar": Input/output error

I've just booted into a live USB and run fsck /dev/sda1 -c and it fixed a load of bad sectors, but the above error message is still happening.

A bit of googling tells me that this is down to bad sectors on the SSD, and I'm not really sure what that means. Is anybody able to enlighten me? And as a follow-up question, would reformatting the hard drive resolve the problem, or are there any other things I can try to fix it?

21 comments

[7]
teaearlgraycold
January 12, 2021

I'm no expert here, but I believe bad sectors are a sign of hardware failure. You should back up any files you want from the drive and buy a new one.

Drives can "fix" bad sectors by marking them unusable. This eats away at your drive's total usable capacity. It may come with a buffer of extra sectors for this use case, but if a lot are becoming unusable you can't trust the drive to reliably do its job.

13 votes
1. [6]
  mrzool
  January 12, 2021
  
  I second that. I would back up all the data (actually, you should already have backups ready to go at a moment's notice) and swap the drive asap.
  
  4 votes
  1. [5]
    h3x (OP)
    January 12, 2021
    
    Fortunately I have everything backed up already, so it’s just a case of replacing the drive under warranty. Just annoyed as the PC is less than a year old!
    
    2 votes
    
    [4]
    Pistos
    January 13, 2021
    
    It's annoying, for sure, but these things happen. I have one drive that is probably close to a decade old, with over 88,000 power-on hours: no bad sectors or sectors pending reallocation. Contrast that with a drive I got in early 2020, which has less than 6000 power-on hours, but it has over 3000 reallocated sectors, and over 1000 pending reallocation. Luck of the draw, it seems, with these things.
    
    2 votes
    
    [3]
    cfabbro
    January 13, 2021 (edited January 13, 2021)
    
    > Luck of the draw, it seems, with these things.
    
    There is always a bit of luck involved, but a drive's listed MTTF/MTBF (mean time to/between failures) is usually a pretty decent way to determine how reliable that drive will be. And Backblaze's HD reports are also a great way to sniff out which brands, drive families, and particular models are failing more often than they should.
    
    p.s. If you need/want longer lasting, more reliable drives, you should always spend a bit more and get yourself Enterprise class drives, since their failure rates are significantly lower than consumer grade drives.
    
    3 votes
    
    [2]
    freestylesno
    January 13, 2021
    
    Out of curiosity, how do they get lower failure rates? Better build quality and lower tolerances? Or do they bin them somehow?
    
    1 vote
    
    cfabbro
    January 13, 2021 (edited January 14, 2021)
    
    Build quality is generally higher, and QA testing more stringent. But there are also often mechanical and firmware differences in Enterprise class drives too. E.g. for vibration reduction+tolerance, better thermal regulation, support for hot-swapping and error recovery, etc. And as a byproduct of that, the warranties for Enterprise class drives are also typically much longer than consumer grade drives as well (5y vs. 1-3y).
[8]
Pistos
January 12, 2021

As others have said, you appear to be experiencing the symptoms of hardware that is failing due to age/use (which is normal and expected of both SSDs and HDDs). No one else said it yet, so I will:

Install https://www.smartmontools.org/ (with your distro's package manager) and run smartctl with some sensible arguments, like smartctl -a /dev/sda. See man smartctl. This is safe to do while any partitions on it are mounted. It will give you some diagnostic information about the disk -- in particular, number of bad sectors, error rate, disk lifetime, and others. You can also use smartctl to have the drive run some self tests (the quick one should take just 2 or 3 minutes), some of which are safe to run while partitions are mounted.

Once you have actual, hard metrics from smartctl, you can decide whether to take the step of buying a replacement drive. Watching the metrics change from day to day (e.g. number of bad sectors growing) can influence your decision.
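
For anyone who wants to track those numbers from day to day, here is a minimal Python sketch. It is not part of smartmontools; the function names are hypothetical and the sample report is made up for illustration. It assumes the classic ATA attribute table that smartctl -A prints, and that the drive exposes Reallocated_Sector_Ct and Current_Pending_Sector. NVMe drives and some SSDs report different fields, so check your own output first.

```python
# Sketch: pull bad-sector counters out of `smartctl -A` text output.
# Assumes the ATA attribute table layout; NVMe reports look different.

WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Made-up sample in the layout smartctl -A uses for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   097   097   010    Pre-fail  Always       -       3012
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5874
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1040
"""


def parse_counters(report):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counters = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row is skipped because "ID#" is not a number.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit():
                counters[fields[1]] = int(raw)
    return counters


def growth(earlier, later):
    """How much each counter rose between two snapshots."""
    return {name: later[name] - earlier[name] for name in later if name in earlier}


if __name__ == "__main__":
    print(parse_counters(SAMPLE))
```

To feed it real data you could capture the report with something like subprocess.run(["smartctl", "-A", "/dev/sda"], capture_output=True, text=True).stdout, which normally needs root. Saving one snapshot per day and passing two of them to growth() shows whether the counts are climbing.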

Whether or not you do any of the above, it's always a good idea to have a good backup system in place. I use http://duplicity.nongnu.org/ , which has nice features like incremental backup and interfacing with cloud storage. There's a little bit of a learning curve in terms of setup, but once you have it set up and cronned, I've found it to be essentially set-and-forget.

One more thing: If you find that your drive is in especially poor health, consider using https://www.gnu.org/software/ddrescue/ (again, from your package manager) to do a byte-for-byte copy/backup of the disk. I used this recently with a failing drive, and successfully made a byte-wise clone from the old disk onto a new disk, and it kept all partitions, etc. in place so it was just a matter of tweaking a few things in fstab, etc. and the new disk essentially acted like a drop-in replacement. Of course, the ddrescued copy won't have unrecoverable bytes (files) on it, but the main feature of ddrescue is to salvage data off disks in a less brute-force fashion (i.e. avoid making failing HDDs grind too much trying to get at ruined sectors).
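
To make the "salvage what reads cleanly, skip what doesn't" idea concrete, here is a toy Python sketch. It is not ddrescue's actual algorithm; the real tool is considerably more sophisticated and keeps its progress in a mapfile. The helper names, the 4096-byte block size, and the fake failing disk are all assumptions made for the illustration.

```python
import io

BLOCK = 4096  # bytes per read; an arbitrary choice for this sketch


def salvage(read_at, size, out, block=BLOCK):
    """Copy `size` bytes via read_at(offset, length) into `out`.

    Blocks that raise OSError are written as zeros and their
    (offset, length) is recorded so they can be retried later.
    Returns the list of bad ranges.
    """
    bad = []
    for offset in range(0, size, block):
        length = min(block, size - offset)
        try:
            data = read_at(offset, length)
        except OSError:
            bad.append((offset, length))
            data = b"\0" * length  # placeholder for unreadable data
        out.seek(offset)
        out.write(data)
    return bad


def make_flaky_reader(payload, bad_offsets):
    """Hypothetical stand-in for a failing disk: reads touching bad_offsets fail."""
    def read_at(offset, length):
        if any(offset <= b < offset + length for b in bad_offsets):
            raise OSError(5, "Input/output error")
        return payload[offset:offset + length]
    return read_at


if __name__ == "__main__":
    disk = bytes(range(256)) * 64  # 16 KiB of fake data
    image = io.BytesIO()
    bad = salvage(make_flaky_reader(disk, {5000}), len(disk), image)
    print("unreadable ranges:", bad)
```

The point of returning the bad ranges is that the good data gets copied first and the damaged areas can be revisited afterwards, instead of hammering a ruined sector before moving on. For a real recovery, use ddrescue itself rather than anything like this.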

Regarding your question about reformatting: If your intention is to copy the data elsewhere, reformat the partition, and bring the data back, hoping for better drive health, then no, doing that won't improve the physical health of a disk, if that is indeed the problem. Any bad sectors would still remain bad, though the filesystem can be informed about what areas are bad (websearch "ext4 bad blocks"), and avoid them.

7 votes
1. [6]
  dredmorbius
  January 12, 2021
  
  I'd prioritise minimising further damage and archiving data over drive diagnosis.
  
  Bad sectors indicate a known problem. Mitigate or replace. Diagnostics can be part of the post mortem.
  
  Data scrubbing or media destruction should also be priorities.
  
  S.M.A.R.T. is useful but not entirely reliable, and can itself precipitate further disk failure, especially on long tests. Preserve data first.
  
  3 votes
  1. [5]
    Pistos
    January 13, 2021
    
    Yes yes, though I would consider the basic info dump to be relatively harmless unless the drive is nearly in tatters.
    
    2 votes
    
    [4]
    dredmorbius
    January 13, 2021
    
    What would you rather have preserved? A short or long S.M.A.R.T. report, or your data?
    
    The device is damaged, preserve first, assess after, if still worth determining.
    
    Knowing what a bad S.M.A.R.T. report looks like is useful. OTOH, one failing drive I had showed its condition by taking four days to run a report (typically they run in ~1 hr on spinning rust). That's four more days of damage to the drive, and four fewer days to make backups.
    
    1 vote
    
    [3]
    Pistos
    January 13, 2021
    
    > typically they run in ~1 hr on spinning rust
    
    Hm. We might not be referring to the same thing. On my most-damaged drive (which I consider unsafe for use for anything of worth), smartctl -a finishes in under 5 seconds. Just like you're saying, I wouldn't recommend running anything that takes longer than 5 minutes on an unhealthy disk.
    
    2 votes
    
    [2]
    dredmorbius
    January 13, 2021
    
    smartctl -t long
    
    The -a flag simply reports SMART registers. It does not actually perform a drive test.
    
    Pistos
    January 13, 2021
    
    Yes, which is why I was suggesting it as something to do to get some info quickly and without making the disk thrash.
    
    2 votes
2. h3x (OP)
  January 12, 2021
  
  > Regarding your question about reformatting: If your intention is to copy the data elsewhere, reformat the partition, and bring the data back, hoping for better drive health, then no, doing that won't improve the physical health of a disk, if that is indeed the problem. Any bad sectors would still remain bad, though the filesystem can be informed about what areas are bad (websearch "ext4 bad blocks"), and avoid them.
  
  This is very helpful, thanks. I think I’ll get in touch with the company I bought the PC from and get the drive replaced. Luckily I already have all my files backed up, and it wouldn’t be terribly difficult to reinstall software. But this sounds like a good opportunity to learn a new tool, so I’ll give ddrescue a whirl!
  
  3 votes
Nodja
January 12, 2021

As other people said, this is most likely a hardware failure. But before you throw the drive away, consider that the failure may not be coming from the SSD itself. This is much less likely to be the case, but worth spending the extra 5 minutes.

If this is a SATA drive, try using another cable and plugging it into another port (a different colored one if available). Faulty cables are a pretty common cause of read errors, SATA controller failures less so, but it doesn't hurt to check (a different color port usually means a different controller).

If it's an M.2 drive (an SSD that slots directly into the motherboard), then try a secondary M.2 slot. Very unlikely to do anything but again, it doesn't hurt.

5 votes
Akir
January 12, 2021

There may be problems with the filesystem itself. You should also run e2fsck with the -p option, which automatically repairs any filesystem problems that can be fixed safely without intervention. Consider running it in verbose mode (-v) to get a better picture of what is happening.

I'm not sure if this applies to SSDs as well, but in HDDs that's generally a sign that it's about to fail. It might not be a bad idea to replace it. At the very least, make sure to do a backup ASAP.

3 votes
[3]
Amarok
January 12, 2021 (edited January 12, 2021)

That's a dead drive for sure. It's not dead yet, but the heads are probably losing coherence or worse, bouncing off the platters. If you listen, the drive is probably spinning up and down and/or making clicking/pinging noises more than it should as the heads try vainly to stay aligned.

Get your data off that sucker fast and get ready to put in a new drive and reinstall the operating system. There's little point in imaging the OS to transfer it to another drive if it's already suffered from bad sectors and data corruption. Some of the files in the image would be damaged, so your best bet is a fresh install. Consider an SSD for the operating system drive, it'll make a night vs day difference in the performance.

Inside those rotational drives, it's basically a vinyl record in layman's terms. Bad sectors are like scratches that destroy the tracks that store the data, and the data already there when it happens. That usually means the heads (like the record needle) are loose and bouncing off the platters.

If you have a screwdriver you can recover some nice strong flesh-eating magnets from the dead drive. Just don't get your finger pinched between them, or you'll find out the flesh eating part is no joke. ;)

I'm not clear on if you already have an SSD or not from your post. If it's an SSD, the memory cells are failing, and that's rather unusual unless you've frozen or cooked the drive somehow, or you got a lemon from the factory. They don't give up the ghost as easily as the rotational drives do, I've never even had one of mine fail yet. If it's still under warranty it's worth sending it back. I'm curious what brand it is.

3 votes
1. [2]
  unknown user
  January 12, 2021
  
  An SSD is much more likely to fail if it uses a cheaper data storage scheme such as TLC (Triple-Level Cell) or QLC (Quad-Level Cell). Basically that's a way of storing 3 and 4 bits of data per MOSFET in the drive (using 2³ and 2⁴ separate charge states per cell, respectively), and it's a very easy way of massively growing the amount of storage an SSD can hold on the cheap.
  
  Unfortunately that comes at the cost of longevity as well as performance. I wouldn't personally buy any SSD that uses these technologies. Usually the claimed write-cycle count on Single-Level Cell (SLC) and Multi-Level Cell (MLC, two bits per cell) NAND flash memory is good for a long time, though.
  
  6 votes
  1. h3x (OP)
    January 12, 2021
    
    This is good to know! I’ll look at the spec for the model that I have and make sure it’s replaced by a more reliable one
    
    2 votes
dredmorbius
January 12, 2021

No it does not. Bad sectors are physical media failure.

Formatting only creates the logical filesystem structure. This relies on media integrity.

Your drive is failing.

Back up data NOW. Power down the drive to prevent further degradation. Obtain a replacement. Restore to it.

Implement AND TEST a data backup system.

2 votes