16 votes

ZFS is crud with large pools! Give me some options.

Hey folks

I'm building out a backup NAS to hold approximately 350TB of video. It's a business device. We already own the hardware: a Gigabyte S451 with twin RAIDed OS SSDs plus 34 x 16TB SAS HDDs. I used TrueNAS SCALE and ZFS as the filesystem because... it seemed like a good idea. What I didn't realise is that with dedupe and LZO compression on, it would seriously hinder the IO. Anyway, long story short, we filled it as a slave target and it's slow as can be. It errors all the time. I upgraded to 48GB of RAM as ZFS is memory intensive, but it's the IOPS that kill it. It's estimating 50 days for a scrub, which is nuts. The system is in RAID6 with all disks dedicated to this pool, which is correct insofar as I need all the disk available.

Now I know ZFS isn't the way for this device - any ideas here? I'm tempted to go pure Debian, XFS or EXT4, and soft RAID. I'd rather it be simple and managed via a GUI for the rest of the team, as they're all Windows admins and scared of the CLI, but I suppose I can give them Webmin at a push.
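If I went that route, the soft RAID side would be something roughly like this (a sketch only - the device names are placeholders, and a single 34-wide RAID6 brings its own rebuild-time worries):

    # placeholder device names, not our real layout
    mdadm --create /dev/md0 --level=6 --raid-devices=34 /dev/sd[b-z] /dev/sda[a-i]
    mkfs.xfs /dev/md0
    mount /dev/md0 /mnt/backup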

I'm about ready to destroy over 300TB of copied data to rebuild this as it crashes too often to do anything with, and the Restic backup from the Master just falls over waiting for a response.

Ideas on a postcard (or Tildes reply)...?

UPDATE:

After taking this thread into consideration, some of my own research and a ZFS calculator, here's what I'm planning - bear in mind this is for an Archive NAS:

36 Disks at 16TB:

9x16TB RAID-Z2
4 x VDEVs = Data Pool
Compression disabled, dedupe disabled.

Raw capacity would be 576TB, but after parity and slop space we're at 422TB usable. In practice, if we cap it at 80% full, I'm actually going to cry at 337TB total.
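For the record, the equivalent pool layout from the command line would look roughly like this (TrueNAS builds it through the GUI; the pool name and device names below are placeholders):

    # 4 x 9-wide RAID-Z2 vdevs in one pool, compression and dedupe off
    zpool create -o ashift=12 -O compression=off -O dedup=off tank \
      raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi \
      raidz2 sdj sdk sdl sdm sdn sdo sdp sdq sdr \
      raidz2 sds sdt sdu sdv sdw sdx sdy sdz sdaa \
      raidz2 sdab sdac sdad sdae sdaf sdag sdah sdai sdaj

    # rough capacity maths: 36 x 16TB = 576TB raw; 4 vdevs x 2 parity disks = 128TB of parity;
    # 448TB of data disks, ~422TB usable after slop/overhead, ~337TB at an 80% fill cap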

At this moment, the NAS1 server is rocking 293TB of used space. Since I'm using Restic to do the backup of NAS1 to NAS2, I should see some savings, but I can already see that I will need to grow the shelf soon. I'll nuke NAS2, configure this and get the backup rolling ASAP.

My bigger concern now is that NAS1 is set up the same way as NAS2, but never had dedupe enabled. At some point we're going to hit a similar issue.

Thank you for all of your help and input.

16 comments

  1. [2]
    Comment deleted by author
    Link
    1. g33kphr33k
      Link Parent
      Shoot me now. I missed that bit =\

      5 votes
  2. Turtle42
    (edited )
    Link
    I don't know if I'd want to give up on ZFS just yet. Its benefits are massive, and ZFS is definitely meant for large pools. Something could have happened with the dedupe and compression settings. I'm not smart enough to know.

    I'd suggest posting on practicalZFS; I'd be really interested to see what Jim Salter (site owner, longtime sysadmin) and Allan Jude (ZFS & BSD developer) from the 2.5 Admins podcast suggest before blowing away 300TB. They are the beacons of knowledge for ZFS. Good luck.

    10 votes
  3. [3]
    Greg
    Link
    When you say it’s in RAID6, is that using hardware RAID underneath the ZFS pool? If so, that could be a significant source of overhead: ZFS will be optimising based on what it thinks is the physical layout of the blocks, the RAID controller will be presenting the block storage in a way it thinks the OS wants to see, and the mismatch between the two may lead to a lot of extra IOPS being burned.

    There's a really good summary (far better than my limited understanding could have managed!) in the top answer to this Stack Overflow question:

    If you're unlucky and you didn't properly align your filesystem partitions, that 128kB block spans two RAID stripes that aren't in cache, and the controller needs to read 14 MB, recompute parity, then write 14 MB. All to write one 128kB block.
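    A quick, non-destructive way to check which situation you're in - if the pool sits on a hardware RAID6 volume, zpool status will show one huge device rather than 34 individual disks:

        # read-only checks: the pool layout as ZFS sees it, and the block devices the OS sees
        zpool status
        lsblk -o NAME,SIZE,TYPE,MODEL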

    8 votes
    1. [2]
      xk3
      Link Parent
      don't most tools, like sgdisk, align partitions properly?

      sudo sgdisk -n 0:0:0 --align-end -t 0:8300 /dev/sdX
      
      1. Greg
        Link Parent
        As far as I’m aware, even with proper alignment there’s likely to be an overhead involved in reading and calculating parity on an entire stripe every time ZFS writes a block, and an associated increase in cache misses which exacerbates the issue. The part I quoted illustrates the worst case, where the overhead is doubled by spanning two stripes, but even in the average case the slowdown can be significant.

        1 vote
  4. Amarok
    Link
    I've been through this tango before. It's very easy to misread the ZFS documentation when setting up storage pools and configure them in a suboptimal fashion. I still have this bookmarked: The Things Nobody Told You About ZFS. I'll wager that #1 on that list is the reason you're having poor performance - well, that and deduplication, which is never a good idea unless you know for a fact it'll make a difference in your data set, which it almost never will. The compression is always good and should always be on, though. Double check how you set up your vdevs and storage pools against that blog post and I guarantee you'll find the problem and be able to resolve it. The SAN cluster I built at work with ZFS went from 30-second delete operations to 0.001-second delete operations after I set the virtual devices up properly, using that blog post as the 'missing ZFS manual'. It can and does scream, right up to the theoretical maximum performance of the hardware once it's set up properly.

    7 votes
  5. [6]
    spit-evil-olive-tips
    Link
    as others have said, dedupe in ZFS is a "here be dragons" feature. it should more or less never be used, outside of some very specific use cases and workloads. I wish they would put it behind a warning or an "I know what I'm doing" feature flag or something, because it's such an attractive nuisance.

    34 X 16TB SAS HDD disks

    another thing to look at, performance wise - are these all in the same vdev? as in, do you have 2 parity drives and 32 data drives?

    that's much wider than is typically recommended - a dozen drives in a vdev is usually the absolute maximum you want. ZFS needs to compute parity across the entire vdev, which will slow it down even once dedupe is off.

    with that many drives, you want a pool made up of multiple vdevs, each a raidz2 (aka raid6) of less than a dozen drives (such as 3 vdevs of 11 drives, plus a warm spare, or 4 of 8 drives plus two spares, etc). this pool:vdev separation is how ZFS is able to scale to large pools, when used correctly.

    if you're on reddit, /r/zfs is one of the few remaining helpful niche subreddits. but the baseline advice they give you will be pretty much what I just said - disable dedupe, and make sure to use smaller vdevs. this will unfortunately require re-creating the entire pool from scratch.

    (and as long as you're doing that, make sure to set the ashift on the pool to 12, because it may have been autodetected as 9 for 512-byte sectors, and that would hurt performance even more)
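    for reference, the relevant commands look roughly like this (pool name is a placeholder, and TrueNAS exposes the same knobs in its UI). note that turning dedupe off only affects new writes - blocks already in the dedup table stay deduped until they're rewritten or destroyed:

        # dedup and compression are per-dataset properties
        zfs set dedup=off tank
        zfs get -r dedup,compression tank

        # check what ashift the vdevs were actually created with (I believe zdb reports it per vdev)
        zdb -C tank | grep ashift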

    6 votes
    1. [5]
      g33kphr33k
      Link Parent
      Sounds like I'm nuking from orbit. It was actually a tech working for me that configured these when they arrived; I was hands-off and accepted the configs. I'm the person that said dedupe and compression would be cool, so I'm not guilt-free - I just expected the tech to do the legwork on the config.

      I don't need any of the advanced features that come with ZFS, I just need a performant filesystem.

      6 votes
      1. [4]
        vord
        (edited )
        Link Parent
        In which case, you should probably switch to a mirrored vdev configuration. Using parity stripes is often more hassle than it is worth, especially since it is not a replacement for proper backups.

        You'll be losing some storage capacity (only 50% of total purchased disk as opposed to ~75%), but it's much more tolerant of failed disks, and performance is minimally impacted when one fails. You can resilver far faster, reducing the chances of a 1-disk failure turning into a 2-disk or 3-disk failure.
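        Roughly, the shape of it is this (placeholder pool and device names, just to show the idea) - each mirror pair is its own vdev and the pool stripes across all of them:

            zpool create -o ashift=12 tank \
              mirror sda sdb \
              mirror sdc sdd \
              mirror sde sdf
            # ...and so on for the remaining pairs; "zpool add tank mirror sdX sdY" grows it later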

        6 votes
        1. [3]
          Amarok
          Link Parent
          That's my favorite way to set it up. I've come around to the mindset of 'the hell with parity' because if you eliminate the raidz everything gets dirt simple. Just pair up disks two by two as mirrors. Use those mirrored pairs as the basic building block for everything and stripe it all, ditch the parity.

          Sure, you lose half your space. In return you get the fastest performance any file system can deliver, with bulletproof fault tolerance, blistering read speeds and near-instant completion of operations that otherwise give ZFS a lot more to chew through. Resilvering is never faster than it is when it's mirrors all the way down.

          Space is cheap. Drives are cheap. RAID parity is an ancient holdover from the days when those statements were less true and the disks were a larger portion of the cost. In fact, going with mirrors means you can worry a lot less about the quality of the disks. Replacing failed ones stops being an ordeal, and performance is barely impacted at all.

          4 votes
          1. [2]
            vord
            Link Parent
            I'd qualify that slightly: for home users limited to 3-4 drives, where downtime is acceptable and $400 is a pretty big chunk of change, RAIDZ is a useful compromise.

            Heck, personally, I switched from ZFS to BTRFS RAID1 at home because I have a huge number of disparate drives, and it turns out that on consumer networking gear for media consumption...that's plenty of speed, with a ton more flexibility and less maintenance. If a drive dies, I add a new larger one and it just keeps chugging with no capacity loss.

            1 vote
            1. g33kphr33k
              Link Parent
              After taking this thread into consideration, some of my own research and a ZFS calculator, here's what I'm planning - bear in mind this is for an Archive NAS:

              36 Disks at 16TB:

              9x16TB RAID-Z2
              4 x VDEVs = Data Pool
              Compression disabled, dedupe disabled.

              Raw capacity would be 576TB, but after parity and slop space we're at 422TB usable. In practice, if we cap it at 80% full, I'm actually going to cry at 337TB total.

              At this moment, the NAS1 server is rocking 293TB of used space. Since I'm using Restic to do the backup of NAS1 to NAS2, I should see some savings, but I can already see that I will need to grow the shelf soon. I'll nuke NAS2, configure this and get the backup rolling ASAP.

              My bigger concern now is that NAS1 is set up the same way as NAS2, but never had dedupe enabled. At some point we're going to hit a similar issue.

              Thank you for all of your help and input. I'll put this as an edit to my original rant/question.

              3 votes
  6. [3]
    bengine
    Link
    Do you really need deduplication for storing video? TrueNAS has a great page on dedup with tips on troubleshooting at the end. The recommendation is 1-3GB RAM/TB of storage for dedup (probably more) and a fast SSD special device for the deduplication table. Without enough memory or a fast device for the dedup table, it will be extremely slow, as you're using IOPS on your spinning rust for both the data and the deduplication table at the same time.
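    To put rough numbers on that rule of thumb for this box: ~300-350TB of data at 1-3GB of RAM per TB works out to somewhere between roughly 300GB and 1,000GB of RAM for the dedup table to sit comfortably in memory, against the 48GB actually installed - so it was always going to end up hammering the disks instead.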

    3 votes
    1. [2]
      g33kphr33k
      Link Parent
      I don't need it, especially now I'm using Restic as the backup method.

      Issue is, deleting video that has been compressed and deduped on rust with zero SSD cache is killing ZFS. I'm trying to delete the videos on the slave to make space for the new Restic data. Every time I do, I get an OS hang, the IO goes crazy and the system eventually dies. Now the backups are failing. I'm stuck.

      1. bengine
        Link Parent
        Can you add some SSDs as a dedup vdev to help temporarily? Consumer SSDs aren't recommended long term since they'll wear out, but as a short-term speedup it could really help. An M.2 PCIe card and some 1TB consumer drives in mirrored pairs is probably cheaper than adding a boatload more RAM.
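        Something along these lines (pool and device names are placeholders) - a mirrored pair added as a dedup allocation class vdev:

            zpool add tank dedup mirror nvme0n1 nvme1n1

        Bear in mind that, as far as I know, you can't remove a dedup/special vdev again from a pool that contains raidz vdevs, so treat it as permanent - though it sounds like this pool is getting rebuilt anyway.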

  7. xk3
    Link
    Offline dedup makes a lot more sense for large files than online dedup. Just run new files against the existing hashes: https://github.com/martinpitt/fatrace

    If all your files are big, it might make sense to do a fast hash by reading partial data, like sample_hash.py (when you get a hash match you could then switch to a full hash, like sample_compare.py#L21)
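    As a rough shell equivalent of that idea (the path and size threshold are placeholders): hash only the first few MB of each large file as a cheap prescreen, then do a full hash only when two prefixes collide:

        # prefix-hash large files; lines sharing the same 64-char digest are dedup candidates
        find /mnt/tank/video -type f -size +100M -print0 |
        while IFS= read -r -d '' f; do
            printf '%s  %s\n' "$(head -c 8M -- "$f" | sha256sum | cut -d' ' -f1)" "$f"
        done | sort | uniq -w64 --all-repeated=separate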