15 votes

Anyone have advice (or horror stories) on setting up a 100GbE NAS with RDMA / SMB Direct?

Pretty much what the title says - I'm building out a smallish compute cluster and hoping to set up some centralised storage that won't be a bottleneck, but I'm very much not a networking specialist. Most of the load will be random reads from compute nodes pulling in the bits of various datasets they need to work on.

Is it plausible to throw a 100GbE ConnectX-5 card and 256GB RAM into a consumer AM5 machine, format everything in ZFS, and set up a network share with KSMBD? My understanding is that I want to ensure everything's using mirroring rather than worrying about RAIDZ parity if I'm optimising for speed, which is fine, and I know that I'll only get full throughput as far as things can be cached in RAM - but is it reasonable to expect ZFS ARC to do that caching for me? Dare I hope that the SMB driver will just work if I drop it in there between the filesystem and the NIC? Or have I crossed the line into enterprisey-enough requirements that it's going to be an uphill battle to get this working anywhere near line speed?
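
For a sense of why I'm so focused on the RAM side, here's the back-of-envelope I've been using - every number is a guess for illustration rather than a measurement of any real hardware:

```python
# Rough model of blended read throughput for a given ARC hit rate.
# All figures here are guesses for illustration, not benchmarks.

LINE_RATE_GBPS = 100 / 8    # one 100GbE port is ~12.5 GB/s of payload
RAM_READ_GBPS = 40.0        # guess: what the ARC + SMB path might serve from RAM
HDD_POOL_GBPS = 0.5         # guess: mirrored pairs of spinners on random-ish reads

def effective_read_gbps(hit_rate: float) -> float:
    """Blend the tiers by time-per-byte: misses to disk dominate quickly."""
    time_per_gb = hit_rate / RAM_READ_GBPS + (1 - hit_rate) / HDD_POOL_GBPS
    return 1 / time_per_gb

for hit_rate in (0.5, 0.9, 0.99, 1.0):
    print(f"{hit_rate:.0%} hit rate -> ~{effective_read_gbps(hit_rate):.1f} GB/s "
          f"(line rate {LINE_RATE_GBPS:.1f} GB/s)")
```

Even a few percent of reads falling through to the spinning disks drags the blended rate far below line speed, hence the cache obsession.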

3 comments

  1.
    davek804

    I assume you're not using 7200RPM spinning disks, right? 100GbE is wildly overkill if so - the disks will be the performance-limiting factor. As you've mentioned, a robust caching layer and well-designed applications taking advantage of it can help in a consumer setting.

    An NVMe cache pool in front of the hard drives on a 10GbE or 100GbE? Sounds lovely. :)

    7 votes
    1. Greg

      Yup, very focused on keeping it so that the majority of active reads come from cache - the broad plan is 256GB RAM, with the hope that most of the active data comes straight from there, 4x 2TB NVMe behind that for intermediate caching, and then 4x 26TB spinning disks behind that for actual bulk storage.

      The actual data volumes aren't massive (a few hundred GB of underlying data for a typical multi-hour run), but reads are heavily unpredictable and often repeated, because the data loaded into memory on the compute instances is replaced with processed output and can't be reused for subsequent related operations.

      Based on the current setup (dual NVMe drives on each of the couple of test compute nodes I already have running, with relevant data slowly rsynced to them in advance), even the random read performance on those consumer-grade SSDs is something of a bottleneck, which is one of the reasons I'm looking at centralised storage with a decent RAM cache. Using a tmpfs ramdisk on the compute nodes isn't an option because they need their RAM for actual in-progress data.
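
      For my own sanity, the rough tier arithmetic looks like this - the drive counts and sizes are the planned hardware, but the working set figure is a ballpark rather than anything measured:

      ```python
      # Quick sanity check on the planned tiers vs. a typical run's working set.
      # Sizes are the planned hardware above; the 400GB working set is a rough guess.

      ram_gb = 256                    # ARC target (minus whatever the OS and ZFS need)
      nvme_raw_gb = 4 * 2000          # 4x 2TB NVMe warm tier (halve it if that tier is mirrored)
      hdd_usable_gb = 4 * 26000 / 2   # 4x 26TB spinners in mirrored pairs -> ~52TB usable

      working_set_gb = 400            # "few hundred GB" per multi-hour run

      print(f"working set / RAM:  {working_set_gb / ram_gb:.1f}x  (won't all fit in ARC)")
      print(f"working set / NVMe: {working_set_gb / nvme_raw_gb:.2f}x (plenty of headroom)")
      print(f"bulk storage:       ~{hdd_usable_gb / 1000:.0f}TB usable on the mirrored spinners")
      ```

      So a run's working set is a bit bigger than RAM but tiny relative to the NVMe layer, which is why I'm hoping the combination covers the repeated reads.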

      6 votes
  2. Greg

    Things I've learned as I fall down the research rabbit hole on this one - hopefully useful if anyone comes across this thread later, and with any luck, if I've misunderstood anything too egregiously, someone might see this and flag it before I start building:

    • RDMA over ethernet is weird. If you're like me and picked up what little you know about networking 15+ years ago, you probably think of ethernet as a standard defined by things like its tolerance to packet loss, and its use of randomised exponential backoff to deal with collisions. Memories spring up of early 2000s university lecturers explaining that flexibility and resilience are why it won over token ring back in the 80s. It turns out that RoCE v2, which seems to be the go-to standard for this kind of very high speed data access, depends on lossless ethernet - a set of extensions that make the standard quite different to the one I'm familiar with, and kind of do the opposite of what I expect ethernet to be. This doesn't seem to be a problem per se, just a useful piece of understanding that I didn't previously have when approaching this.

    • Switches matter a lot. This goes hand in hand with the lossless requirements; there's basically only one 100GbE switch available new at a small business price point, the Mikrotik CRS504-4XQ-IN, and it doesn't support PFC. I've seen conflicting reports on whether that means no RoCE, or partially working RoCE with some bottlenecks, but either way it doesn't seem a sensible option for this use case. That means used enterprise gear instead, which is its own whole learning curve on both the hardware and software side that I haven't yet dived into.

    • Affordable NICs use a lot of PCIe lanes. In theory there's enough bandwidth to do 100GbE on a PCIe x4 card - even an M.2 adapter - but that requires PCIe 5.0 support. The sensibly priced 100GbE hardware out there is primarily server-grade stuff from 5+ years ago, meaning a lot of it is PCIe 3.0 and needs a full x16 slot even for a single port running at its rated speed (I've put the napkin maths in a sketch after this list). The NVIDIA/Mellanox cards also tend to have different PCIe version support between models even in a single generation, which further narrows the field when searching; the newer generations (ConnectX-7 and ConnectX-8) do support PCIe 5.0, but they're still selling for over a thousand per card and are massive overkill here. From what I can see, the Intel E810-CQDA2 seems a decent middle-ground option: readily available, cheaper than even some of the older ConnectX cards (I'm seeing around €200 from China), and PCIe 4.0, meaning it'll at least run one of the two ports properly even in an x8 slot.

    • Consumer CPUs and motherboards aren't the way to go. Putting together the compute nodes has given me a strong preference for consumer-grade hardware: for parallel workloads you'll get vastly better performance from the six Ryzen CPUs you could buy for the price of one Threadripper, or the four 5090s you could buy for the price of one RTX 6000 Blackwell. Turns out that's absolutely not the way to go where memory capacity and bus bandwidth are the main concern! Second or third gen AMD Epyc platforms are surprisingly affordable, and available in a normal ATX format so you don't need to buy a whole decommissioned server that sounds like a jet engine. The Supermicro H12SSL-I seems to be the best middle ground between price and capability right now (the H11SSL-I would probably also do the job if you don't mind being limited to Zen 2 chips, as would the Gigabyte MZ32-AR0 v3, though I'm seeing reports of that board being very picky about PCIe devices), and the motherboard, a 28-core Epyc Milan CPU, and 512GB RAM(!) cost about the same as a 16-core Ryzen with motherboard and 256GB RAM. The CPU and RAM clocks are meaningfully slower, but should still be more than enough to do the job, and double the RAM plus 128 PCIe lanes for networking and NVMe are both incredibly useful.
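
    For the PCIe point above, this is the napkin maths I'm relying on - the per-lane figures are the usual post-encoding approximations, and it ignores protocol overhead on both the PCIe and ethernet side, so treat it as an upper bound:

    ```python
    # Approximate usable PCIe bandwidth per lane (GB/s, after encoding overhead),
    # compared against the ~12.5 GB/s payload rate of a single 100GbE port.
    # Ignores DMA and protocol overheads on both buses.

    PER_LANE_GBPS = {"PCIe 3.0": 0.985, "PCIe 4.0": 1.969, "PCIe 5.0": 3.938}
    LINE_RATE_GBPS = 100 / 8  # one 100GbE port

    for gen, per_lane in PER_LANE_GBPS.items():
        for lanes in (4, 8, 16):
            bandwidth = per_lane * lanes
            verdict = "enough for one port" if bandwidth >= LINE_RATE_GBPS else "not enough"
            print(f"{gen} x{lanes:<2}: ~{bandwidth:5.1f} GB/s -> {verdict}")
    ```

    Which is where "PCIe 3.0 needs a full x16, PCIe 4.0 x8 can feed one port, PCIe 5.0 can do it on x4" comes from.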

    So, putting that all together, the plan now is an Epyc 7413, a Supermicro H12SSL-I, and 512GB RAM for the actual NAS, with multiple dual-port 100GbE E810 NICs to directly connect the compute nodes and avoid worrying about switching for the moment. If and when more compute nodes are added, those extra NICs can be repurposed for the new machines, leaving just one in the NAS, and at that point I'll figure out which switch will actually do the job (ideally while also not sounding like a jet engine), spend the time and money to get one, and work out how to configure it.
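
    And purely to convince myself the switchless topology stays manageable, a toy sketch of how the point-to-point links could be addressed - the node names and the 10.100.0.0/24 range are placeholders I've made up, not anything configured yet:

    ```python
    # Toy addressing plan for the switchless setup: each compute node gets its own
    # point-to-point /31 to the NAS. Node names and the 10.100.0.0/24 range are
    # placeholders, not anything that exists yet.
    import ipaddress

    nodes = ["node1", "node2", "node3", "node4"]
    links = ipaddress.ip_network("10.100.0.0/24").subnets(new_prefix=31)

    for node, link in zip(nodes, links):
        nas_ip, node_ip = link[0], link[1]   # a /31 is exactly two addresses, one per end
        print(f"{node}: NAS side {nas_ip}/31 <-> {node_ip}/31 on the compute node")
    ```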