Currently I’m running a RAIDZ2 vdev that is 9 drives wide. The purpose of this question is to find out what the actually recommended drive width is for each Z2 vdev in a pool.
While everything is running fine, I suspect I’m at the upper limit of drives for a Z2 vdev. Currently I’m not running hot spares in the pool, but I plan to add three drives in a hot-spare role as soon as I can acquire more drives.
@sgt-mike, this sounds to me like a possible big 9-drive failure at some point… but please do not take this as factual.
While RAID-Z2 might technically work with the setup you describe, it offers no advantages whatsoever and even has some disadvantages compared to a simple three-way-mirror configuration.
It would be different if ZFS allowed us to dynamically change the redundancy level of, or increase the number of devices in, a RAID-Zn vdev, but it does not. The only way to actually grow a pool is by adding more vdevs or by growing the size of the existing devices, not by adding devices to an existing vdev. (You can add and remove devices in a mirror vdev, but that only changes the amount of redundancy, not the usable storage capacity.)
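A minimal sketch of that distinction, assuming a pool named tank and hypothetical sdX device names:

# Grow the pool: add another whole vdev (or replace every disk with a bigger one).
zpool add tank raidz2 sdg sdh sdi sdj sdk sdl

# Change redundancy on a mirror vdev: attach or detach devices.
zpool attach tank sda sdb   # sdb becomes another mirror of sda
zpool detach tank sdb       # back to the previous redundancy level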
Always keep in mind that if anything goes wrong, a full ZFS resilver is an arduous process for the remaining disks, and it isn’t unheard of for another disk to develop problems or even die under the stress. For this reason, especially with rotational drives of the sizes common today, double redundancy should be the default choice with triple redundancy an option if you are really paranoid.
Yes, I know where you first picked up your layout scheme (another Mike we knew), just before a bad apparmor update hit us in the early releases of Noble and Jammy.
This was my layout:
# truncate -s 1G /root/d1 /root/d2 /root/d3
# zpool create tank raidz2 /root/d1 /root/d2 /root/d3
# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            /root/d1  ONLINE       0     0     0
            /root/d2  ONLINE       0     0     0
            /root/d3  ONLINE       0     0     0

errors: No known data errors
I lost them all grrr…
Now, until I can afford to get the hardware I want, this is my layout:
zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank   112G   787G   108G  /tank
Very simple and basic but I have not suffered a big data loss since then.
Just my 2 cents worth is all.
I’ll also refer to this at times
Rick, yes, you hit on the point I was thinking of. Of course, this occurred to me after I had set the pool up and loaded it with data. At this point it is not a problem per se, and everything is backed up safely. So if I decided to destroy the pool right now to reconfigure it, it wouldn’t be a problem beyond the time to copy data back into the pool.
As you are aware, I’ll be adding at least one more pool. Instead of the way I have things configured currently, it will use multiple Z2 vdevs that are not as wide (I just haven’t landed on a drive width per vdev).
This also led me to consider hot spares and cold spares. Honestly, my thought of three hot spares for one pool is probably wrong; it should actually be one, as I have it configured now, unless there is more than one Z2 vdev in the pool. Basically, my thought is one hot spare per vdev in the pool, with the idea that it might buy me time to get a cold spare in and resilvered before a domino effect.
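For reference, attaching a hot spare to the existing pool is a one-liner; a rough sketch using this pool’s naming scheme, with the spare’s label (bay4drive10) being a hypothetical placeholder:

# Add a hot spare to the pool; with the zed daemon running, ZFS can kick off
# a resilver onto it automatically when a member drive faults.
sudo zpool add mediapool1 spare bay4drive10

# Confirm it shows up under a "spares" section in the status output.
sudo zpool status mediapool1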
sudo zpool status
[sudo] password for mike:
  pool: mediapool1
 state: ONLINE
  scan: scrub repaired 0B in 02:02:20 with 0 errors on Sat Nov 23 06:04:23 2024
config:

        NAME                  STATE     READ WRITE CKSUM
        mediapool1            ONLINE       0     0     0
          raidz2-0            ONLINE       0     0     0
            bay1drive1-part1  ONLINE       0     0     0
            bay1drive2-part1  ONLINE       0     0     0
            bay1drive3-part1  ONLINE       0     0     0
            bay2drive4-part1  ONLINE       0     0     0
            bay2drive5-part1  ONLINE       0     0     0
            bay2drive6-part1  ONLINE       0     0     0
            bay3drive7-part1  ONLINE       0     0     0
            bay3drive8-part1  ONLINE       0     0     0
            bay3drive9-part1  ONLINE       0     0     0

errors: No known data errors

mike@Beastie:~$ sudo zpool list
[sudo] password for mike:
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
mediapool1  32.7T  7.44T  25.3T        -         -     0%    22%  1.00x    ONLINE  -
mike@Beastie:~$
Thanks for the reply; I’ll delve more into it (your link). I know this is a balancing act, and one must plan for the drives to fail.
(added)
Just an FYI, I’m not running ZFS except for the data pools, so I don’t know if that is a plus or a minus.
In my humble view that’s a plus, less to worry about.
OK, it seems, based on the links provided, that I didn’t actually ask the question in quite the right way; this became apparent after reviewing them. At the bottom of the ZFS calculator page linked there is an article titled
" [3] ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ"
which brings up several scenarios of implementations of ZFS.
- best performance on random IOPS
- best reliability
- best space efficiency
Which one, or which combination of strategies, is chosen will be driven by the goals of the admin/system designer.
I’ve read and re-read the article a couple of times. I even went back to Wintel’s calculator and ran the reliability calculation using differing drive MTBF rates.
But what really stood out was this quote toward the bottom of the article:
" If you are using RAID-Z with 512-byte sector devices with recordsize=4K or 8K and compression=off (but you probably want compression=lz4): use at least 5 disks with RAIDZ1; use at least 6 disks with RAIDZ2; and use at least 11 disks with RAIDZ3.
To summarize: Use RAID-Z. Not too wide. Enable compression."
- This doesn’t take away from or negate the statements within this discussion thus far, but it does seem to strike a balance between at least two of the factors above (reliability and space efficiency).
What is not really clear in the author’s summary is the upper limit of a RAIDZ configuration; the author used the minimum drive widths in the recommendation. To avoid criticism? I’m not sure, or perhaps he just wanted to get to a point and reinforce his statements.
Another aspect of a vdev’s composition we have not discussed is varying the age/hours of the drives themselves, in order to avoid (lessen is probably a better word) a domino effect in a drive-failure plan.
I have seen other articles that advocate going as wide as 12 drives in a RAIDZ2 vdev. I’m not sure if that is actually sage advice.
But where does this leave the basic question: is nine drives too wide?
Maybe, maybe not seems to be the answer.
This discussion might seem moot, but it does assist with the implementation of another pool: simply build it as 3 vdevs, each 6 drives wide. In my thought process, that would call for 3 hot spares (1 for each vdev) in the pool, not for storage but to buy time for a cold-spare replacement drive after a failure.
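A rough sketch of what that create command could look like, with the pool name (mediapool2) and all drive labels being hypothetical placeholders:

# New pool: three 6-wide RAIDZ2 vdevs plus three hot spares.
sudo zpool create mediapool2 \
  raidz2 bay4d1 bay4d2 bay4d3 bay4d4 bay4d5 bay4d6 \
  raidz2 bay5d1 bay5d2 bay5d3 bay5d4 bay5d5 bay5d6 \
  raidz2 bay6d1 bay6d2 bay6d3 bay6d4 bay6d5 bay6d6 \
  spare  spareA spareB spareC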
That plan would allow me to offload my existing data to the new pool, so that I could destroy and reconfigure the old one by either:
- splitting the drives into smaller vdevs (six wide) that join the newly created pool, increasing the number of vdevs and the storage.
- or simply adding three drives and creating another pool of 2 vdevs, 6 wide each.
Hopefully this post either assists others in designing their pools, or it may simply be regarded by others as how not to set up pools. I’m hoping it is the former.
I chose RAIDZ2 for a balance of reliability and storage efficiency; I could have chosen Z1 or Z3 just as easily, or even a mirror.
@sgt-mike I really don’t see any issues with your current setup.
Going 12 wide is something I’m not familiar with, so you will have to keep this updated if you ever do 12 drives.
I’m getting itchy now for my new hardware, but I won’t have a need for anything past 4 wide.
I’m just too embedded in the K.I.S.S. policy.
LOL, that causes me to ask… Details man Details… (that is good news)
While this topic has not gained the traction that I had hoped for, I have been researching it further.
As such, I have come across a “blog” on the topic of ZFS, in which the author covers various ZFS designs, policies, and layouts.
The Blog can be read here ZFS Blog
I’ll cut and paste the parts dealing with ZFS design and layout for those not wishing to read the blog.
8. RAIDZ - Even/Odd Disk Counts
Try (and not very hard) to keep the number of data disks in a raidz vdev to an even number. This means if its raidz1, the total number of disks in the vdev would be an odd number. If it is raidz2, an even number, and if it is raidz3, an odd number again. Breaking this rule has very little repercussion, however, so you should do so if your pool layout would be nicer by doing so (like to match things up on JBOD's, etc).
9. Pool Design Rules
I've got a variety of simple rules I tell people to follow when building zpools:
- Do not use raidz1 for disks 1TB or greater in size.
- For raidz1, do not use less than 3 disks, nor more than 7 disks in each vdev (and again, they should be under 1 TB in size, preferably under 750 GB in size) (5 is a typical average).
- For raidz2, do not use less than 6 disks, nor more than 10 disks in each vdev (8 is a typical average).
- For raidz3, do not use less than 7 disks, nor more than 15 disks in each vdev (13 & 15 are typical average).
- Mirrors trump raidz almost every time. Far higher IOPS potential from a mirror pool than any raidz pool, given equal number of drives. Only downside is redundancy - raidz2/3 are safer, but much slower. Only way that doesn't trade off performance for safety is 3-way mirrors, but it sacrifices a ton of space (but I have seen customers do this - if your environment demands it, the cost may be worth it). For >= 3TB size disks, 3-way mirrors begin to become more and more compelling.
- Never mix disk sizes (within a few %, of course) or speeds (RPM) within a single vdev.
- Never mix disk sizes (within a few %, of course) or speeds (RPM) within a zpool, except for l2arc & zil devices.
- Never mix redundancy types for data vdevs in a zpool (no raidz1 vdev and 2 raidz2 vdevs, for example).
- Never mix disk counts on data vdevs within a zpool (if the first data vdev is 6 disks, all data vdevs should be 6 disks).
- If you have multiple JBOD's, try to spread each vdev out so that the minimum number of disks are in each JBOD. If you do this with enough JBOD's for your chosen redundancy level, you can even end up with no SPOF (Single Point of Failure) in the form of JBOD, and if the JBOD's themselves are spread out amongst sufficient HBA's, you can even remove HBA's as a SPOF.

If you keep these in mind when building your pool, you shouldn't end up with something tragic.
I had earlier mentioned hot spares as a redundancy factor for the pool(s).
The author goes on to argue against the use of hot spares, to the point of “never use them”. It is a compelling argument and, in my view, shouldn’t be dismissed lightly. But even within his statement, the standout was > 72 hrs on a disk replacement. That could, in my view, easily be the case with a home-lab system (vacations!!, RV trips, business trips, etc). So, in the end, I’ll think about the pros and cons the author pointed out on hot spares. I’m personally not convinced that they will be a bad thing in a raidz configuration; in his argument about a mirror, I can fully see the point.
This leaves us with his points on the raidz2 configuration guidelines. I somehow missed the boat, or was too worried about space efficiency, to catch the even drive-width multiples for raidz2 and the odd multiples for raidz1/3. OK, so the author advises no wider than 10 in a raidz2 vdev; that is a guideline number I can live with. More drives will be here shortly, and I could easily reconfigure the vdev width to either 8 or 10.
Thoughts on the author’s points?
Fairly nice Blog from 2013*
Just sharing my thoughts here is all.
We just don’t have enough willing participants here with actual hands-on experience, so I plow forward with what is known to work for “ME”: I prefer to deploy RAIDZ2 in vdevs of 4, 6 or 10 devices and RAIDZ3 in vdevs of 7, 11 or 19 devices. I’ve yet to come up with a planned-out deployment that favors RAIDZ3 over RAIDZ2 + a hot spare.
My thinking on this: a hot spare doesn’t make sense when you can go up a parity level instead (raidz3).
Basically, hot spares are a thing for large arrays. If you have 27 disks split into 3 vdevs of 9 disks in RAIDZ3, you can add 3 hot spares to reduce the “HolyCrap, I need to fix this quick” factor. I’m not sure I will ever use a hot spare in RAIDZ2.
As I grow with my time in ZFS all this may change but for now this will be my plan of attack.
4 Wide just fits my simple needs.
@1fallen
Honestly Rick I don’t think you have a bad plan at all. Nor bring up bad points.
I was hoping that some of the developers, or more people such as yourself, would engage here and state where the maximum break point is for a given configuration. But I guess that was just not to be.
But the research I have done has caused me to evaluate it more. I “might” redo the vdev into a 6-, 8-, or 10-wide layout at a later date, or I just might leave it alone, just to see how it runs.
I did splurge and have 9 more 4TB SAS drives inbound, arriving this week. That leaves, what, another 40 drives to get, or maybe not… it’s just a measly 58 drives total (I was going for shock value).
I kind of have a plan, which is not set in stone.
One thought is to simply practice replacing a drive or two in my existing pool, just to check my ability to do the task, rather than wait until a drive fails and panic (the thought is to stagger the hours). Right now I’m only using 24% of the current configuration, so that should help with the rebuild times a bit.
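For what it’s worth, a practice swap on a healthy pool only takes a couple of commands; a sketch using one of this pool’s labels, with the replacement drive’s label (bay4drive10) being hypothetical:

# Take the chosen drive offline (the pool goes DEGRADED but stays available).
sudo zpool offline mediapool1 bay3drive9-part1

# Replace it with the staged drive and let the resilver run.
sudo zpool replace mediapool1 bay3drive9-part1 bay4drive10

# Watch the resilver, then scrub once it completes.
sudo zpool status -v mediapool1
sudo zpool scrub mediapool1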
I’ve been pretty religious on the scrubs so far.
That will be my starting point, when I get all my junk together.
I’m going out on a limb, but I’ll also use ZFS root on buntu with this new curve.
From 20.04 to 22.04 I had two failures (related to updates), one of which wiped 4 TB of data. If I get burned once more, then no more ZFS root for my buntu systems.
I’m more curious about your current setup over a few months time to see how they hold up.
Thanks @sgt-mike for your thread.
Although I question how my earlier post is a solution here, I appreciate the thought.
Well why did I click that your post was the solution?
- The system on here kept bugging me in a sidebar, after X number of posts, asking whether a solution had been found. And honestly there hasn’t been much said except by the three of us.
- The setup you describe, 4 wide with raidz2, is probably the most durable; going wider simply buys more storage efficiency at the cost of redundancy. And to be quite frank, I don’t think we will find a true, exact solution to the question, or a perfect balance between redundancy and storage efficiency.
When I actually rounded up the parts for this NFS server, based on the UF post where I asked for advice, I used a 250 GB NVMe to sit the OS on, installed 20.something LTS, then upgraded the OS to the latest LTS version via the command line; it went flawlessly. I unchecked the LVM offering, installed SSH, and deselected all other packages.
Then I removed mdadm completely from the system before installing the ZFS package, and maxed the RAM to what the processor (i7-5930K) would address (96 GB; if I upgrade the processor to the latest and greatest LGA2011-v3 i7 offering, I think I could then address 128 GB of RAM with a few more cores at the same speed).
The reasoning for removing mdadm was only based on a post I had seen in which the author complained that it interfered with a ZFS raidz setup they had and crashed horribly. Pretty much, I stripped the system down to just what was needed in a file server.
Then, with the LSI HBA installed, I started populating the drives and assembling them into a vdev. As you are aware, I played with multiple setup configurations using laptop drives until the first batch of SAS drives arrived. Through differing setups I arrived at where it is now; honestly, the only reason this vdev is not 10 wide is that I didn’t have 10 drives available.
I have entertained the idea of adding one more 9-wide vdev to that pool (a sketch of that expand command follows below), or just leaving it alone. I think no more than 2 vdevs per pool might really be where I want to be, but that is an argument for a different post.
And / Or
Simply create another pool with a differing, even-numbered width to run against the 9-wide pool, using pretty much the same hardware. I might do both, especially with the other Antec case I got that offers 12 5.25" bays, which I could use as an external drive bay case (not counting the media server’s bays). I just need to invest in some expander cards, cabling, and hot-swap bay enclosures.
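For the first option, expanding the existing pool with a second 9-wide RAIDZ2 vdev would be a single command; a sketch with hypothetical labels standing in for the nine inbound SAS drives:

# Append a second 9-wide RAIDZ2 vdev; new writes get striped across both vdevs.
sudo zpool add mediapool1 raidz2 \
  bay4drive10 bay4drive11 bay4drive12 \
  bay5drive13 bay5drive14 bay5drive15 \
  bay6drive16 bay6drive17 bay6drive18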
I have done multiple things to get the vdev (pool) to fault, some on purpose and some incidentally.
- I shut the system down and moved one drive to a different cabling location. The drive faulted; it then dawned on me that it is SCSI… the ID changed, so yeah… I shut it down, placed the drive back in the original location, cranked it back up, and all was well. I issued a zpool clear to remove the fault history. Had it been a true SATA drive, I don’t know if it would have faulted.
- During a scrub I issued a 1 TB write to the pool, trying to crash it… yes, one drive faulted, but the scrub completed and so did the write… again a zpool clear… brought the drive back online… issued another scrub… all clean, no errors. Honestly, I expected way more than one drive to fault. (The recovery commands are sketched below.)
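For anyone following along, the recovery sequence from both of those tests boils down to a few commands (the drive label is just the last one from this pool’s layout):

# Clear the fault and error counters once the drive is back where it belongs.
sudo zpool clear mediapool1 bay3drive9-part1

# Bring it back online if it is still marked OFFLINE or FAULTED.
sudo zpool online mediapool1 bay3drive9-part1

# Re-verify the whole pool with a scrub.
sudo zpool scrub mediapool1
sudo zpool status mediapool1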
(maybe I should just post a ZFS notes/testing thread in the Lounge vs here and let this die)
We can just wait and see how the community views it.
The three of us is still just you and me; I had multiple names when I first arrived here.
This is more of what I wanted to see, as I’ve not had a chance yet, lacking my hardware:
I shut the system down and moved one drive to a different cabling location. The drive faulted; it then dawned on me that it is SCSI… the ID changed, so yeah… I shut it down, placed the drive back in the original location, cranked it back up, and all was well. I issued a zpool clear to remove the fault history. Had it been a true SATA drive, I don’t know if it would have faulted.
During a scrub I issued a 1 TB write to the pool, trying to crash it… yes, one drive faulted, but the scrub completed and so did the write… again a zpool clear… brought the drive back online… issued another scrub… all clean, no errors. Honestly, I expected way more than one drive to fault.
Nice, It looks to me like it’s pretty solid then.
Yep, that sounds like a worthwhile effort. I’m curious about any other posters’ input on this… I just can’t get enough ZFS gurus and their offerings…
Good Luck My Friend, I’ll be checking in the Lounge from time to time.
Hopefully no one objects to it being in the lounge.
I have thought about it a bit before mentioning that.
I “think” it would still stay in the guidelines of that location.
As a guest here on this medium (I don’t own this site, so yeah, I’m a guest) I would not want to impose.
We are new to this venue, so perhaps they will be nice to you, or us, if it is not the right place.