ZFS Experience testing and methods

Where to begin?
I’m hoping this is an appropriate area, as it truly may be technical. I have questions, but I’m not asking for assistance; rather, I’d like a conversation / discussion on ZFS on Ubuntu in a headless file-server role. Nothing more, nothing less: the machine serves no purpose other than to serve files, mostly to other Ubuntu systems, i.e. media servers running Jellyfin and Plex. All the media is on the NFS server; the media servers are bare and rely on it.

@anon36188615 and I had a discussion today about using the lounge for this purpose, to keep the clutter / noise down on the support side. Plus, the drives that I mentioned to him have arrived. Currently the system is in a scrub (it should be completed in 22 minutes or so).
In that thread I posed a question about drive width for a RAIDZ2, and really, outside of Rick and me, the question gained no traction.

So, Moderators, if this really belongs somewhere else, or is just inappropriate, please advise or move it. As I mentioned on the support side, I view myself as a guest on this site. So thank you in advance.

Looking at the history of the system, I went with a RAIDZ2 vdev nine drives wide for a pool named mediapool1.

History for 'mediapool1':
2024-10-22.09:48:04 zpool create -f mediapool1 raidz2 /dev/disk/by-partuuid/11a27fcc-ebbd-4864-9bc9-5cc7f01dc785 1ccc6753-a8af-41a9-8a3e-3ee420d90f81 507cbe23-0408-48b3-bfbe-e589d19ef8fe 15bf2071-4741-4bde-9b95-15aa69e50c61 169b4665-7aa6-4a6e-bec0-f7364f7097c4 72d42ad6-ebeb-4fb7-a276-51285483bfba 714d75bb-9b2e-43c2-b38b-e4068b21e105 ca189b0c-4e82-49c0-80fe-aa6f9243f01c 9431154c-4449-4716-9c6e-b51e436721d3
2024-10-22.09:53:48 zfs set compress=lz4 mediapool1
2024-10-22.09:55:57 zpool set autoexpand=on mediapool1
2024-10-22.21:21:13 zpool export mediapool1
2024-10-22.21:21:32 zpool import -d /dev/disk/by-id mediapool1
2024-10-22.21:22:22 zpool export mediapool1
2024-10-22.21:23:19 zpool import -d /dev/disk/by-partuuid mediapool1
2024-10-22.21:24:47 zpool export mediapool1
2024-10-22.21:25:03 zpool import -d /dev/disk/by-partuuid mediapool1
2024-10-23.23:32:24 zpool export mediapool1
2024-10-23.23:34:50 zpool import -d /dev/disk/by-id mediapool1
2024-10-23.23:41:01 zpool export mediapool1
2024-10-23.23:41:37 zpool import -d /dev/disk/by-vdev mediapool1

Later, after creating the pool, I ran across a post elsewhere discussing the use of aliases for the drives, which I thought would be a great way to identify a drive in a FAULTED / DEGRADED state for replacement. Here is the /etc/zfs/vdev_id.conf file I used:

#    by-vdev
#------------------------------------------------------------------------------
# setup for mediapool1 raidz2 9 drives wide external bays add 1 spare use internal bay slot
# For an additional pool use internal drive bays within beastie's case Then expand by 
#------------------------------------------------------------------------------
#     name         fully qualified or base name of device link
alias beastdrive1      /dev/disk/by-id/wwn-0x5000c500869ae7bf-part1
#SN ZC116KNN
alias beastdrive2      wwn-0x5000c50093d69a37-part1
#SN ZC11WJ9N
alias beastdrive3      wwn-0x5000c500957d945f-part1
#SN ZC16AC1P
alias beastdrive4      wwn-0x5000c50085716b6b-part1
#SN Z1ZAYE7F
alias beastdrive5      wwn-0x5000c500855fcd97-part1
#SN Z1ZAVMFC
alias beastdrive6      wwn-0x5000c500855fc8c7-part1
#SN Z1ZAVMM6
alias beastdrive7      wwn-0x5000c500631003db-part1
#SN Z1Z77YLN
alias beastdrive8      wwn-0x5000c500579d0a2b-part1
#SN Z1Z2DRWH
alias beastdrive9      wwn-0x5000c500579c809b-part1
#SN Z1Z2DBZC
#alias drive10         /dev/disk/by-id/
#SN XXXXXXXX
#alias drive11         /dev/disk/by-id/
#SN XXXXXXXX
#alias drive12         /dev/disk/by-id/
#SN ZC16AC1P
#alias drive13         /dev/disk/by-id/
#SN XXXXXXXX
#alias drive14         /dev/disk/by-id/
#SN XXXXXXXX
# past this point must go outside Beastie's case
#--------------------------------------------------------
# once fully edited, save, then issue> sudo udevadm trigger
# note that the alias name is not visible to ls / lsblk or blkid, but
# once the zpool is created it will ID drives by alias name in status and informational reports
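Putting the conf file into effect looks like this; these are just the steps from the comments above spelled out, with the pool name from this thread:

```shell
# regenerate the /dev/disk/by-vdev links from /etc/zfs/vdev_id.conf
sudo udevadm trigger
ls -l /dev/disk/by-vdev            # the alias names should appear here as symlinks

# re-import the pool by the alias links so zpool status reports them
sudo zpool export mediapool1
sudo zpool import -d /dev/disk/by-vdev mediapool1
```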

In disk/by-id I chose the WWN simply because it, as well as the serial number, is usually printed on the drive label.

mike@Beastie:~$ sudo zpool status
[sudo] password for mike:
  pool: mediapool1
 state: ONLINE
  scan: scrub in progress since Wed Dec  4 16:33:35 2024
        7.89T / 7.89T scanned, 6.41T / 7.89T issued at 1.15G/s
        0B repaired, 81.23% done, 00:22:02 to go
config:

        NAME                   STATE     READ WRITE CKSUM
        mediapool1             ONLINE       0     0     0
          raidz2-0             ONLINE       0     0     0
            beastdrive1-part1  ONLINE       0     0     0
            beastdrive2-part1  ONLINE       0     0     0
            beastdrive3-part1  ONLINE       0     0     0
            beastdrive4-part1  ONLINE       0     0     0
            beastdrive5-part1  ONLINE       0     0     0
            beastdrive6-part1  ONLINE       0     0     0
            beastdrive7-part1  ONLINE       0     0     0
            beastdrive8-part1  ONLINE       0     0     0
            beastdrive9-part1  ONLINE       0     0     0

errors: No known data errors

Now, as I mentioned in the other post on the support side, I’ll practice drive replacement (of course I’ll wait until the scrub is completed).
But I was researching the zpool autoreplace property, which, according to a four-year-old post on Stack Exchange, Ubuntu doesn’t really do a good job of honoring. But hey, that was four years ago.
Without delving too far, and only to provide a bit of background, here is a snippet:

**autoreplace=on | off** Controls automatic device replacement. If set to "off", device replacement must be initiated by the administrator by using the "zpool replace" command. If set to "on", any new device, found in the same physical location as a device that previously belonged to the pool, is automatically formatted and replaced. The default behavior is "off". This property can also be referred to by its shortened column name, "replace".
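For reference, enabling and checking the property is just the following; the commented-out line is the manual route if autoreplace never kicks in (the device arguments there are placeholders, not real paths):

```shell
# enable automatic replacement and confirm the property took
sudo zpool set autoreplace=on mediapool1
zpool get autoreplace mediapool1

# manual fallback, which works regardless of the property:
# sudo zpool replace mediapool1 <old-device> <new-device>
```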

So I will enable the feature and replace one of the drives with the HP 4 TB SAS drive labeled SN NHG8JUYN,
wwn-5000CCA2430F88B0. If all goes well, I only expect to have to update my configuration file, then export and import again, to have the drive labeled the same as the one it replaced.
9:41 PM CST update
I set autoreplace=on, thinking that it would be as simple as removing a drive and installing another directly in its place within the drive enclosure. Either I failed at something or the feature is still lacking.
So I did this:

2024-12-04.21:28:45 zpool offline mediapool1 beastdrive1-part1
2024-12-04.21:31:51 zpool replace -f mediapool1 /dev/disk/by-vdev/beastdrive1-part1 /dev/disk/by-id/wwn-0x5000cca2430f81c8-part1

Here is the current status, for those interested:

mike@Beastie:~$ sudo zpool status
  pool: mediapool1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec  4 21:31:41 2024
        6.43T / 7.89T scanned at 7.44G/s, 648G / 7.89T issued at 749M/s
        72.0G resilvered, 8.02% done, 02:49:16 to go
config:

        NAME                                STATE     READ WRITE CKSUM
        mediapool1                          DEGRADED     0     0     0
          raidz2-0                          DEGRADED     0     0     0
            replacing-0                     DEGRADED     0     0     0
              beastdrive1-part1             OFFLINE      0     0     0
              wwn-0x5000cca2430f81c8-part1  ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
            beastdrive2-part1               ONLINE       0     0     0
            beastdrive3-part1               ONLINE       0     0     0
            beastdrive4-part1               ONLINE       0     0     0
            beastdrive5-part1               ONLINE       0     0     0
            beastdrive6-part1               ONLINE       0     0     0
            beastdrive7-part1               ONLINE       0     0     0
            beastdrive8-part1               ONLINE       0     0     0
            beastdrive9-part1               ONLINE       0     0     0

errors: No known data errors

Now for the over-60 military retiree to do the adult-beverage thing and sip on three fingers of Proper No. Twelve while it does its thing …
The resilver completed, however:

mike@Beastie:~$ sudo zpool status
[sudo] password for mike:
  pool: mediapool1
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
        Expect reduced performance.
action: Replace affected devices with devices that support the
        configured block size, or migrate data to a properly configured
        pool.
  scan: resilvered 898G in 03:20:17 with 0 errors on Thu Dec  5 00:51:58 2024
config:

        NAME                              STATE     READ WRITE CKSUM
        mediapool1                        ONLINE       0     0     0
          raidz2-0                        ONLINE       0     0     0
            wwn-0x5000cca2430f81c8-part1  ONLINE       0     0     0  block size: 512B configured, 4096B native
            beastdrive2-part1             ONLINE       0     0     0
            beastdrive3-part1             ONLINE       0     0     0
            beastdrive4-part1             ONLINE       0     0     0
            beastdrive5-part1             ONLINE       0     0     0
            beastdrive6-part1             ONLINE       0     0     0
            beastdrive7-part1             ONLINE       0     0     0
            beastdrive8-part1             ONLINE       0     0     0
            beastdrive9-part1             ONLINE       0     0     0

errors: No known data errors

Yeah, first thoughts are OHHH Nooo … Honestly, it’s not an issue; it’s a mismatch in the drive geometry (it’s a newer drive, coupled with the -o ashift value set when the pool was created). Now, how to fix it:

  1. Replace enough drives in the existing pool to build a new pool from the older 512b drives, in order to offload the data. I can do 4 or 5 drives (leaning towards 5, as it affords enough slack) in a RAIDZ1, since it only has to hold the data long enough to destroy the mismatched pool (mediapool1) and recreate it exactly as before, but with the newer drives, then transfer the data back into a properly aligned pool. Will I lose some usable space on the new 4KiB pool vs. the old 512b one? Yes, a little.
    or
  2. Replace the mismatched drive with the one I pulled.
    Now, will the vdev within the pool be somewhat affected in performance if I choose to do nothing? Yes. So let me proceed with option 1, as I do have a total of nine drives with native 4KiB sectors, and I’ll post the results after all the drives in the vdev are replaced.
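A rough sketch of what option 1 amounts to, assuming a temporary raidz1 holding pool and ZFS snapshots to move the data; the pool name, dataset name, and WWNs below are hypothetical placeholders, not my actual devices:

```shell
# build the temporary holding pool from the pulled 512b drives (hypothetical WWNs)
sudo zpool create -f tempool raidz1 \
    /dev/disk/by-id/wwn-0xAAAAAAAAAAAAAAAA \
    /dev/disk/by-id/wwn-0xBBBBBBBBBBBBBBBB \
    /dev/disk/by-id/wwn-0xCCCCCCCCCCCCCCCC \
    /dev/disk/by-id/wwn-0xDDDDDDDDDDDDDDDD \
    /dev/disk/by-id/wwn-0xEEEEEEEEEEEEEEEE

# snapshot everything recursively and replicate it to the holding pool
sudo zfs snapshot -r mediapool1@migrate
sudo zfs send -R mediapool1@migrate | sudo zfs receive -F tempool/holding

# then destroy mediapool1, recreate it with -o ashift=12, and send the data back
```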

If I had an expander on hand it would save the day, because my HBA will only handle 16 drives, so I can’t put the new drives into a matching-size pool: 9 + 9 = 18 …
Just a quick update:
Replaced enough drives in the original pool to create a temp pool to offload data to. Just waiting on the move to complete.

  1. I didn’t set ashift=12 when I created that pool, hence the error, as it was left at the default. Setting it to support 4KiB sectors requires destroying the pool. < self-inflicted
  2. I tried two differing methods on the replace, and found that if you don’t offline the drive being replaced, the resilver is actually faster. In my case it was a 30-to-40-minute difference per drive versus offlining the drive first.

I’ll set the pool back up the same, except for the default ashift value, so if I use an AF drive in the future I won’t have that happen again.
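A quick way to catch this mismatch before creating a pool is to compare each drive’s logical vs. physical sector size, and to confirm afterwards what ashift a pool actually got:

```shell
# 512e / 4Kn drives report LOG-SEC 512 / PHY-SEC 4096; true 512n drives report 512 / 512
lsblk -o NAME,PHY-SEC,LOG-SEC,MODEL

# confirm the ashift a live pool was created with (ashift: 12 means 4KiB-aligned)
sudo zdb -C mediapool1 | grep ashift
```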


Ok, the pool mediapool1 is reset (reconfigured / recreated, actually), this time with the -o ashift=12 flag as well as a few other features that I failed to use initially. The pool is still 9 wide with 1 vdev, but I did choose to match the drives, all with native 4KiB sectors.
Once the data is moved back (to mediapool1) I will probably reconfigure the datacatch pool, and actually seat those drives properly in the case’s internal trays (5 trays). And yes, I created that pool with the -o ashift=12 flag as well, even though those drives use native 512b rather than 4KiB sectors.
At the end of this I now have four cold spares on hand available to either pool.

P.S.
Also, in order to actually stress the system(s): while I’m moving media from one pool to another on the NFS server, I’m also uploading data from another system to the target pool, and two different media servers are active and pulling data as the writes occur. So three systems are accessing the NFS server at the same time. So far, no hiccups.
18:00 hr CST update
Completed. Everything is intact, scrubbed, etc.; everything is back to normal.
mediapool1 originally had a mix of Seagate ST4000NM0023 (6 Gb/s) and ST4000NM0025 (12 Gb/s) drives (six were 0023s and three were 0025s).
I replaced all of them with HP-labeled WD / HGST HUS726040ALS214 4K AF drives, all matching and low-hours.
I’m currently going back through the Seagates (low-level formatting and sorting them); when that’s complete I’ll put them into a different zpool configuration. I really liked the data-catch pool that I used to transfer the data to, so I’ll probably re-establish it. I did make a mistake on the first Seagate drive in sg_format: the result was that the drive returned 0 TB. I’m attempting to correct that now. I had two drives in and formatting; one had a typo in the size that I didn’t catch (the one reporting 0 TB from lsblk), while the correctly issued command completed with no problems. I’m attempting a recovery of that drive now.

Just an update, in case anyone is actually watching this. (@anon36188615 )
All the drives (Dell-branded Seagates) recovered nicely. I had several reasons why I was so adamant about pulling those drives from the original vdev of the media pool.

  1. There was a mix of 12 Gb/s and 6 Gb/s drives, even though the sizes matched. This was really only an issue in my head, as they would step down to match the slower drives without issue.
  2. In the enclosure, from the very beginning, the light colors would never match. I tore through the drives before establishing the pool and could not find anything that would trigger it. At first I thought it was the three 12 Gb/s drives showing blue lights and the remaining 6 Gb/s drives showing red (not the case). The vdev ran perfectly, no issues whatsoever, but the lights bugged the crap out of me.

After pulling them and replacing them with the HP HGST 4Kn AF drives, I’ve been going through the Dell-branded Seagates down to the lowest level. The cause turned out to be the stupidest thing; actually, “minor” is a better word.
Via the openSeaChest package I found out that six of them had the ready LED turned off and three had it on (they are all on now, via the software). Now I’m sorting them into groups. I’ll probably put the three 12 Gb/s drives on the cold-spare shelf (one for sure), then set the others into a different pool; what type, I’m still debating. So I’ll be on the hunt again for more hardware to set up the next pool.
I’m considering a Z1 x 5-wide x 2 vdevs for that one, or possibly a Z2 x 8-wide. The existing Z2 x 9-wide x 1-vdev pool I’ll run as it stands to see how it holds up, against updates etc., and to see if it’s basically too wide; the whole purpose was to test it against the notion that a RAIDZ2 should be an even number of drives wide.
Until I get the next shipment of hardware, I’m playing with openSeaChest, which I actually like.

sudo openSeaChest_Basics --scan --onlySeagate
==========================================================================================
 openSeaChest_Basics - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_Basics Version: 3.5.4-6_2_0 X86_64
 Build Date: Jul 18 2024
 Today: Mon Dec  9 22:53:14 2024        User: root
==========================================================================================
Vendor   Handle       Model Number            Serial Number          FwRev
SEAGATE  /dev/sg9     ST4000NM0025            ZC16AC1P               DE07
SEAGATE  /dev/sg10    ST4000NM0025            ZC11WJ9N               DE07

 sudo openSeaChest_Basics -d all --readyLED info --onlySeagate
==========================================================================================
 openSeaChest_Basics - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_Basics Version: 3.5.4-6_2_0 X86_64
 Build Date: Jul 18 2024
 Today: Mon Dec  9 22:28:40 2024        User: root
==========================================================================================

/dev/sg9 - ST4000NM0023 - Z1Z77YLN - GS0F - SCSI
Ready LED is set to "On"

/dev/sg10 - ST4000NM0023 - Z1Z2DBZC - GS0D - SCSI
Ready LED is set to "On"

/dev/sg11 - ST4000NM0023 - Z1ZAVMFC - GE13 - SCSI
Ready LED is set to "Off"

/dev/sg12 - ST4000NM0023 - Z1ZAVMM6 - GE13 - SCSI
Ready LED is set to "On"

sudo openSeaChest_Format -d /dev/sg9 --formatUnit  current  --poll   --discardGList --confirm this-will-erase-data
==========================================================================================
 openSeaChest_Format - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_Format Version: 3.0.4-6_2_0 X86_64
 Build Date: Jul 18 2024
 Today: Mon Dec  9 23:11:09 2024        User: root
==========================================================================================

/dev/sg9 - ST4000NM0023 - Z1Z2DBZC - GS0D - SCSI
Format Unit
Performing SCSI drive format.
Depending on the format request, this could take minutes to hours or days.
Do not remove power or attempt other access as interrupting it may make
the drive unusable or require performing this command again!!
Progress will be updated every  5 minutes
        Percent Complete: 10.56%

As you can tell, I use the daylights out of the --onlySeagate flag, which excludes all the other drives. I really like that one can pull SMART data for all the Seagates, or all drives, with one command.


@sgt-mike I am here and monitoring your progress.

Only as an onlooker ATM :smiley:

Just got done with the first spare, using the openSeaChest utility to do a low-level format, in which I had it remove the g-list completely, because the format should redirect the errors to the p-list. Now doing the last bit with badblocks to ensure that the drive is healthy and ready for use. My thought was to zero the grown-defect list so I could actually monitor whether it truly grows, thus predicting a failure. Not for a repair, but for predictive use only.
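For the record, the badblocks pass I’m running looks roughly like this. Be warned: -w is a destructive write test that WIPES the drive, so only use it on a drive that is out of any pool; the device name below is just an example:

```shell
# four-pattern destructive write test, 4KiB blocks, with progress (-s) and verbose (-v) output
sudo badblocks -b 4096 -wsv /dev/sdX
```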

mike@Beastie:~$ sudo openSeaChest_SMART -d /dev/sg11 --idd short --foreground
==========================================================================================
 openSeaChest_SMART - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_SMART Version: 2.3.2-6_2_0 X86_64
 Build Date: Jul 18 2024
 Today: Tue Dec 10 13:16:10 2024        User: root
==========================================================================================

/dev/sg11 - ST4000NM0025 - ZC16AC1P - DE07 - SCSI
The In Drive Diagnostics (IDD) test will take approximately  2 minutes

    IDD test is still in progress...please wait

IDD - short - completed without error!

The --idd flag, per the documentation, is ONLY for Seagate drives, as it performs the drive’s internal diagnostic test.

mike@Beastie:~$ sudo openSeaChest_SMART -d /dev/sg11 --smartCheck
==========================================================================================
 openSeaChest_SMART - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_SMART Version: 2.3.2-6_2_0 X86_64
 Build Date: Jul 18 2024
 Today: Tue Dec 10 13:22:04 2024        User: root
==========================================================================================

/dev/sg11 - ST4000NM0025 - ZC16AC1P - DE07 - SCSI
SMART Check
SMART Check Passed!

So far, so good:

mike@Beastie:~$ sudo openSeaChest_SMART -d /dev/sg11 --showSCSIDefects g
==========================================================================================
 openSeaChest_SMART - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_SMART Version: 2.3.2-6_2_0 X86_64
 Build Date: Jul 18 2024
 Today: Tue Dec 10 13:23:58 2024        User: root
==========================================================================================

/dev/sg11 - ST4000NM0025 - ZC16AC1P - DE07 - SCSI
===SCSI Defect List===
        List includes grown defects
---Physical Sector Format---
Total Defects in list: 3
  Cylinder  Head      Sector
    236609    2        2555
    283883    2         864
    295583    2        1076

When I started, the drive displayed a higher number of errors than three in the g-list (7, IIRC), but now when I perform a SMART check the grown-defect list is at 0.
The true defects didn’t go away or get repaired; they were just added to the p-list via the --idd command.
The end result is that I no longer have to rely on memory or notes to monitor the g-list; if it starts growing, it’s time to pull the drive out of service. Just waiting on badblocks to complete now. Then it’s to the shelf for this one (not an active disk shelf, but a true shelf, for cold-spare / interim use only).
This group of drives does have fairly high hours, but probably isn’t at the expected point of failure yet; I’d venture a guess that about 25% of the expected life remains. Like I said, it’s a personal test for me. My reason for listing this information is to sort of promote the openSeaChest utilities, as I knew some would find them interesting. I didn’t notice a big difference in the time required for sg_format vs. openSeaChest_Format to complete; one appears just as good as the other. It looks to be a good / viable addition alongside the sg3_utils package.
–added–
I was actually running three drives in that test. I found one I’ll label scrub-duty only: when I regenerated the g-list on it, the errors increased. So no pool or cold-spare duty for that one. My thought was to stress them individually for a predictive determination.
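Since the point is to watch the g-list over time rather than rely on memory, a tiny cron-able one-liner could log the count; the flag is the one shown above, while the device handle and log path are just examples:

```shell
# append a timestamped grown-defect count for one drive to a log file
sudo openSeaChest_SMART -d /dev/sg11 --showSCSIDefects g \
  | awk -v ts="$(date -Is)" '/Total Defects in list/ {print ts, $0}' >> "$HOME/glist-sg11.log"
```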

Today, after a scrub, I decided to see if there was an actual improvement from redoing the pool with the 4Kn HGST drives versus the 512b Seagates it used to hold. I had looked over several sites, all claiming to have the answer for ZFS.
Here is the history of the creation of the revamped pool:

2024-12-06.15:12:28 zpool create -f -o ashift=12 -o autoexpand=on -o autoreplace=on mediapool1 raidz2 /dev/disk/by-id/wwn-0x5000cca2430f81c8-part1 wwn-0x5000cca2430eb1c4-part1 wwn-0x5000cca2430e6bd4-part1 wwn-0x5000cca2430f88b0-part1 wwn-0x5000cca2430efdac-part1 wwn-0x5000cca2430f90fc-part1 wwn-0x5000cca2430f7bd0-part1 wwn-0x5000cca2430f8154-part1 wwn-0x5000cca2430f779c-part1
2024-12-06.15:12:49 zfs set compression=lz4 recordsize=1M xattr=sa atime=off mediapool1
2024-12-06.15:20:48 zpool export mediapool1
2024-12-06.15:21:58 zpool import -d /dev/disk/by-vdev mediapool1

Now to check the benchmarks. The original pool first, for writes:

Starting 1 process
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 1 (f=1): [W(1)][-.-%][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3531635: Sun Nov 24 22:32:27 2024
  write: IOPS=1658, BW=1658MiB/s (1739MB/s)(10.0GiB/6175msec); 0 zone resets
    slat (usec): min=126, max=2163, avg=312.71, stdev=190.20
    clat (usec): min=2, max=2942.8k, avg=18623.59, stdev=161147.33
     lat (usec): min=167, max=2942.9k, avg=18936.30, stdev=161146.22
    clat percentiles (msec):
     |  1.00th=[    5],  5.00th=[    6], 10.00th=[    6], 20.00th=[    6],
     | 30.00th=[    6], 40.00th=[    7], 50.00th=[    7], 60.00th=[    8],
     | 70.00th=[   12], 80.00th=[   14], 90.00th=[   22], 95.00th=[   22],
     | 99.00th=[   25], 99.50th=[   27], 99.90th=[ 2937], 99.95th=[ 2937],
     | 99.99th=[ 2937]
   bw (  MiB/s): min= 1422, max= 5370, per=100.00%, avg=2848.29, stdev=1523.61, samples=7
   iops        : min= 1422, max= 5370, avg=2848.29, stdev=1523.61, samples=7
  lat (usec)   : 4=0.03%, 10=0.02%, 250=0.04%, 500=0.04%, 750=0.06%
  lat (usec)   : 1000=0.04%
  lat (msec)   : 2=0.21%, 4=0.39%, 10=67.64%, 20=18.25%, 50=12.98%
  lat (msec)   : >=2000=0.30%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=2938.3M, max=2938.3M, avg=2938325079.00, stdev= 0.00
    sync percentiles (msec):
     |  1.00th=[ 2937],  5.00th=[ 2937], 10.00th=[ 2937], 20.00th=[ 2937],
     | 30.00th=[ 2937], 40.00th=[ 2937], 50.00th=[ 2937], 60.00th=[ 2937],
     | 70.00th=[ 2937], 80.00th=[ 2937], 90.00th=[ 2937], 95.00th=[ 2937],
     | 99.00th=[ 2937], 99.50th=[ 2937], 99.90th=[ 2937], 99.95th=[ 2937],
     | 99.99th=[ 2937]
  cpu          : usr=8.28%, sys=39.07%, ctx=3024, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=1658MiB/s (1739MB/s), 1658MiB/s-1658MiB/s (1739MB/s-1739MB/s), io=10.0GiB (10.7GB), run=6175-6175msec

Now for the revamped pool with the “improvements” (which were geared toward video files):

Starting 1 process
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 1 (f=1): [W(1)][-.-%][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=750461: Thu Dec 12 09:28:02 2024
  write: IOPS=1624, BW=1625MiB/s (1704MB/s)(10.0GiB/6302msec); 0 zone resets
    slat (usec): min=116, max=2078, avg=337.16, stdev=234.15
    clat (usec): min=2, max=2816.8k, avg=19005.48, stdev=154262.90
     lat (usec): min=172, max=2817.0k, avg=19342.65, stdev=154265.87
    clat percentiles (msec):
     |  1.00th=[    5],  5.00th=[    6], 10.00th=[    6], 20.00th=[    6],
     | 30.00th=[    6], 40.00th=[    7], 50.00th=[    8], 60.00th=[    9],
     | 70.00th=[   11], 80.00th=[   17], 90.00th=[   23], 95.00th=[   24],
     | 99.00th=[   33], 99.50th=[   33], 99.90th=[ 2802], 99.95th=[ 2802],
     | 99.99th=[ 2802]
   bw (  MiB/s): min= 1170, max= 4592, per=100.00%, avg=2848.29, stdev=1549.95, samples=7
   iops        : min= 1170, max= 4592, avg=2848.29, stdev=1549.95, samples=7
  lat (usec)   : 4=0.03%, 10=0.02%, 250=0.03%, 500=0.04%, 750=0.05%
  lat (usec)   : 1000=0.06%
  lat (msec)   : 2=0.19%, 4=0.42%, 10=63.85%, 20=21.01%, 50=14.01%
  lat (msec)   : >=2000=0.30%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=2814.1M, max=2814.1M, avg=2814063558.00, stdev= 0.00
    sync percentiles (msec):
     |  1.00th=[ 2802],  5.00th=[ 2802], 10.00th=[ 2802], 20.00th=[ 2802],
     | 30.00th=[ 2802], 40.00th=[ 2802], 50.00th=[ 2802], 60.00th=[ 2802],
     | 70.00th=[ 2802], 80.00th=[ 2802], 90.00th=[ 2802], 95.00th=[ 2802],
     | 99.00th=[ 2802], 99.50th=[ 2802], 99.90th=[ 2802], 99.95th=[ 2802],
     | 99.99th=[ 2802]
  cpu          : usr=8.19%, sys=36.22%, ctx=2896, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=1625MiB/s (1704MB/s), 1625MiB/s-1625MiB/s (1704MB/s-1704MB/s), io=10.0GiB (10.7GB), run=6302-6302msec

Now to move on to reads; for how I’m using the ZFS system, with NFS, this is where the money is. Again, the original pool first:

Starting 1 process
Jobs: 1 (f=1): [R(1)][-.-%][r=3214MiB/s][r=3214 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3531765: Sun Nov 24 22:33:07 2024
  read: IOPS=3202, BW=3202MiB/s (3358MB/s)(10.0GiB/3198msec)
    slat (usec): min=171, max=834, avg=309.34, stdev=48.24
    clat (usec): min=2, max=21470, avg=9565.43, stdev=1485.73
     lat (usec): min=304, max=22241, avg=9874.76, stdev=1523.34
    clat percentiles (usec):
     |  1.00th=[ 5604],  5.00th=[ 6194], 10.00th=[ 9503], 20.00th=[ 9503],
     | 30.00th=[ 9634], 40.00th=[ 9634], 50.00th=[ 9634], 60.00th=[ 9634],
     | 70.00th=[ 9634], 80.00th=[ 9634], 90.00th=[11076], 95.00th=[11207],
     | 99.00th=[14746], 99.50th=[16188], 99.90th=[17171], 99.95th=[19268],
     | 99.99th=[21103]
   bw (  MiB/s): min= 3116, max= 3228, per=99.69%, avg=3192.00, stdev=45.66, samples=6
   iops        : min= 3116, max= 3228, avg=3192.00, stdev=45.66, samples=6
  lat (usec)   : 4=0.05%, 500=0.05%, 750=0.05%, 1000=0.05%
  lat (msec)   : 2=0.15%, 4=0.29%, 10=87.45%, 20=11.88%, 50=0.04%
  cpu          : usr=0.72%, sys=98.44%, ctx=147, majf=0, minf=8202
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=3202MiB/s (3358MB/s), 3202MiB/s-3202MiB/s (3358MB/s-3358MB/s), io=10.0GiB (10.7GB), run=3198-3198msec

The pool with the 4Kn drives:

Starting 1 process
Jobs: 1 (f=1)
TEST: (groupid=0, jobs=1): err= 0: pid=750731: Thu Dec 12 09:29:46 2024
  read: IOPS=3714, BW=3714MiB/s (3895MB/s)(10.0GiB/2757msec)
    slat (usec): min=136, max=1003, avg=267.12, stdev=44.85
    clat (usec): min=2, max=24004, avg=8253.51, stdev=1022.67
     lat (usec): min=257, max=24946, avg=8520.63, stdev=1049.01
    clat percentiles (usec):
     |  1.00th=[ 5276],  5.00th=[ 7504], 10.00th=[ 7701], 20.00th=[ 7963],
     | 30.00th=[ 8094], 40.00th=[ 8160], 50.00th=[ 8160], 60.00th=[ 8160],
     | 70.00th=[ 8225], 80.00th=[ 8225], 90.00th=[ 9241], 95.00th=[ 9634],
     | 99.00th=[10945], 99.50th=[11731], 99.90th=[19006], 99.95th=[21627],
     | 99.99th=[23725]
   bw (  MiB/s): min= 3140, max= 3878, per=99.18%, avg=3683.60, stdev=310.75, samples=5
   iops        : min= 3140, max= 3878, avg=3683.60, stdev=310.75, samples=5
  lat (usec)   : 4=0.05%, 500=0.07%, 750=0.05%, 1000=0.05%
  lat (msec)   : 2=0.19%, 4=0.37%, 10=97.41%, 20=1.73%, 50=0.09%
  cpu          : usr=0.65%, sys=99.27%, ctx=8, majf=0, minf=8203
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=3714MiB/s (3895MB/s), 3714MiB/s-3714MiB/s (3895MB/s-3895MB/s), io=10.0GiB (10.7GB), run=2757-2757msec

For the comparison I used the same commands for both the read and the write tests, which I purposely excluded for brevity’s sake. And, to be fair, the pool did grow by about a TB since the Seagates were benchmarked.
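Since I left the fio invocations out, here is the general shape of such commands. To be clear, this is a reconstruction from the output above (1 MiB blocks, iodepth 32, one job, 10 GiB of I/O over a 2 GiB file, with an end fsync on the write), not the literal command lines I ran, and the test-file path is a placeholder:

```shell
# sequential write with a final fsync, against a file on the pool under test
fio --name=TEST --filename=/mediapool1/fio.test --ioengine=libaio \
    --rw=write --bs=1M --iodepth=32 --numjobs=1 \
    --size=2048m --io_size=10g --end_fsync=1

# sequential read of the same file
fio --name=TEST --filename=/mediapool1/fio.test --ioengine=libaio \
    --rw=read --bs=1M --iodepth=32 --numjobs=1 \
    --size=2048m --io_size=10g
```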

This is a very comprehensive set of data and notes, and honestly looks more like a substantial blog post that others might benefit from reading. Here in the lounge, this thread is invisible to most people.

Also, amusingly, given the quote above, the following is the only question I can see in the entire thread.

Where to begin?

:wink:

Hmmm, very good points @popey .
:smile: Kind of sharing notes really, I guess, as honestly I'm just starting out with the ZFS filesystem, learning it from the ground up, and to a degree I'm deliberately trying to crash it.
So I'm actually not sure it's wise to have most folks read it, since from what I've read the manner in which I have the pool laid out is rather odd, or at least not supposed to be the best method. Nine drives wide is supposed to be bad according to some; others say it's fine.
Plus I didn't really want to clutter up the wrong areas, such as the help / support section, and cause confusion.
P.S. —added later…
I've been doing a little more research on the autoreplace feature and may have found the smoking gun, so to speak. I set up a small test pool to play with just that aspect, populated it with about 1.5 TB of redundant data, and I'm scrubbing it prior to the actual test.

Notes on the attempt, @anon36188615 :
OK, that did NOT work. I wanted to get around having to use a direct-attached SAS topology (non-multipath) in vdev_id.conf, since I'm currently using device-link aliases. So I edited a pool member's WWN in the file, swapping in a replacement drive's WWN, issued a udevadm trigger, and slid in the new drive, thinking that would force a resilver and a write to the replacement. Even after a reboot it still complained that I needed to issue a replace command, and, funny note, it remembered the former member's WWN even after the reboot. So I edited vdev_id.conf back to the original settings, issued another udevadm trigger, and slid the original drive back into the same slot. Boom: resilvered in a 00:00:01 time frame with no errors, back to healthy.
IF I had just issued the replace command, I have no doubt it would have processed. I was just looking for a workaround to the direct-attached SAS topology. Nope: if I want the autoreplace function to work, it looks like I'll actually have to use one of the SAS topology methods.
LOL, but dumping that one drive from the raidz1 vdev (FYI, it's 4 drives wide) sure did panic the pool. So much for the lazy workaround. (The issue, it seems, is that both drives must be named identically to the system, so SAS topology, naming by physical slot or location on the HBA, is about the only way.)
Honestly the replace command is not bad or horrible; I was just looking for the "stupid simple" method that autoreplace offers. In the words of Thomas Edison, I now know 99 ways not to make a light bulb.
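For reference, the slot-based SAS topology that autoreplace wants looks something like this in /etc/zfs/vdev_id.conf. The PCI slot address and channel names below are placeholders for illustration, not my actual config:

```
# Example /etc/zfs/vdev_id.conf using slot-based naming (sas_direct).
# The PCI slot address (85:00.0) is a placeholder; find yours with lspci.
multipath     no
topology      sas_direct
phys_per_port 4
slot          bay

#       PCI_SLOT HBA PORT  CHANNEL NAME
channel 85:00.0  1         A
channel 85:00.0  0         B
```

After editing, a udevadm trigger rebuilds the /dev/disk/by-vdev links, and the names then follow the physical bay rather than the drive's WWN, which is what lets a fresh drive in the same slot be matched for autoreplace.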


Hope everyone's holiday was nice and enjoyable.
I decided to add an LVM volume using two older SATA drives that were collecting dust (connected via the SATA controller on the main board in lieu of the LSI SAS controller). Then I added a simple 3-disk raidz pool for use as a temporary holding spot, in case of the need to transfer data.
My storage looks like this right now

Filesystem                 Size  Used Avail Use% Mounted on
mediapool1                  25T  6.6T   19T  27% /mediapool1
recoverpool                7.2T  1.0M  7.2T   1% /recoverpool
/dev/mapper/volgrp01-lv01  2.7T   28G  2.5T   2% /datagrp

The ZFS breakout looks like this:

mike@Beastie:~$ sudo zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
mediapool1   6.58T  18.2T  6.58T  /mediapool1
recoverpool   296K  7.14T  30.6K  /recoverpool
mike@Beastie:~$ sudo zpool status
  pool: mediapool1
 state: ONLINE
  scan: scrub repaired 0B in 01:48:13 with 0 errors on Wed Dec 25 19:20:47 2024
config:

        NAME        STATE     READ WRITE CKSUM
        mediapool1  ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            drive1  ONLINE       0     0     0
            drive2  ONLINE       0     0     0
            drive3  ONLINE       0     0     0
            drive4  ONLINE       0     0     0
            drive5  ONLINE       0     0     0
            drive6  ONLINE       0     0     0
            drive7  ONLINE       0     0     0
            drive8  ONLINE       0     0     0
            drive9  ONLINE       0     0     0

errors: No known data errors

  pool: recoverpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Wed Dec 25 17:38:00 2024
config:

        NAME         STATE     READ WRITE CKSUM
        recoverpool  ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            tray10   ONLINE       0     0     0
            tray11   ONLINE       0     0     0
            tray12   ONLINE       0     0     0

errors: No known data errors

I'll add another 3-disk vdev to that raidz pool, but I've filled the bays in the NFS case, so more hardware is needed (several expander cards, cables, enclosures, adapters, bays and drive slots borrowed from other existing systems, and last but not least the Antec 1200 case, which will literally become a homemade disk shelf).
I'm leaning towards getting another lot of (9+) 4Kn 4TB drives to add to the raidz2 pool (probably HP / HGST, in order to match the drives up). Just thinking out loud, I could shift to raidz3 at this point, or simply leave it as is. (Although just adding another vdev or two does bring up ZFS's lack of native restriping.)
I do have 9 Seagate 4TB drives on hand, which honestly are a mismatch: 6 are the same model rated at 6Gb/s (three of which are already in the raidz pool), and the remaining three are a different model rated at 12Gb/s. So of the 9 Seagates, 6 will go into the raidz pool. For the leftover three I'll think of something later, maybe another LVM group, or just joining the existing one.
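To weigh those options, here's a quick nominal-capacity sketch (plain shell arithmetic only; real usable space will be lower once ZFS metadata and allocation padding are factored in):

```shell
# Nominal usable TB for a raidz vdev: (drives - parity) * TB per drive.
usable_tb() { echo $(( ($1 - $2) * $3 )); }

echo "raidz2, 9 wide, 4TB drives: $(usable_tb 9 2 4) TB per vdev"
echo "raidz3, 9 wide, 4TB drives: $(usable_tb 9 3 4) TB per vdev"
echo "raidz1, 3 wide, 4TB drives: $(usable_tb 3 1 4) TB per vdev"
```

So a second 9-wide raidz2 vdev adds about 28 TB nominal, while going raidz3 at the same width trades 4 TB per vdev for a third parity drive.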

Benchmarking
The LVM first:

mike@Beastie:/datagrp$ sudo fio --name TEST --eta-newline=5s --filename=temp.file --rw=write --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 1 (f=1): [W(1)][12.3%][w=178MiB/s][w=178 IOPS][eta 00m:50s]
Jobs: 1 (f=1): [W(1)][22.8%][w=179MiB/s][w=179 IOPS][eta 00m:44s]
Jobs: 1 (f=1): [W(1)][33.3%][w=172MiB/s][w=172 IOPS][eta 00m:38s]
Jobs: 1 (f=1): [W(1)][43.9%][w=174MiB/s][w=174 IOPS][eta 00m:32s]
Jobs: 1 (f=1): [W(1)][54.4%][w=176MiB/s][w=176 IOPS][eta 00m:26s]
Jobs: 1 (f=1): [W(1)][64.9%][w=162MiB/s][w=162 IOPS][eta 00m:20s]
Jobs: 1 (f=1): [W(1)][76.8%][w=199MiB/s][w=199 IOPS][eta 00m:13s]
Jobs: 1 (f=1): [W(1)][86.0%][w=192MiB/s][w=192 IOPS][eta 00m:08s]
Jobs: 1 (f=1): [W(1)][98.2%][w=200MiB/s][w=200 IOPS][eta 00m:01s]
Jobs: 1 (f=1): [W(1)][100.0%][w=156MiB/s][w=156 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=952427: Wed Dec 25 21:49:55 2024
  write: IOPS=179, BW=179MiB/s (188MB/s)(10.0GiB/57091msec); 0 zone resets
    slat (usec): min=37, max=177351, avg=276.69, stdev=4869.42
    clat (msec): min=2, max=491, avg=177.97, stdev=37.32
     lat (msec): min=2, max=491, avg=178.25, stdev=37.18
    clat percentiles (msec):
     |  1.00th=[   68],  5.00th=[  157], 10.00th=[  159], 20.00th=[  159],
     | 30.00th=[  167], 40.00th=[  169], 50.00th=[  176], 60.00th=[  176],
     | 70.00th=[  184], 80.00th=[  186], 90.00th=[  203], 95.00th=[  232],
     | 99.00th=[  334], 99.50th=[  368], 99.90th=[  451], 99.95th=[  468],
     | 99.99th=[  489]
   bw (  KiB/s): min=81920, max=225280, per=99.86%, avg=183403.79, stdev=18376.72, samples=114
   iops        : min=   80, max=  220, avg=179.11, stdev=17.95, samples=114
  lat (msec)   : 4=0.02%, 10=0.15%, 20=0.17%, 50=0.31%, 100=1.56%
  lat (msec)   : 250=94.11%, 500=3.68%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=413542k, max=413542k, avg=413541974.00, stdev= 0.00
    sync percentiles (msec):
     |  1.00th=[  414],  5.00th=[  414], 10.00th=[  414], 20.00th=[  414],
     | 30.00th=[  414], 40.00th=[  414], 50.00th=[  414], 60.00th=[  414],
     | 70.00th=[  414], 80.00th=[  414], 90.00th=[  414], 95.00th=[  414],
     | 99.00th=[  414], 99.50th=[  414], 99.90th=[  414], 99.95th=[  414],
     | 99.99th=[  414]
  cpu          : usr=1.25%, sys=1.34%, ctx=9521, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=179MiB/s (188MB/s), 179MiB/s-179MiB/s (188MB/s-188MB/s), io=10.0GiB (10.7GB), run=57091-57091msec

Disk stats (read/write):
    dm-0: ios=0/10304, sectors=0/20906664, merge=0/0, ticks=0/1794877, in_queue=1794877, util=99.87%, aggrios=18/5341, aggsectors=348/10486100, aggrmerge=0/152, aggrticks=2134/927721, aggrin_queue=931588, aggrutil=99.87%
  sdb: ios=26/13, sectors=408/0, merge=0/0, ticks=3/2, in_queue=6, util=0.01%
  sda: ios=11/10669, sectors=288/20972200, merge=0/305, ticks=4266/1855440, in_queue=1863170, util=99.87%
mike@Beastie:/datagrp$ sudo fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [R(1)][12.3%][r=151MiB/s][r=151 IOPS][eta 00m:50s]
Jobs: 1 (f=1): [R(1)][23.2%][r=179MiB/s][r=179 IOPS][eta 00m:43s]
Jobs: 1 (f=1): [R(1)][33.9%][r=178MiB/s][r=178 IOPS][eta 00m:37s]
Jobs: 1 (f=1): [R(1)][45.5%][r=178MiB/s][r=178 IOPS][eta 00m:30s]
Jobs: 1 (f=1): [R(1)][56.4%][r=203MiB/s][r=203 IOPS][eta 00m:24s]
Jobs: 1 (f=1): [R(1)][67.3%][r=203MiB/s][r=203 IOPS][eta 00m:18s]
Jobs: 1 (f=1): [R(1)][78.2%][r=192MiB/s][r=192 IOPS][eta 00m:12s]
Jobs: 1 (f=1): [R(1)][89.1%][r=192MiB/s][r=192 IOPS][eta 00m:06s]
Jobs: 1 (f=1): [R(1)][100.0%][r=178MiB/s][r=178 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=956513: Wed Dec 25 21:51:37 2024
  read: IOPS=185, BW=185MiB/s (194MB/s)(10.0GiB/55279msec)
    slat (usec): min=35, max=527, avg=85.89, stdev=20.37
    clat (msec): min=12, max=353, avg=172.53, stdev=21.26
     lat (msec): min=12, max=353, avg=172.62, stdev=21.25
    clat percentiles (msec):
     |  1.00th=[  117],  5.00th=[  159], 10.00th=[  159], 20.00th=[  159],
     | 30.00th=[  167], 40.00th=[  169], 50.00th=[  176], 60.00th=[  176],
     | 70.00th=[  180], 80.00th=[  184], 90.00th=[  184], 95.00th=[  184],
     | 99.00th=[  264], 99.50th=[  300], 99.90th=[  342], 99.95th=[  347],
     | 99.99th=[  355]
   bw (  KiB/s): min=151552, max=208896, per=100.00%, avg=189868.22, stdev=10727.60, samples=110
   iops        : min=  148, max=  204, avg=185.42, stdev=10.48, samples=110
  lat (msec)   : 20=0.10%, 50=0.29%, 100=0.46%, 250=97.83%, 500=1.32%
  cpu          : usr=0.10%, sys=1.70%, ctx=10091, majf=0, minf=8204
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=185MiB/s (194MB/s), 185MiB/s-185MiB/s (194MB/s-194MB/s), io=10.0GiB (10.7GB), run=55279-55279msec

Disk stats (read/write):
    dm-0: ios=10215/5, sectors=20920320/32, merge=0/0, ticks=1748749/1897, in_queue=1750646, util=99.89%, aggrios=5440/2, aggsectors=10485760/16, aggrmerge=0/0, aggrticks=930915/785, aggrin_queue=931986, aggrutil=99.83%
  sdb: ios=0/1, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  sda: ios=10880/4, sectors=20971520/32, merge=0/1, ticks=1861831/1571, in_queue=1863973, util=99.83%

Now the simple raidz x three 4TB drives:

mike@Beastie:/recoverpool$ sudo fio --name TEST --eta-newline=5s --filename=temp.file --rw=write --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
[sudo] password for mike:
TEST: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 1 (f=1): [W(1)][100.0%][eta 00m:00s]
Jobs: 1 (f=1): [W(1)][100.0%][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=952068: Wed Dec 25 21:47:03 2024
  write: IOPS=845, BW=845MiB/s (886MB/s)(10.0GiB/12113msec); 0 zone resets
    slat (usec): min=147, max=2883, avg=439.71, stdev=360.74
    clat (usec): min=3, max=7565.0k, avg=36586.35, stdev=414893.30
     lat (usec): min=205, max=7565.2k, avg=37026.06, stdev=414884.55
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[   10], 50.00th=[   15], 60.00th=[   16],
     | 70.00th=[   18], 80.00th=[   20], 90.00th=[   22], 95.00th=[   30],
     | 99.00th=[   36], 99.50th=[   37], 99.90th=[ 7550], 99.95th=[ 7550],
     | 99.99th=[ 7550]
   bw (  MiB/s): min=  948, max= 4916, per=100.00%, avg=2215.33, stdev=1208.91, samples=9
   iops        : min=  948, max= 4916, avg=2215.33, stdev=1208.91, samples=9
  lat (usec)   : 4=0.02%, 10=0.02%, 20=0.01%, 250=0.03%, 500=0.04%
  lat (usec)   : 750=0.04%, 1000=0.03%
  lat (msec)   : 2=0.14%, 4=0.28%, 10=39.90%, 20=44.63%, 50=14.56%
  lat (msec)   : >=2000=0.30%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=7563.2M, max=7563.2M, avg=7563169066.00, stdev= 0.00
    sync percentiles (msec):
     |  1.00th=[ 7550],  5.00th=[ 7550], 10.00th=[ 7550], 20.00th=[ 7550],
     | 30.00th=[ 7550], 40.00th=[ 7550], 50.00th=[ 7550], 60.00th=[ 7550],
     | 70.00th=[ 7550], 80.00th=[ 7550], 90.00th=[ 7550], 95.00th=[ 7550],
     | 99.00th=[ 7550], 99.50th=[ 7550], 99.90th=[ 7550], 99.95th=[ 7550],
     | 99.99th=[ 7550]
  cpu          : usr=4.98%, sys=21.56%, ctx=4139, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=845MiB/s (886MB/s), 845MiB/s-845MiB/s (886MB/s-886MB/s), io=10.0GiB (10.7GB), run=12113-12113msec
mike@Beastie:/recoverpool$ sudo fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
Jobs: 1 (f=1)
TEST: (groupid=0, jobs=1): err= 0: pid=952195: Wed Dec 25 21:47:38 2024
  read: IOPS=3712, BW=3713MiB/s (3893MB/s)(10.0GiB/2758msec)
    slat (usec): min=139, max=979, avg=267.24, stdev=43.09
    clat (usec): min=2, max=25408, avg=8253.45, stdev=1047.47
     lat (usec): min=262, max=26324, avg=8520.70, stdev=1076.49
    clat percentiles (usec):
     |  1.00th=[ 5211],  5.00th=[ 7635], 10.00th=[ 7898], 20.00th=[ 8094],
     | 30.00th=[ 8094], 40.00th=[ 8094], 50.00th=[ 8094], 60.00th=[ 8094],
     | 70.00th=[ 8094], 80.00th=[ 8160], 90.00th=[ 9241], 95.00th=[ 9503],
     | 99.00th=[10814], 99.50th=[12387], 99.90th=[20841], 99.95th=[23200],
     | 99.99th=[25035]
   bw (  MiB/s): min= 3144, max= 3870, per=99.24%, avg=3684.80, stdev=307.57, samples=5
   iops        : min= 3144, max= 3870, avg=3684.80, stdev=307.57, samples=5
  lat (usec)   : 4=0.05%, 500=0.05%, 750=0.05%, 1000=0.05%
  lat (msec)   : 2=0.20%, 4=0.39%, 10=97.79%, 20=1.31%, 50=0.12%
  cpu          : usr=0.87%, sys=99.06%, ctx=7, majf=0, minf=8203
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=3713MiB/s (3893MB/s), 3713MiB/s-3713MiB/s (3893MB/s-3893MB/s), io=10.0GiB (10.7GB), run=2758-2758msec

Last but not least is the raidz2 x nine 4TB drives

mike@Beastie:/mediapool1$ sudo fio --name TEST --eta-newline=5s --filename=temp.file --rw=write --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 1 (f=1): [W(1)][100.0%][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=956898: Wed Dec 25 21:53:04 2024
  write: IOPS=1507, BW=1507MiB/s (1581MB/s)(10.0GiB/6793msec); 0 zone resets
    slat (usec): min=144, max=2958, avg=392.73, stdev=332.38
    clat (usec): min=3, max=2742.1k, avg=20489.16, stdev=149742.52
     lat (usec): min=204, max=2744.0k, avg=20881.89, stdev=149765.98
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[    8], 50.00th=[   13], 60.00th=[   14],
     | 70.00th=[   17], 80.00th=[   19], 90.00th=[   21], 95.00th=[   23],
     | 99.00th=[   28], 99.50th=[   31], 99.90th=[ 2735], 99.95th=[ 2735],
     | 99.99th=[ 2735]
   bw (  MiB/s): min=   36, max= 4162, per=100.00%, avg=2215.33, stdev=1264.87, samples=9
   iops        : min=   36, max= 4162, avg=2215.33, stdev=1264.87, samples=9
  lat (usec)   : 4=0.02%, 10=0.03%, 250=0.03%, 500=0.04%, 750=0.03%
  lat (usec)   : 1000=0.04%
  lat (msec)   : 2=0.15%, 4=0.35%, 10=45.45%, 20=41.57%, 50=11.99%
  lat (msec)   : >=2000=0.30%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=2743.7M, max=2743.7M, avg=2743703200.00, stdev= 0.00
    sync percentiles (msec):
     |  1.00th=[ 2735],  5.00th=[ 2735], 10.00th=[ 2735], 20.00th=[ 2735],
     | 30.00th=[ 2735], 40.00th=[ 2735], 50.00th=[ 2735], 60.00th=[ 2735],
     | 70.00th=[ 2735], 80.00th=[ 2735], 90.00th=[ 2735], 95.00th=[ 2735],
     | 99.00th=[ 2735], 99.50th=[ 2735], 99.90th=[ 2735], 99.95th=[ 2735],
     | 99.99th=[ 2735]
  cpu          : usr=7.64%, sys=37.16%, ctx=2807, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=1507MiB/s (1581MB/s), 1507MiB/s-1507MiB/s (1581MB/s-1581MB/s), io=10.0GiB (10.7GB), run=6793-6793msec
mike@Beastie:/mediapool1$ sudo fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
Jobs: 1 (f=1)
TEST: (groupid=0, jobs=1): err= 0: pid=956989: Wed Dec 25 21:53:29 2024
  read: IOPS=3707, BW=3707MiB/s (3888MB/s)(10.0GiB/2762msec)
    slat (usec): min=139, max=863, avg=267.76, stdev=33.77
    clat (usec): min=2, max=21293, avg=8274.86, stdev=966.19
     lat (usec): min=256, max=21976, avg=8542.62, stdev=989.42
    clat percentiles (usec):
     |  1.00th=[ 5211],  5.00th=[ 7701], 10.00th=[ 7898], 20.00th=[ 8094],
     | 30.00th=[ 8094], 40.00th=[ 8094], 50.00th=[ 8160], 60.00th=[ 8160],
     | 70.00th=[ 8160], 80.00th=[ 8160], 90.00th=[ 9634], 95.00th=[ 9634],
     | 99.00th=[10028], 99.50th=[10945], 99.90th=[17433], 99.95th=[19268],
     | 99.99th=[21103]
   bw (  MiB/s): min= 3076, max= 3876, per=99.29%, avg=3681.20, stdev=340.20, samples=5
   iops        : min= 3076, max= 3876, avg=3681.20, stdev=340.20, samples=5
  lat (usec)   : 4=0.05%, 500=0.05%, 750=0.05%, 1000=0.05%
  lat (msec)   : 2=0.21%, 4=0.39%, 10=98.17%, 20=1.00%, 50=0.04%
  cpu          : usr=0.54%, sys=99.38%, ctx=6, majf=0, minf=8204
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=3707MiB/s (3888MB/s), 3707MiB/s-3707MiB/s (3888MB/s-3888MB/s), io=10.0GiB (10.7GB), run=2762-2762msec

I don't know if it matters when comparing the raidz vs raidz2 pools, but mediapool1 is 27% used versus almost 0% on the recoverpool.
The LVM results didn't really surprise me, as those are just older SATA drives that were sitting around.
But honestly I do like the HGST drives in the raidz2 pool, with them being 12Gb/s and the newer AF format. Of course the Seagates are no slouch either at 6Gb/s / 512-byte format, so far.
None of the pools are using any special vdevs, and they're all spinning rust, so I think the results were pretty good.
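One caveat on the read numbers: ~3.7 GiB/s from spinning drives, with sys at ~99% CPU, almost certainly means the 2 GiB test file was being served from ARC rather than disk (ZFS has historically ignored fio's --direct=1). A quick way to check is the ARC hit ratio, sketched here as a function so the kstat path (the standard ZFS-on-Linux location) can be overridden for testing:

```shell
# Print the ARC hit ratio from the ZFS kstats.
# Default path is the standard ZFS-on-Linux location; pass another file to test.
arc_hit_ratio() {
  awk '$1 == "hits"   { h = $3 }
       $1 == "misses" { m = $3 }
       END { printf "ARC hit ratio: %.1f%%\n", 100 * h / (h + m) }' \
      "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

Re-running fio with a file several times larger than RAM, or on a scratch dataset with primarycache=metadata, gives read numbers closer to what the disks can actually do.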

Nice analysis.

One of my upcoming projects is to put into use two 10TB Toshiba NAS Pro drives for media storage. I haven't ripped the media backup yet; I need to get started on that. I'm thinking of doing a ZFS mirror, since this data will be static once written. I know mirrors are less efficient because you lose 50% of capacity, but in this case I see it as a 2nd copy.

My Ubuntu Desktop (stripped-down DE) that these drives will be used on already runs root on ZFS. Root is on NVMe, with a 2nd NVMe as a ZFS mirror of root. I've tested the mirror by pulling one of the NVMes and was able to boot right up on the remaining device; zfs list showed a degraded mirror, but it worked. Then I replaced the NVMe and (I'd have to look at my notes, but) was able to restore the mirror and boot like nothing happened. Now, I don't trust a root mirror by itself; it only protects against hardware failure, not data failure. So, in addition to the mirror, I use snapshots and replicate the snapshots (zfs send) to a backup pool on a daily cron.
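That snapshot-and-replicate routine can be sketched as a single cron entry. The pool names rpool and backup below are placeholders, and a real setup would want error handling and snapshot pruning (tools like sanoid/syncoid or zfs-auto-snapshot do this with retention policies built in):

```
# Hypothetical /etc/cron.d/zfs-daily-backup (pool names are placeholders).
# 02:30 daily: dated recursive snapshot, then incremental replication
# of everything since yesterday's snapshot to the backup pool.
30 2 * * * root zfs snapshot -r rpool@$(date +\%F) && zfs send -R -I rpool@$(date -d yesterday +\%F) rpool@$(date +\%F) | zfs receive -u -F backup/rpool
```

Here -R preserves the dataset tree, -I sends all intermediate snapshots since the named one, and -F on the receive side rolls the target back to match before applying the stream.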

ZFS is fun to use. I have copious notes on the whole thing. I got stupid one day and wrecked the system on purpose to test my notes and with some adjustments got back up again. I trust it enough to use on root for my kvm host. The zfs features, to me, outweigh the risk of an update wrecking an install. Probably would not use it in a mission critical production setting, but I am not a mission critical user. Backups solve most problems.

Yes, I went the route of ZFS for just the raidz drives, leaving the OS without any ZFS configuration, as I figured this would be simpler in case of a crash.
Or in case Ubuntu pushes out another beta OpenZFS update, which I "think" is what snagged you and MALfen in the past.

Mirrors I generally stay away from because of the 50% capacity loss you mentioned. But there are advantages to mirrors that can't be denied; they're not a bad way to go depending on one's needs, and they do offer real protection, since both drives in the mirror have to go offline before the data is gone. Especially for what you're doing. BTW, are those SATA or SAS drives? (I'm assuming SATA.)

Right now I'm looking at the hardware side and what to order to add the additional vdevs to the two pools I have.
I'm leaning towards (actually, pretty much sold on) the Adaptec AEC-82885T, aka Lenovo 00LF095, expander: 36 ports at 12Gb/s (IIRC it's based on a Broadway/LSI chipset), and it can be powered by a simple 4-pin Molex connection, so no motherboard power connection is required. Power is either/or: via the PCIe bus or via the Molex connection, not both. In the way I've thought about deploying it in my cases, I'll lose 16 ports for passthrough (I want to use full-duplex connections, i.e. two cables in / out), which will leave me with 20 ports (drives).

Which will be fine. As for the two other cases: I have an Antec 1200 with just a power supply to host drives, which can support up to 20 drives depending on the enclosures installed in its 12 5.25" bays.
Then the Cooler Master HAF 922 case the media server is in can offer its 5 internal 3.5" bays and 5 5.25" bays, which could hold 12 to 20 drives depending on whether I stick to SFF, LFF, or a mixture of enclosures. I'll probably go to the CM case first, as it is running anyway for the media server.
On a side note:
I was using a SAS-to-SATA right-angle adapter to connect to the 12Gb/s Seagate drives, so they would fit cleanly in the NFS box's internal bays. It would not work, period, unless I stuck the 6Gb/s drives in. Which made sense, SAS-2 vs SAS-3: although no specs were given for the adapters, I can only deduce they are SAS-2 capable. So, mental note: use only SAS-1/SAS-2 drives with those adapters.

Yep, my Toshibas are SATA. The problem I had was that my ROG Strix motherboard had only 6 onboard SATA connectors, and I lost one when I populated the 2nd M.2.

I ended up getting this:

LSI Logic Controller Card LSI00301 SAS 9207-8i 8Port Internal SAS/SATA 6Gb/s PCI Express

Now I can connect up to 8 more SATA drives via the 2 SAS connections on this card.

Lesson learned: I should have researched the motherboard's connectivity options. For what you are doing, my hardware would be sorely inadequate.

I do have an old unused Cooler Master tower that has a lot of 5.25" bays but no 2.5" bays; I just tie-wrap my 2.5s everywhere. Still not enough, though, for a large NAS.

Yeah, the Asus X99 board I have in the NFS box did the same, except I lost 2 SATA ports when I installed the M.2 NVMe (same thing on the other two MSI boards I have).
But that was not a problem, as I was going SAS for the storage side. The only things I have plugged into the SATA ports are the two 1TB SATA drives and the 2TB in the LVM.

Looking briefly at your card: 6Gb/s, eight ports, and support for up to 256 SAS or SATA drives. You could go with an SFF-8088 to SFF-8087 adapter to reach outside your case to an expander in your old Cooler Master case.
Such as THIS
Or even THIS ONE

You could even go with the Intel RES2SV240 expander, but it would still require an adapter to reach outside the case.

*Personally I would use the Adaptec AEC-82885T; a simple SFF-8643 to SFF-8087 cable will work. You would still be limited to 6Gb/s because of the HBA, BUT it will GREATLY enhance your capabilities, both internally and externally. One could take your existing setup / card and plug two SFF-8087 to SFF-8643 cables into the last two internal ports on the Adaptec card. That will allow 2 ports outside the case for expansion and 20 drives inside your case. I've seen them as low as $20.00, although they come slow boat from China. That's my plan: I'll probably order three or four at one time so I have them on hand. And yeah, you would have to change your cabling to an SFF-8643-to-SATA breakout cable. The other upside is that it's 12Gb/s, so if you changed to a 12Gb/s HBA it would still work.*

Now, if you have or plan to build a disk shelf: to turn on the power supply in your CM case without a motherboard, so it can act as a disk shelf, I'm using This Item for my Antec 1200 case.
In reality, as you're not powering a GPU or CPU, only hard drives and fans, that allows one to use a 500 to 600 watt PSU comfortably. Most LSI HBAs support staggered spin-up, reducing the power requirements. (I looked in the manual for yours; I didn't see a mention of staggered spin-up.)
-----OR --------
You could go this route.
The problem there is finding caddies with the correct interposers for mixed SATA / SAS drive support. But the connection on the bottom controller is a standard SFF-8088, I'm told (the double dots, not the diamond / triangle or whatever most call it). Just stay away from the EMC KTN-STL4: they WILL NOT WORK, and that is what makes them cheap from the recycle shops. They're also extremely proprietary, meaning I've not heard of anyone turning one into a simple disk shelf that can be plugged into directly from a Linux machine without exotic methods; the majority of posters I've seen literally give up. My opinion: a waste of money and effort.

As far as rack disk arrays go, there are plenty to choose from. Just information, in case you're not wanting to use the old CM case for a disk shelf.

Oh, and I found This one and was considering it for a bench-test setup. Nothing fancy, and at $18.00 shipped? The external port is 1 x SFF-8088 for SAS/SATA, and the internal ports are pure SATA (so 4 additional SATA ports), which could hook up to a disk shelf at narrow width (one cable).
Now, it won't keep up with the 12Gb/s controller, but for simple SAS formatting and just connecting to SATA drives it's almost perfect for a test bench (key word: almost; I think it supports up to 256 drives).
Basically a neat little card to play with.

P.S.
After writing that spiel, the idea of two HBAs came into my head. Some say yes, some say no. I do have a couple of LSI HBAs sitting here (one is a 16i that lost a channel, turning it into an 8i; the other is a straight-up 8i). A quick shutdown and I could check whether they bump into each other's addresses; easy enough to test.
But that is just a suggestion on my end; you might not have any plans or desires for such, as I mentioned, and there may be easier methods.
Had I actually researched more before jumping on buying cases for disk shelves / arrays, I probably would have bought a couple or more rack servers or disk arrays. From what I can tell, the Supermicro seems to be the easiest conversion.
Here is a sample of a JBOD array: an IBM 2.5" disk array. Honestly, I cannot figure a way to build a disk shelf / array cheaper. It's literally populate it with your drives, plug in an SFF-8087 cable, and go. The caddies are there, no hunting required. The only downsides are the drive size (the 3.5" LFF allows for bigger pools / drive sizes), the physical size of the case, and possibly noise. At 6Gb/s it's not a good choice for NVMe, but for some spinning rust it would tote the mail.
(I have seen lots of 12 2.5" 900GB SAS drives for as low as $60.00 from recyclers, so two lots at $120.00 would fully populate it. Would one have to change them from, say, 520-byte to 512-byte sectors? Yeah, maybe, but at $5.00 a drive: a raidz1, 6 wide, x 4 vdevs would net you 12TB at $9.86 a TB, while mirrors would be 7.5TB usable at $15.88 a TB.)
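A quick sanity check on those $/TB figures. The helper below just divides cost by usable capacity; the totals are the rounded ones from this post, so the results land near, not exactly on, the quoted numbers:

```shell
# Dollars per usable TB: total cost / usable TB.
cost_per_tb() { awk -v c="$1" -v t="$2" 'BEGIN { printf "$%.2f/TB\n", c / t }'; }

cost_per_tb 120 12    # raidz1 layout, ~12 TB usable
cost_per_tb 120 7.5   # mirror layout, ~7.5 TB usable
```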

Although I guess the big deal with a home-built disk array is that the form factor is smaller than a rack, the power draw is lower, and it's a bit quieter too, lol.

Thanks for sharing all your research here! I had not thought of all these possibilities for expanding storage, not only in the host case but outside it as well.

I misspoke earlier when I stated my CM case had a tall rack for 5.25" drives; it is for HDDs (3.5"). But it also has three 5.25" bays. I wonder if they make a 5.25"-to-3.5" adapter so you can fit HDDs in the old optical-drive slots…

@aljames
I'm using Athena Power enclosures, which fit X number of drives into two or three bays. The only reason I mention them is that they're at a better price point vs Icy Dock, and they have both 2.5" and 3.5" versions. For the three-bay form factor I have a 4-drive unit; the 5-drive x 3-bay unit I was interested in is not available right now. Here are some of the offerings, although there are more brands on eBay etc.; I'm not plugging for one place.
newegg and enclosures

Hope that helps. (Look around; you can find some pretty interesting and less costly options.) But I would ensure they are in fact SAS capable.