Identifying and replacing a disk in a ZFS pool

Sure, you guys go ahead and have fun in my absence…

I’m happy to hear things are better. 🙂

This is what it says now that the format has completed.
I also ran a short self-test.

sudo smartctl -a /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.11.0-14-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726060AL4210
Revision:             AAG0
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca271b5d120
Serial number:        K8K6ZT5N
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Feb  4 03:20:20 2025 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 24685:25
Manufactured in week 14 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  24
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  6808
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 23457345250000896

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      288         0       288    1319011     380487.898           0
write:         0       11         0        11    1086528      40832.172           0
verify:        0        0         0         0     329704          0.000           0

Non-medium error count:        0

  Pending defect count:0 Pending Defects
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   24685                 - [-   -    -]

Long (extended) Self-test duration: 53828 seconds [15.0 hours]

I would venture a guess that it is still usable: the grown defect list did not increase, and the past error counts remained the same.
The grown defect list is what I usually keep an eye on, but there is some validity in watching the total errors corrected as well. Are you using ECC RDIMMs or just regular UDIMMs?
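
For anyone who wants to track those same numbers over time, here is a rough sketch that pulls them out of a saved `smartctl -a` report. It assumes the SAS/SCSI report layout shown in this thread (SATA drives print a different attribute table), and the function name `smart_summary` is made up for illustration.

```sh
# Read a `smartctl -a` report for a SAS drive on stdin and print the
# counters discussed above as key=value lines.
smart_summary() {
  awk '
    /^Elements in grown defect list:/ { gdl = $NF }
    /^Non-medium error count:/        { nme = $NF }
    # Rows of the error counter log: column 5 is total errors corrected,
    # the last column is total uncorrected errors.
    /^(read|write|verify):/ {
      corrected   += $5
      uncorrected += $NF
    }
    END {
      printf "grown_defects=%s\n", (gdl == "" ? "unknown" : gdl)
      printf "corrected=%.0f\n", corrected
      printf "uncorrected=%.0f\n", uncorrected
      printf "non_medium=%s\n", (nme == "" ? "unknown" : nme)
    }
  '
}

# Example (needs root and a real device):
#   sudo smartctl -a /dev/sdf | smart_summary
```

Saving the output after each scrub or self-test makes it easy to see whether `grown_defects` or `uncorrected` ever moves off zero.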

Hang on, I’m cranking up Deepblue (my backup server; I just sent the WOL magic packet) to pull one of my smartctl reports so you can compare.

Now, mine has quite a few corrected errors, mostly because I ran tests with the drive to check ZFS, which was on purpose.

root@deepblue:/home/mike# smartctl -a /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-52-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST4000NM0023
Revision:             GE13
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500855fc8c7
Serial number:        Z1ZAVMM6
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Tue Feb  4 03:33:12 2025 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     31 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 65619:58
Manufactured in week 17 of year 2016
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  3395
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  6111
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2035324647
  Blocks received from initiator = 781105023
  Blocks read from cache and sent to initiator = 2994318995
  Number of read and write commands whose size <= segment size = 3530437749
  Number of read and write commands whose size > segment size = 5820

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 65619.97
  number of minutes until next internal SMART test = 52

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   829513102      576         0  829513678        576    2440693.559           0
write:         0        0         0         0          0     239785.272           0
verify: 3435179573        0         0  3435179573          0       3463.364           0

Non-medium error count:  1097213

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Aborted (device reset ?)   64   65535                 - [-   -    -]
# 2  Default           Completed                  64   35021                 - [-   -    -]
# 3  Default           Completed                  64   15874                 - [-   -    -]
# 4  Default           Completed                  64   15874                 - [-   -    -]
# 5  Reserved(7)       Completed                  64       4                 - [-   -    -]

Long (extended) Self-test duration: 32700 seconds [9.1 hours]

As you can see, mine is a bit older and has quite a few hours on it, but the grown defect list is still 0.
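
To put the power-on hours in perspective, a small sketch like this converts the "Accumulated power on time" line into years. It assumes the `hours:minutes` layout printed for SAS drives above and a plain 365-day year; `power_on_years` is just an illustrative name.

```sh
# Convert the "Accumulated power on time, hours:minutes H:M" line
# from a smartctl report on stdin into years of powered-on time.
power_on_years() {
  awk -F'[ :]+' '/^Accumulated power on time/ {
    # With this separator the last two fields are hours and minutes
    hours = $(NF-1) + $NF / 60
    printf "%.1f\n", hours / 8760   # 8760 = 24 * 365
  }'
}

# Example:
#   sudo smartctl -a /dev/sdd | power_on_years
```

For the reports in this thread that works out to roughly 2.8 years for the HGST drive and 7.5 years for the Seagate.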

Most of my corrections were handled by the drive's own ECC (the "ECC fast" column), but not all. Like I said, I’ve used this drive for testing and interim duty, such as now; I will replace these later with a fresher drive set.

This one is from my NFS server with HP-branded HGST drives.

root@Beastie:/home/mike# smartctl -a /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-52-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB4000JEQNL
Revision:             HPD7
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2430eb1c4
Serial number:        NHG82J7N
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Mon Feb  3 21:47:48 2025 CST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 40920:46
Manufactured in week 45 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  3120
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  4814
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        1         0         0          0     148253.041           0
write:         0        8         0         8          0     102729.910           0
verify:        0        0         0         0          0          0.000           0

Non-medium error count:   712251

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -       7                 - [-   -    -]
# 2  Background short  Completed                   -       3                 - [-   -    -]
# 3  Background short  Completed                   -       2                 - [-   -    -]

Long (extended) Self-test duration: 41040 seconds [11.4 hours]

So yes, I would set your drive aside as a cold spare. In the event of a failure, that gives you time to obtain a fresh drive. Honestly, you're running raidz2 and you're probably able to get to the server within a week of a faulted drive, so there is really no push for a hot spare, in my honest opinion.

But if you want to add it back to the pool as a hot spare, you could.
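
As a rough sketch of the two options, the helper below only prints the `zpool` command for each case so it can be reviewed before anything is run. The pool name `tank` and the function name `spare_commands` are placeholders, not anything from this thread; `/dev/sdf` is the drive from the report above, though a `/dev/disk/by-id/` path is usually a safer choice because sdX names can change between boots.

```sh
# Print (do not run) the zpool command for each spare strategy.
#   spare_commands hot  POOL DISK      -> attach DISK to POOL as a hot spare
#   spare_commands cold POOL OLD NEW   -> swap a shelved disk in for a faulted one
spare_commands() {
  case "$1" in
    hot)
      printf 'zpool add %s spare %s\n' "$2" "$3" ;;
    cold)
      printf 'zpool replace %s %s %s\n' "$2" "$3" "$4" ;;
    *)
      echo "usage: spare_commands hot POOL DISK | cold POOL OLD NEW" >&2
      return 1 ;;
  esac
}

# Example (placeholder names):
#   spare_commands hot tank /dev/sdf
#   spare_commands cold tank OLD_DISK /dev/sdf
```

Check `zpool status` first to get the exact pool and device names, and again afterwards to watch the resilver.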

Actually, I saw you were quite the busy bee while we were doing that.
You were handling what, three requests for help at the same time?

@Supermag
I like those R720s; they're good systems. Before repurposing Deepblue into a backup server, I was looking at that family as well as the T series.
So yeah, that answers the ECC RAM question.