public inbox for discuss@lists.illumos.org (since 2011-08)
* Disk resilvering problem
@ 2024-07-08 10:38 Udo Grabowski (IMK)
  2024-07-08 10:48 ` [discuss] " Toomas Soome
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Udo Grabowski (IMK) @ 2024-07-08 10:38 UTC (permalink / raw)
  To: discuss

[-- Attachment #1: Type: text/plain, Size: 1081 bytes --]

Hi,

we currently have a raid-z1 pool resilvering (two damaged devices
in different vdevs), but a third disk in one of the degraded vdevs
occasionally times out:

Jul  7 06:25:36 imksunth8 scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci 
(scsi_vhci0):
Jul  7 06:25:36 imksunth8       /scsi_vhci/disk@g5000cca2441ed63c (sd184): 
Command Timeout on path mpt_sas4/disk@w5000cca2441ed63d,0

The problem: These hiccups cause the resilvering to RESTART! That
doesn't help to get the job done quickly, it just accelerates the
wear on the already unhealthy third disk, and will finally spiral down
to complete data loss because of a 2-disk failure on a z1 vdev.

Is there a way to switch off this behaviour via a kmdb parameter that
can be set while operating (it's an older illumos-cf25223258 from 2016)?
-- 
Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT
https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology          https://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5804 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [discuss] Disk resilvering problem
  2024-07-08 10:38 Disk resilvering problem Udo Grabowski (IMK)
@ 2024-07-08 10:48 ` Toomas Soome
  2024-07-08 11:11   ` bronkoo
  2024-07-08 11:36 ` Keith Hall
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Toomas Soome @ 2024-07-08 10:48 UTC (permalink / raw)
  To: illumos-discuss



> On 8. Jul 2024, at 13:38, Udo Grabowski (IMK) <udo.grabowski@kit.edu> wrote:
> 
> Hi,
> 
> we currently have a raid-z1 pool resilvering (two damaged devices
> in different vdevs), but a third disk in one of the degraded vdevs
> occasionally times out:
> 
> Jul  7 06:25:36 imksunth8 scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
> Jul  7 06:25:36 imksunth8       /scsi_vhci/disk@g5000cca2441ed63c (sd184): Command Timeout on path mpt_sas4/disk@w5000cca2441ed63d,0
> 
> The problem: These hiccups cause the resilvering to RESTART! That
> doesn't help to get the job done quickly, it just accelerates the
> wear on the already unhealthy third disk, and will finally spiral down
> to complete data loss because of a 2-disk failure on a z1 vdev.
> 
> Is there a way to switch off this behaviour via a kmdb parameter that
> can be set while operating (it's an older illumos-cf25223258 from 2016)?
> -- 
> Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT
> https://www.imk-asf.kit.edu/english/sat.php
> KIT - Karlsruhe Institute of Technology          https://www.kit.edu
> Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026
> 


I would use a live boot with a more recent image and get the resilver done (hopefully faster). 2016 is a very old setup and you are basically missing all the improvements done to the resilver code since then…

rgds,
toomas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [discuss] Disk resilvering problem
  2024-07-08 10:48 ` [discuss] " Toomas Soome
@ 2024-07-08 11:11   ` bronkoo
  0 siblings, 0 replies; 11+ messages in thread
From: bronkoo @ 2024-07-08 11:11 UTC (permalink / raw)
  To: illumos-discuss

[-- Attachment #1: Type: text/plain, Size: 237 bytes --]

Believe me, at the end of your journey you will have to use your data backup.
Start to rebuild your pool with mirrored vdevs only and copy, copy, copy... this way you save time for your research.

-- 
Sapere aude
https://klima-wahrheiten.de

[-- Attachment #2: Type: text/html, Size: 400 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [discuss] Disk resilvering problem
  2024-07-08 10:38 Disk resilvering problem Udo Grabowski (IMK)
  2024-07-08 10:48 ` [discuss] " Toomas Soome
@ 2024-07-08 11:36 ` Keith Hall
  2024-07-08 12:05   ` Marcel Telka
  2024-07-08 16:10   ` Udo Grabowski (IMK)
  2024-07-08 16:04 ` Bill Sommerfeld
  2024-08-02 12:53 ` Udo Grabowski (IMK)
  3 siblings, 2 replies; 11+ messages in thread
From: Keith Hall @ 2024-07-08 11:36 UTC (permalink / raw)
  To: illumos-discuss

Can you use ddrescue to copy the data off the "unhealthy third disk" to another good 'donor' drive, that can then be used for the resilvering process without the timeouts problem (assuming timeouts are drive not controller/cable related)? 

ddrescue will continue after retry with all uncopied data, not copying stuff already copied.
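
For illustration only, a conservative invocation might look something like this (device names and mapfile path are placeholders, not taken from this thread, and the exact options should be checked against the ddrescue manual on your platform; the mapfile is what lets ddrescue resume and skip already-copied blocks):

  # ddrescue -d -r3 /dev/rdsk/c0tSOURCEd0s0 /dev/rdsk/c0tDONORd0s0 /var/tmp/rescue.map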

Keith

-----Original Message-----
From: Udo Grabowski (IMK) <udo.grabowski@kit.edu> 
Sent: 08 July 2024 11:38
To: discuss@lists.illumos.org
Subject: [discuss] Disk resilvering problem

Hi,

we currently have a raid-z1 pool resilvering (two damaged devices in different vdevs), but a third disk in one of the degraded vdevs occasionally times out:

Jul  7 06:25:36 imksunth8 scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci
(scsi_vhci0):
Jul  7 06:25:36 imksunth8       /scsi_vhci/disk@g5000cca2441ed63c (sd184): 
Command Timeout on path mpt_sas4/disk@w5000cca2441ed63d,0

The problem: These hiccups cause the resilvering to RESTART! That doesn't help to get the job done quickly, it just accelerates the wear on the already unhealthy third disk, and will finally spiral down to complete data loss because of a 2-disk failure on a z1 vdev.

Is there a way to switch off this behaviour via a kmdb parameter that can be set while operating (it's an older illumos-cf25223258 from 2016)?
--
Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology          https://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [discuss] Disk resilvering problem
  2024-07-08 11:36 ` Keith Hall
@ 2024-07-08 12:05   ` Marcel Telka
  2024-07-08 16:10   ` Udo Grabowski (IMK)
  1 sibling, 0 replies; 11+ messages in thread
From: Marcel Telka @ 2024-07-08 12:05 UTC (permalink / raw)
  To: illumos-discuss

On Mon, Jul 08, 2024 at 11:36:40AM +0000, Keith Hall via illumos-discuss wrote:
> Can you use ddrescue to copy the data off the "unhealthy third disk" to another good 'donor' drive, that can then be used for the resilvering process without the timeouts problem (assuming timeouts are drive not controller/cable related)? 
> 
> ddrescue will continue after retry with all uncopied data, not copying stuff already copied.
> 
> Keith
> 
> -----Original Message-----
> From: Udo Grabowski (IMK) <udo.grabowski@kit.edu> 
> Sent: 08 July 2024 11:38
> To: discuss@lists.illumos.org
> Subject: [discuss] Disk resilvering problem
> 

[...snip...]

> The problem: These hiccups cause the resilvering to RESTART! That
> doesn't help to get the job done quickly, it just accelerates the
> wear on the already unhealthy third disk, and will finally spiral down
> to complete data loss because of a 2-disk failure on a z1 vdev.

I have seen similar situations several times.  In our use case, if we see
resilver restarts on a production machine and all attempts to solve the
restarts have failed, we just start to move data to another zfs pool until
the resilver completes or the pool is emptied (and so the resilver completes
fast because it has basically nothing left to do).
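
(As a purely illustrative sketch with placeholder names - "tank/data" as the source dataset and "otherpool" as the rescue pool - moving a dataset that way boils down to a snapshot plus send/receive:)

  # zfs snapshot tank/data@evacuate
  # zfs send tank/data@evacuate | zfs recv otherpool/data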

-- 
+-------------------------------------------+
| Marcel Telka   e-mail:   marcel@telka.sk  |
|                homepage: http://telka.sk/ |
+-------------------------------------------+

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [discuss] Disk resilvering problem
  2024-07-08 10:38 Disk resilvering problem Udo Grabowski (IMK)
  2024-07-08 10:48 ` [discuss] " Toomas Soome
  2024-07-08 11:36 ` Keith Hall
@ 2024-07-08 16:04 ` Bill Sommerfeld
  2024-08-02 12:53 ` Udo Grabowski (IMK)
  3 siblings, 0 replies; 11+ messages in thread
From: Bill Sommerfeld @ 2024-07-08 16:04 UTC (permalink / raw)
  To: discuss

On 7/8/24 03:38, Udo Grabowski (IMK) wrote:
> Hi,
> 
> we currently have a raid-z1 pool resilvering (two damaged devices
> in different vdevs), but a third disk in one of the degraded vdevs
> occasionally times out:
> 
> Jul  7 06:25:36 imksunth8 scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
> Jul  7 06:25:36 imksunth8       /scsi_vhci/disk@g5000cca2441ed63c (sd184): Command Timeout on path mpt_sas4/disk@w5000cca2441ed63d,0
> 
> The problem: These hiccups cause the resilvering to RESTART! That
> doesn't help to get the job done quickly, it just accelerates the
> wear on the already unhealthy third disk, and will finally spiral down
> to complete data loss because of a 2-disk failure on a z1 vdev.
> 
> Is there a way to switch off this behaviour via a kmdb parameter that
> can be set while operating (it's an older illumos-cf25223258 from 2016)?
Note that there have been a bunch of fixes since 2016 which improve 
behavior in this scenario; with more recent illumos the in-progress 
resilver should run to completion and then (at least in some cases) 
another resilver pass will run to clean up errors detected while the 
first was running.

165c5c6fe7 12774 Resilver restarts unnecessarily when it encounters errors
0c06d385ea 12636 Prevent unnecessary resilver restarts
233f6c4995 disabled resilver_defer feature leads to looping resilvers
e4c795beb3 10952 defer new resilvers and misc. resilver-related fixes
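
(For reference: on an image new enough to carry these fixes, whether the deferred-resilver pool feature is present and enabled can be checked with something like the following; "tank" is a placeholder pool name.)

  # zpool get feature@resilver_defer tank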

							- Bill


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [discuss] Disk resilvering problem
  2024-07-08 11:36 ` Keith Hall
  2024-07-08 12:05   ` Marcel Telka
@ 2024-07-08 16:10   ` Udo Grabowski (IMK)
  2024-07-08 17:10     ` Keith Hall
  2024-07-08 17:17     ` Toomas Soome
  1 sibling, 2 replies; 11+ messages in thread
From: Udo Grabowski (IMK) @ 2024-07-08 16:10 UTC (permalink / raw)
  To: discuss

[-- Attachment #1: Type: text/plain, Size: 1808 bytes --]

On 08/07/2024 13:36, Keith Hall via illumos-discuss wrote:
> Can you use ddrescue to copy the data off the "unhealthy third disk" to another good 'donor' drive, that can then be used for the resilvering process without the timeouts problem (assuming timeouts are drive not controller/cable related)?
>
> ddrescue will continue after retry with all uncopied data, not copying stuff already copied.
>

Started that, but the question here is how to reinsert the disk copy
into the pool, since the devices are recorded in the (checksum-protected)
label. Will zpool export/zpool import fix that when the defective disk is
pulled and the donor has the wrong path/phys_path/devid/fru (compared to
the copied label)?

# zdb -l /dev/dsk/c0t5000CCA2441F3258d0s0  #(donor disk)
....
         children[0]:
             type: 'spare'
             id: 0
             guid: 14397730801213461149
             whole_disk: 0
             create_txg: 4
             children[0]:
                 type: 'disk'
                 id: 0
                 guid: 4559381366945349747
                 path: '/dev/dsk/c0t5000CCA2441ED19Cd0s0'
                 devid: 'id1,sd@n5000cca2441ed19c/a'
                 phys_path: '/scsi_vhci/disk@g5000cca2441ed19c:a'
                 fru: 
'hc://:product-id=DataON-CIB-9470V2-E0:server-id=:chassis-id=500a0d10000eee3f:serial=N8GJYE7Y:part=HGST-HUS726040AL5210:revision=A907/ses-enclosure=0/bay=0/disk=0'
                 whole_disk: 1
                 DTL: 109
                 create_txg: 4
....

-- 
Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT
https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology          https://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5804 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [discuss] Disk resilvering problem
  2024-07-08 16:10   ` Udo Grabowski (IMK)
@ 2024-07-08 17:10     ` Keith Hall
  2024-07-08 17:17     ` Toomas Soome
  1 sibling, 0 replies; 11+ messages in thread
From: Keith Hall @ 2024-07-08 17:10 UTC (permalink / raw)
  To: illumos-discuss

I can't answer that specific question, I'm afraid; I'd have to defer to the more knowledgeable. However, I would have thought there'd be a way it can get updated, and at least you'd have a safe copy of the data anyway should that drive fail entirely.

Booting a live/rescue CD with more up-to-date resilvering code, as others have suggested, would be the next best step - if that works, fine; if not, at least you have a copy of the raw data.

Either way it's in data recovery territory and the usual caveats apply (double/triple-check sources/destinations/everything you type etc., and be prepared for it not to work)!

-----Original Message-----
From: Udo Grabowski (IMK) <udo.grabowski@kit.edu> 
Sent: 08 July 2024 17:11
To: discuss@lists.illumos.org
Subject: Re: [discuss] Disk resilvering problem

On 08/07/2024 13:36, Keith Hall via illumos-discuss wrote:
> Can you use ddrescue to copy the data off the "unhealthy third disk" to another good 'donor' drive, that can then be used for the resilvering process without the timeouts problem (assuming timeouts are drive not controller/cable related)?
>
> ddrescue will continue after retry with all uncopied data, not copying stuff already copied.
>

Started that, but the question here is how to reinsert the disk copy into the pool, since the devices are recorded in the (checksum-protected) label. Will zpool export/zpool import fix that when the defective disk is pulled and the donor has the wrong path/phys_path/devid/fru (compared to the copied label)?

# zdb -l /dev/dsk/c0t5000CCA2441F3258d0s0  #(donor disk)
....
         children[0]:
             type: 'spare'
             id: 0
             guid: 14397730801213461149
             whole_disk: 0
             create_txg: 4
             children[0]:
                 type: 'disk'
                 id: 0
                 guid: 4559381366945349747
                 path: '/dev/dsk/c0t5000CCA2441ED19Cd0s0'
                 devid: 'id1,sd@n5000cca2441ed19c/a'
                 phys_path: '/scsi_vhci/disk@g5000cca2441ed19c:a'
                 fru: 
'hc://:product-id=DataON-CIB-9470V2-E0:server-id=:chassis-id=500a0d10000eee3f:serial=N8GJYE7Y:part=HGST-HUS726040AL5210:revision=A907/ses-enclosure=0/bay=0/disk=0'
                 whole_disk: 1
                 DTL: 109
                 create_txg: 4
....

--
Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology          https://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [discuss] Disk resilvering problem
  2024-07-08 16:10   ` Udo Grabowski (IMK)
  2024-07-08 17:10     ` Keith Hall
@ 2024-07-08 17:17     ` Toomas Soome
  1 sibling, 0 replies; 11+ messages in thread
From: Toomas Soome @ 2024-07-08 17:17 UTC (permalink / raw)
  To: illumos-discuss



> On 8. Jul 2024, at 19:10, Udo Grabowski (IMK) <udo.grabowski@kit.edu> wrote:
> 
> On 08/07/2024 13:36, Keith Hall via illumos-discuss wrote:
>> Can you use ddrescue to copy the data off the "unhealthy third disk" to another good 'donor' drive, that can then be used for the resilvering process without the timeouts problem (assuming timeouts are drive not controller/cable related)?
>> 
>> ddrescue will continue after retry with all uncopied data, not copying stuff already copied.
>> 
> 
> Started that, but the question here is how to reinsert the disk copy
> into the pool, since the devices are recorded in the (checksum-protected)
> label. Will zpool export/zpool import fix that when the defective disk is
> pulled and the donor has the wrong path/phys_path/devid/fru (compared to
> the copied label)?
> 

If you did clone the whole partition, with the pool labels, then you should be OK and the pool import will update the device paths etc. I'm not exactly sure about the fru, but whatever added it should also be able to update it.
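
(Roughly, and with "tank" as a placeholder pool name, that route would look something like the following; -d just makes the device directory to scan explicit:)

  # zpool export tank
  # zpool import -d /dev/dsk tank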

rgds,
toomas

> # zdb -l /dev/dsk/c0t5000CCA2441F3258d0s0  #(donor disk)
> ....
>        children[0]:
>            type: 'spare'
>            id: 0
>            guid: 14397730801213461149
>            whole_disk: 0
>            create_txg: 4
>            children[0]:
>                type: 'disk'
>                id: 0
>                guid: 4559381366945349747
>                path: '/dev/dsk/c0t5000CCA2441ED19Cd0s0'
>                devid: 'id1,sd@n5000cca2441ed19c/a'
>                phys_path: '/scsi_vhci/disk@g5000cca2441ed19c:a'
>                fru: 'hc://:product-id=DataON-CIB-9470V2-E0:server-id=:chassis-id=500a0d10000eee3f:serial=N8GJYE7Y:part=HGST-HUS726040AL5210:revision=A907/ses-enclosure=0/bay=0/disk=0'
>                whole_disk: 1
>                DTL: 109
>                create_txg: 4
> ....
> 
> -- 
> Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT
> https://www.imk-asf.kit.edu/english/sat.php
> KIT - Karlsruhe Institute of Technology          https://www.kit.edu
> Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026
> 
> 
> ------------------------------------------
> illumos: illumos-discuss
> Permalink: https://illumos.topicbox.com/groups/discuss/T2a32a4cc427e4845-M47063be192fb1724f0bf62b0
> Delivery options: https://illumos.topicbox.com/groups/discuss/subscription


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [discuss] Disk resilvering problem
  2024-07-08 10:38 Disk resilvering problem Udo Grabowski (IMK)
                   ` (2 preceding siblings ...)
  2024-07-08 16:04 ` Bill Sommerfeld
@ 2024-08-02 12:53 ` Udo Grabowski (IMK)
  2024-08-02 18:43   ` Keith Hall
  3 siblings, 1 reply; 11+ messages in thread
From: Udo Grabowski (IMK) @ 2024-08-02 12:53 UTC (permalink / raw)
  To: discuss

[-- Attachment #1: Type: text/plain, Size: 4398 bytes --]

On 08/07/2024 12:38, Udo Grabowski (IMK) wrote:
> Hi,
>
> we currently have a raid-z1 pool resilvering (two damaged devices
> in different vdevs), but a third disk in one of the degraded vdevs
> occasionally times out:
>
> Jul  7 06:25:36 imksunth8 scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci
> (scsi_vhci0):
> Jul  7 06:25:36 imksunth8       /scsi_vhci/disk@g5000cca2441ed63c (sd184):
> Command Timeout on path mpt_sas4/disk@w5000cca2441ed63d,0
>
> The problem: These hiccups cause the resilvering to RESTART! That
> doesn't help to get the job done quickly, it just accelerates the
> wear on the already unhealthy third disk, and will finally spiral down
> to complete data loss because of a 2-disk failure on a z1 vdev.
>
> Is there a way to switch off this behaviour via a kmdb parameter that
> can be set while operating (it's an older illumos-cf25223258 from 2016)?
>
>

Thanks to Toomas, Bronkoo, Keith, Marcel, and Bill for all the suggestions
to get this pool rescued - I had to use all of the mentioned options!

Especially the ddrescue disk-clone tip from Keith was worth a ton of gold,
since it saved me at the very last moment from losing the whole pool.
Cloned, checked the label again directly before the exchange for matching
TXGs, and the exchange went smoothly; the label got updated correctly.
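
(For anyone repeating this: the labels of the clone and of a healthy pool member can be compared with zdb right before the swap; the device name below is a placeholder, and the interesting fields are the txg and guid entries.)

  # zdb -l /dev/dsk/c0tCLONEd0s0 | egrep 'txg|guid'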

Indeed, after that I lost another 2nd disk in a vdev, and this time
that gave me 35k errors (interestingly, the pool didn't block), but all
except one single file was recoverable! All in all, 4 disks died in that
process, and 4 additional disks are considerably broken, but luckily all
are spread across the vdevs, so there is no immediate problem if they fail now.

I started backup copy operations immediately (despite the resilvering pool),
which are still ongoing, as there are 200 TB to save to a space I had to find
first, since we cannot afford a backup facility of that size (the data
was meant to be in a temporary state that would be transformed and transferred
to an archive later, but it stayed a bit too long on that pool), so a bit of
begging and convincing of central operations staff was necessary ...

As I can't 'zpool replace' the rest of the broken disks, since the
broken expander chip causes all the other disks to spit errors like hell,
I will go down the ddrescue path again to clone and directly replace
those disks; that seems to be a recommendable practice in such cases,
as the rest of the vdev disks are not accessed, but pool writes should be
strictly suppressed during such an operation to keep the TXGs consistent.

I also found that when a disk replacement goes wild very early, it's
best to immediately pull the replacement and put the disk that was
to be replaced back in. That saved me from losing everything in another
event.

Since the machine (an 8-year-old DataOn CIB-9470V2 dual-node
storage server) now also has failing 12V/5V power rails, it has
to go now. We are trying to get an all-flash dual-node one; the NVMe
ones are now in a price region reachable for us, and would be MUCH
faster.

So these are the essential lessons learned from this experience:

1.-10. BACKUP, if you can ...

11. Nothing is as permanent as temporary data ...

12. Check your disks' health regularly, either with smartmontools or with
     a regular, but not too frequently applied, scrub. Disk problems seem
     to pile up silently, causing havoc and disaster when a more serious,
     but otherwise recoverable, event arises.

13. If a 2nd disk failure in a vdev is on the rise, clone the worst
     disk (or the one that is still accessible) before the fatal
     failure of both disks. smartmontools are your friend.

14. Do not export a failing pool. If you have to export, first try all
     measures to get it into an easily recoverable state.
     And do not switch off the machine in a panic.

15. Stop a disk replacement immediately if that goes downhill early.

16. Be patient. Let ZFS do its thing. It does it well.

17. ZFS rocks, it's simply the safest place on earth for your data!

Thanks again for all that valuable help !
-- 
Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT
https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology          https://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5804 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [discuss] Disk resilvering problem
  2024-08-02 12:53 ` Udo Grabowski (IMK)
@ 2024-08-02 18:43   ` Keith Hall
  0 siblings, 0 replies; 11+ messages in thread
From: Keith Hall @ 2024-08-02 18:43 UTC (permalink / raw)
  To: illumos-discuss

You're welcome. ddrescue has saved the day too many times for me not to share the concept!

Scrubbing is also very underrated! I spent a long overnight recovery session restoring from tape backups back in the early 2000s (before ddrescue was even a concept) because a RAID 5 array with a hot spare had a disk error, and unfortunately one of the other data devices being used to rebuild the spare also had a bad-sector issue towards the end of reconstruction, which caused the whole thing to go down hours into the rebuild. Regular scrubbing would have caught it.
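
(A regular scrub is easy to automate; as a purely illustrative sketch, with the pool name and schedule as placeholders, a root crontab entry could be as simple as:)

  0 3 * * 0 /usr/sbin/zpool scrub tank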

I have my main zfs box running Samsung PM1643a enterprise SSDs with weekly scrubbing, and nightly zfs sends to a spinning-rust mirror as backup; fingers crossed that's enough, short of doing off-site storage.

I did have a recent scare, though: I think after 444 days of uptime there must have been a kernel memory corruption or leak or something in the mpt_sas driver that caused the pool to drop a device and go degraded...

Feb  4 04:12:12 box scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@15/pci15d9,691@0 (mpt_sas0):
Feb  4 04:12:12 box     unable to kmem_alloc enough memory for scatter/gather list
Feb  4 04:13:11 box fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
Feb  4 04:13:11 box EVENT-TIME: Sun Feb  4 04:13:07 GMT 2024
Feb  4 04:13:11 box PLATFORM: VMware-Virtual-Platform, CSN: VMware-56-4d-d5-d0-8b-95-d8-73-3f-c1-70-75-09-e7-6d-76, HOSTNAME: box
Feb  4 04:13:11 box SOURCE: zfs-diagnosis, REV: 1.0
Feb  4 04:13:11 box EVENT-ID: f4fc2f70-e629-4650-97a2-d6e6335ea6d4
Feb  4 04:13:11 box DESC: The number of checksum errors associated with a ZFS device
Feb  4 04:13:11 box exceeded acceptable levels.  Refer to http://illumos.org/msg/ZFS-8000-GH for more information.
Feb  4 04:13:11 box AUTO-RESPONSE: The device has been marked as degraded.  An attempt
Feb  4 04:13:11 box will be made to activate a hot spare if available.
Feb  4 04:13:11 box IMPACT: Fault tolerance of the pool may be compromised.
Feb  4 04:13:11 box REC-ACTION: Run 'zpool status -x' and replace the bad device.
Feb  4 04:15:02 box nvidia_modeset: [ID 107833 kern.notice] Unloading

Ironically this was during a scrub!

I started thinking the worst, that the expensive SSDs were all about to fail at a similar age, but then I saw the kmem_alloc error and realised it wasn't hardware... but could there have been a consequential problem caused by this?!

Thankfully a reboot and a zpool clear were enough, and an immediate scrub completed OK.

Keith

-----Original Message-----
From: Udo Grabowski (IMK) <udo.grabowski@kit.edu> 
Sent: 02 August 2024 13:53
To: discuss@lists.illumos.org
Subject: Re: [discuss] Disk resilvering problem

On 08/07/2024 12:38, Udo Grabowski (IMK) wrote:
> Hi,
>
> we currently have a raid-z1 pool resilvering (two damaged devices in 
> different vdevs), but a third disk in one of the degraded vdevs 
> occasionally times out:
>
>
> The problem: These hiccups cause the resilvering to RESTART! That
> doesn't help to get the job done quickly, it just accelerates the
> wear on the already unhealthy third disk, and will finally spiral down
> to complete data loss because of a 2-disk failure on a z1 vdev.

Thanks to Toomas, Bronkoo, Keith, Marcel, and Bill for all the suggestions to get this pool rescued - I had to use all of the mentioned options!

Especially the ddrescue disk-clone tip from Keith was worth a ton of gold, since it saved me at the very last moment from losing the whole pool.
Cloned, checked the label again directly before the exchange for matching TXGs, and the exchange went smoothly; the label got updated correctly.

Indeed, after that I lost another 2nd disk in a vdev, and this time that gave me 35k errors (interestingly, the pool didn't block), but all except one single file was recoverable! All in all, 4 disks died in that process, and 4 additional disks are considerably broken, but luckily all are spread across the vdevs, so there is no immediate problem if they fail now.

I started backup copy operations immediately (despite the resilvering pool), which are still ongoing, as there are 200 TB to save to a space I had to find first, since we cannot afford a backup facility of that size (the data was meant to be in a temporary state that would be transformed and transferred to an archive later, but it stayed a bit too long on that pool), so a bit of begging and convincing of central operations staff was necessary ...

As I can't 'zpool replace' the rest of the broken disks, since the broken expander chip causes all the other disks to spit errors like hell, I will go down the ddrescue path again to clone and directly replace those disks; that seems to be a recommendable practice in such cases, as the rest of the vdev disks are not accessed, but pool writes should be strictly suppressed during such an operation to keep the TXGs consistent.

I also found that when a disk replacement goes wild very early, it's best to immediately pull the replacement and put the disk that was to be replaced back in. That saved me from losing everything in another event.

Since the machine (an 8-year-old DataOn CIB-9470V2 dual-node storage server) now also has failing 12V/5V power rails, it has to go now. We are trying to get an all-flash dual-node one; the NVMe ones are now in a price region reachable for us, and would be MUCH faster.

So these are the essential lessons learned from this experience:

1.-10. BACKUP, if you can ...

11. Nothing is as permanent as temporary data ...

12. Check your disks' health regularly, either with smartmontools or with
     a regular, but not too frequently applied, scrub. Disk problems seem
     to pile up silently, causing havoc and disaster when a more serious,
     but otherwise recoverable, event arises.

13. If a 2nd disk failure in a vdev is on the rise, clone the worst
     disk (or the one that is still accessible) before the fatal
     failure of both disks. smartmontools are your friend.

14. Do not export a failing pool. If you have to export, first try all
     measures to get it into an easily recoverable state.
     And do not switch off the machine in a panic.

15. Stop a disk replacement immediately if that goes downhill early.

16. Be patient. Let ZFS do its thing. It does it well.

17. ZFS rocks, it's simply the safest place on earth for your data!

Thanks again for all that valuable help !
--
Dr.Udo Grabowski  Inst.of Meteorology & Climate Research IMK-ASF-SAT https://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology          https://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-08-02 18:43 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-08 10:38 Disk resilvering problem Udo Grabowski (IMK)
2024-07-08 10:48 ` [discuss] " Toomas Soome
2024-07-08 11:11   ` bronkoo
2024-07-08 11:36 ` Keith Hall
2024-07-08 12:05   ` Marcel Telka
2024-07-08 16:10   ` Udo Grabowski (IMK)
2024-07-08 17:10     ` Keith Hall
2024-07-08 17:17     ` Toomas Soome
2024-07-08 16:04 ` Bill Sommerfeld
2024-08-02 12:53 ` Udo Grabowski (IMK)
2024-08-02 18:43   ` Keith Hall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).