Hi Hans,

The system just rebooted by itself and no crash dump was created, but I now think I'm in the situation you described:

This looks vaguely similar to what we've seen on DELL H755 controllers,
where the controller just drops dead after a while, needing a full power
cycle to come back to life. (See https://www.illumos.org/issues/15935)

because of this:

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 16 13:38:31 ba948ca4-681c-4d79-981c-6da3ca6e6b05 PCIEX-8000-DJ  Major

Host        : pg-1
Platform    : ProLiant-ML30-Gen10-Plus  Chassis_id  : CZJ2360PLT
Product_sn  :

Fault class : fault.io.pciex.device-noresp 40%
              fault.io.pciex.device-interr 40%
              fault.io.pciex.bus-noresp 20%
Affects     : dev:////pci@0,0/pci8086,43b8@1c/pci1590,32b@0
                  faulted and taken out of service
FRU         : "PCI-E Slot 4" (hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0)
                  faulty

Description : A problem has been detected on one of the specified devices or on
              one of the specified connecting buses.
              Refer to http://illumos.org/msg/PCIEX-8000-DJ for more
              information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device(s). Use fmadm faulty to identify the devices or
              contact your illumos distribution team for support.

PCI-E Slot 4 is where the controller is located.

zpool status shows the pool as ONLINE, but it is not:

zpool status
  pool: dati
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0 days 01:07:00 with 0 errors on Fri Feb 16 12:30:01 2024
config:

  NAME                       STATE   READ WRITE CKSUM
  dati                       ONLINE     0    39     0
    mirror-0                 ONLINE     0    45     0
      c4t5000CCA85EE5ECB0d0  ONLINE     0    49     0
      c4t50014EE6B2513C38d0  ONLINE     0    49     0
    mirror-2                 ONLINE     0    73     0
      c4t5000C500AAF9B0C3d0  ONLINE     0    83     0
      c4t5000C500AAF9BF2Fd0  ONLINE     0    83     0
  logs
    c3t001B448B4A7140BFd0s0  ONLINE     0     0     0
  cache
    c3t001B448B4A7140BFd0s1  ONLINE     0     0     0

errors: 20 data errors, use '-v' for a list

zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
dati    928G  80.1G   848G        -         -    12%     8%  1.00x  ONLINE  -
rpool   222G   138G  83.5G        -         -     0%    62%  1.00x  ONLINE  -

With format I don't see the vdevs on which dati is built:

format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c2t3d0 <SanDisk-SSD PLUS 240GB-UF8704RL-223.57GB>
          /pci@0,0/pci1590,28d@17/disk@3,0
       1. c2t4d0 <WDC- WDS100T1R0A-68A4W0-411010WR-931.51GB>
          /pci@0,0/pci1590,28d@17/disk@4,0
       2. c3t001B448B4A7140BFd0 <WD_BLACK-SN770 500GB-731100WD cyl 38932 alt 0 hd 224 sec 112>
          /pci@0,0/pci8086,43c4@1b,4/pci15b7,5017@0/blkdev@w001B448B4A7140BF,0
       3. c5tACE42E0005CFF480d0 <NVMe-VS000480KXALB-85030G00-447.13GB>
          /pci@0,0/pci8086,43b0@1d/pci1c5c,2f3@0/blkdev@wACE42E0005CFF480,0
Specify disk (enter its number): ^C
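
In case it helps with comparing the two listings, here is a minimal sketch in Python of how the mismatch could be checked mechanically. It assumes the zpool status and format outputs are available as plain strings; the names device_names and missing_devices are only illustrative, and the pattern is a guess based on the cXtYdZ names seen above, not a general device-name parser.

```python
import re

# Matches names such as c4t5000CCA85EE5ECB0d0 or c2t3d0. An optional slice
# suffix (s0, s1, ...) is accepted but left out of the captured name.
# Assumption: every disk of interest follows this cXtYdZ naming scheme.
DEVICE_RE = re.compile(r"\b(c\d+t[0-9A-Fa-f]+d\d+)(?:s\d+)?\b")


def device_names(text):
    """Return the set of cXtYdZ device names found in text."""
    return set(DEVICE_RE.findall(text))


def missing_devices(zpool_status_output, format_output):
    """Return pool devices that the format listing does not show, sorted."""
    return sorted(device_names(zpool_status_output) - device_names(format_output))


# Small excerpt of the outputs above, pasted in by hand.
zpool_status = """
c4t5000CCA85EE5ECB0d0 ONLINE 0 49 0
c3t001B448B4A7140BFd0s0 ONLINE 0 0 0
"""
format_listing = """
0. c2t3d0 <SanDisk-SSD PLUS 240GB-UF8704RL-223.57GB>
2. c3t001B448B4A7140BFd0 <WD_BLACK-SN770 500GB-731100WD cyl 38932 alt 0 hd 224 sec 112>
"""
print(missing_devices(zpool_status, format_listing))
# prints ['c4t5000CCA85EE5ECB0d0']
```

Run against the full outputs, this should report the four c4t... disks, which are the ones behind the controller in PCI-E Slot 4.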

zpool status -v dati gives this error, probably because it can't read from the pool:

errors: List of errors unavailable (insufficient privileges)

The last line of dmesg reads:

Feb 16 13:38:43 pg-1 genunix: [ID 390243 kern.info] Creating /etc/devices/retire_store

That file contains the ID of the controller.

I'll power it off to see if upon restart it goes back to a working state.

Regards.

Maurilio