Hi Hans,
The system just rebooted by itself. No crash dump was created, but I think I'm now in this situation:
> This looks vaguely similar to what we've seen on DELL H755 controllers,
> where the controller just drops dead after a while, needing a full power
because of this:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 16 13:38:31 ba948ca4-681c-4d79-981c-6da3ca6e6b05 PCIEX-8000-DJ Major
Host : pg-1
Platform : ProLiant-ML30-Gen10-Plus Chassis_id : CZJ2360PLT
Product_sn :
Fault class : fault.io.pciex.device-noresp 40%
fault.io.pciex.device-interr 40%
fault.io.pciex.bus-noresp 20%
Affects : dev:////pci@0,0/pci8086,43b8@1c/pci1590,32b@0
faulted and taken out of service
FRU : "PCI-E Slot 4" (hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0)
faulty
Description : A problem has been detected on one of the specified devices or on
one of the specified connecting buses.
information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : If a plug-in card is involved check for badly-seated cards or
bent pins. Otherwise schedule a repair procedure to replace the
affected device(s). Use fmadm faulty to identify the devices or
contact your illumos distribution team for support.
PCI-E Slot 4 is where the controller is located.
zpool status shows the pool as ONLINE, but it is not:
zpool status
pool: dati
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
scan: scrub repaired 0 in 0 days 01:07:00 with 0 errors on Fri Feb 16 12:30:01 2024
config:
NAME STATE READ WRITE CKSUM
dati ONLINE 0 39 0
mirror-0 ONLINE 0 45 0
c4t5000CCA85EE5ECB0d0 ONLINE 0 49 0
c4t50014EE6B2513C38d0 ONLINE 0 49 0
mirror-2 ONLINE 0 73 0
c4t5000C500AAF9B0C3d0 ONLINE 0 83 0
c4t5000C500AAF9BF2Fd0 ONLINE 0 83 0
logs
c3t001B448B4A7140BFd0s0 ONLINE 0 0 0
cache
c3t001B448B4A7140BFd0s1 ONLINE 0 0 0
errors: 20 data errors, use '-v' for a list
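To make the mismatch between the ONLINE state and the error counters explicit, here is a minimal Python sketch that lists every row of the config table above with a non-zero READ, WRITE or CKSUM counter. It is only an illustration, not part of any ZFS tooling: the names ZPOOL_CONFIG, ROW and rows_with_errors are my own, and it assumes the counters are plain integers (zpool may abbreviate large values, which this does not handle).

```python
import re

# Rows copied from the config section of the zpool status output above.
ZPOOL_CONFIG = """\
NAME STATE READ WRITE CKSUM
dati ONLINE 0 39 0
mirror-0 ONLINE 0 45 0
c4t5000CCA85EE5ECB0d0 ONLINE 0 49 0
c4t50014EE6B2513C38d0 ONLINE 0 49 0
mirror-2 ONLINE 0 73 0
c4t5000C500AAF9B0C3d0 ONLINE 0 83 0
c4t5000C500AAF9BF2Fd0 ONLINE 0 83 0
logs
c3t001B448B4A7140BFd0s0 ONLINE 0 0 0
cache
c3t001B448B4A7140BFd0s1 ONLINE 0 0 0
"""

# A data row is: name, state, then three integer counters (READ WRITE CKSUM).
# The header row and the bare "logs"/"cache" labels do not match this pattern.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def rows_with_errors(text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    found = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            found.append((name, state, read, write, cksum))
    return found


if __name__ == "__main__":
    for name, state, read, write, cksum in rows_with_errors(ZPOOL_CONFIG):
        print(f"{name}: state={state} read={read} write={write} cksum={cksum}")
```

On the table above this reports the pool, both mirrors and all four c4t... disks, each still marked ONLINE; the log and cache slices on the NVMe device are not reported.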
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
dati 928G 80.1G 848G - - 12% 8% 1.00x ONLINE -
rpool 222G 138G 83.5G - - 0% 62% 1.00x ONLINE -
With format, I don't see the vdevs that dati is built on:
format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c2t3d0 <SanDisk-SSD PLUS 240GB-UF8704RL-223.57GB>
/pci@0,0/pci1590,28d@17/disk@3,0
1. c2t4d0 <WDC- WDS100T1R0A-68A4W0-411010WR-931.51GB>
/pci@0,0/pci1590,28d@17/disk@4,0
2. c3t001B448B4A7140BFd0 <WD_BLACK-SN770 500GB-731100WD cyl 38932 alt 0 hd 224 sec 112>
/pci@0,0/pci8086,43c4@1b,4/pci15b7,5017@0/blkdev@w001B448B4A7140BF,0
3. c5tACE42E0005CFF480d0 <NVMe-VS000480KXALB-85030G00-447.13GB>
/pci@0,0/pci8086,43b0@1d/pci1c5c,2f3@0/blkdev@wACE42E0005CFF480,0
Specify disk (enter its number): ^C
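The same comparison can be done mechanically. The following is a rough Python sketch, assuming the device names are taken from the zpool status and format listings above; FORMAT_OUTPUT, POOL_DEVICES, visible_disks and missing_devices are names I made up for the example, and the slice suffix (s0, s1) is stripped before comparing.

```python
import re

# Numbered lines copied from the format listing above.
FORMAT_OUTPUT = """\
0. c2t3d0 <SanDisk-SSD PLUS 240GB-UF8704RL-223.57GB>
1. c2t4d0 <WDC- WDS100T1R0A-68A4W0-411010WR-931.51GB>
2. c3t001B448B4A7140BFd0 <WD_BLACK-SN770 500GB-731100WD cyl 38932 alt 0 hd 224 sec 112>
3. c5tACE42E0005CFF480d0 <NVMe-VS000480KXALB-85030G00-447.13GB>
"""

# Leaf devices of pool dati, copied from the zpool status output.
POOL_DEVICES = [
    "c4t5000CCA85EE5ECB0d0",
    "c4t50014EE6B2513C38d0",
    "c4t5000C500AAF9B0C3d0",
    "c4t5000C500AAF9BF2Fd0",
    "c3t001B448B4A7140BFd0s0",
    "c3t001B448B4A7140BFd0s1",
]


def visible_disks(format_text):
    """Collect the disk names from lines of the form '<index>. <name> <...>'."""
    return set(re.findall(r"^\s*\d+\.\s+(\S+)", format_text, flags=re.MULTILINE))


def missing_devices(pool_devices, format_text):
    """Return the pool devices whose disk is not in the format listing."""
    visible = visible_disks(format_text)
    missing = []
    for dev in pool_devices:
        disk = re.sub(r"s\d+$", "", dev)  # drop a trailing slice such as s0
        if disk not in visible:
            missing.append(dev)
    return missing


if __name__ == "__main__":
    for dev in missing_devices(POOL_DEVICES, FORMAT_OUTPUT):
        print("not visible to format:", dev)
```

With the listings above, the four c4t... disks behind the controller are the ones reported as not visible, while the log and cache slices on c3t001B448B4A7140BFd0 are still found.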
zpool status -v dati gives this error, probably because it can't read from the pool:
errors: List of errors unavailable (insufficient privileges)
The last line of dmesg reads:
Feb 16 13:38:43 pg-1 genunix: [ID 390243 kern.info] Creating /etc/devices/retire_store
That file contains the ID of the controller.
I'll power it off to see if upon restart it goes back to a working state.
Regards.
Maurilio