So, after upgrading the controller's firmware I rebooted the computer and restarted my tests, which ended a few hours later with the disks in a state similar to this:

extended device statistics                          ---- errors ---
r/s  w/s kr/s kw/s  wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0  3.0  0.0  0.1   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2
0.0  2.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t3d0
0.0  1.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t4d0
0.0  0.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c3
0.0  0.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c3t001B448B4A7140BFd0
0.0  0.0  0.0  0.0   0.0 13.0    0.0    0.0   0 400   0   0   0   0 c4
0.0  0.0  0.0  0.0   0.0  3.0    0.0    0.0   0 100   0   0   0   0 c4t50014EE6B2513C38d0
0.0  0.0  0.0  0.0   0.0  3.0    0.0    0.0   0 100   0   0   0   0 c4t5000CCA85EE5ECB0d0
0.0  0.0  0.0  0.0   0.0  4.0    0.0    0.0   0 100   0   0   0   0 c4t5000C500AAF9B0C3d0
0.0  0.0  0.0  0.0   0.0  3.0    0.0    0.0   0 100   0   0   0   0 c4t5000C500AAF9BF2Fd0
0.0  0.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c5
0.0  0.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c5tACE42E0005CFF480d0
0.0  0.0  0.0  0.0 246.0 13.0    0.0    0.0 100 100   0   0   0   0 dati
0.0  0.0  0.0  0.0   0.2  0.1    0.0    0.0   2   2   0   0   0   0 rpool

extended device statistics                          ---- errors ---
r/s  w/s kr/s kw/s  wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0 15.0  0.0  0.4   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2
0.0  7.0  0.0  0.2   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t3d0
0.0  8.0  0.0  0.2   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t4d0
0.0  0.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c3
0.0  0.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c3t001B448B4A7140BFd0
0.0  0.0  0.0  0.0   0.0 13.0    0.0    0.0   0 400   0   0   0   0 c4
0.0  0.0  0.0  0.0   0.0  3.0    0.0    0.0   0 100   0   0   0   0 c4t50014EE6B2513C38d0
0.0  0.0  0.0  0.0   0.0  3.0    0.0    0.0   0 100   0   0   0   0 c4t5000CCA85EE5ECB0d0
0.0  0.0  0.0  0.0   0.0  4.0    0.0    0.0   0 100   0   0   0   0 c4t5000C500AAF9B0C3d0
0.0  0.0  0.0  0.0   0.0  3.0    0.0    0.0   0 100   0   0   0   0 c4t5000C500AAF9BF2Fd0
0.0  0.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c5
0.0  0.0  0.0  0.0   0.0  0.0    0.0    0.0   0   0   0   0   0   0 c5tACE42E0005CFF480d0
0.0  0.0  0.0  0.0 246.0 13.0    0.0    0.0 100 100   0   0   0   0 dati
0.0  0.0  0.0  0.0   0.3  0.2    0.0    0.0   4   5   0   0   0   0 rpool
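
For anyone who wants to watch for this state automatically: the signature in both samples above is requests outstanding (actv > 0) with %b pinned at 100 or more while r/s and w/s stay at zero. Below is a minimal Python sketch that flags such rows. It assumes the 15-column layout shown above; the function names and the 100% threshold are my own choices for illustration, not part of any existing tool.

```python
# Hypothetical helper: flag devices that look stuck in iostat-style output.
# Assumes 14 numeric columns followed by the device name, as in the samples above.
COLUMNS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv", "wsvc_t",
           "asvc_t", "%w", "%b", "s/w", "h/w", "trn", "tot"]


def parse_iostat_rows(text):
    """Return one dict per data row; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) + 1:
            continue  # banner line, blank line, or unexpected layout
        try:
            values = [float(f) for f in fields[:-1]]
        except ValueError:
            continue  # column-name header row
        row = dict(zip(COLUMNS, values))
        row["device"] = fields[-1]
        rows.append(row)
    return rows


def stuck_devices(rows, busy_threshold=100.0):
    """Devices with outstanding requests, saturated %b, and no completed I/O."""
    return [r["device"] for r in rows
            if r["actv"] > 0
            and r["%b"] >= busy_threshold
            and r["r/s"] == 0
            and r["w/s"] == 0]


if __name__ == "__main__":
    import sys
    for name in stuck_devices(parse_iostat_rows(sys.stdin.read())):
        print(name)
```

Fed one of the samples above on stdin, this should list c4, the four c4t... disks and the dati pool, and leave out rpool, which has actv > 0 but is nowhere near 100% busy. A single sample can be a false positive, so in practice one would want the same device flagged over several consecutive intervals.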

So I powered off the unit and restarted it, but this time I ran "mdb -K" followed by ":c" on the console, so that the kernel debugger would be ready for the next lockup.

Instead, the unit spent the next two days running my tests (iozone + zfs recv + zpool scrub) without any problem.

Since I needed a crash dump taken during the lockup, I re-enabled the onboard ILO5 so that I could generate an NMI, started my tests again (without the mdb -K part), and went on with my other chores.

Well, it took less than three hours for the disks to get stuck again, just like above, but this time I generated an NMI and I have a crash dump here:

::msgbuf ends with these lmrc warnings/errors:

NOTICE: lmrc0: Drive 00(e252/Port 1I Box 0 Bay 0) Path 300062b20dde3640 reset (Type 03)

NOTICE: lmrc0: unknown AEN received, seqnum = 19954, timestamp = 761927783, code = 27f, locale = 2, class = 0, argtype = 10

NOTICE: lmrc0: Drive 00(e252/Port 1I Box 0 Bay 0) link speed changed

I hope this can shed some light on the problem, because it seems that running with the kernel debugger active alters something (timings?) just enough to make the system (a lot more) stable.

Only very rarely in the past two weeks have I been able to run my tests for 48 hours straight without issues.