* panic inside lmrc driver
@ 2024-02-07 10:31 maurilio.longo
2024-02-07 11:14 ` [developer] " Peter Tribble
2024-02-07 11:16 ` Hans Rosenfeld
0 siblings, 2 replies; 37+ messages in thread
From: maurilio.longo @ 2024-02-07 10:31 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 541 bytes --]
Hi,
I'm using OmniOS bloody 20240111 to see if my HPE MR216i-p controller is recognized.
It is: its PCI ID, pciex1000,10e2, is among the ones assigned to lmrc, but when booting I get a panic inside the driver, which can be seen here:
https://imgur.com/a/qhRuYJ2
Sorry for the low quality.
This is on an HPE ML30 Gen9 unit; I can give more info if needed.
I'd also like to thank all involved in its development, because this driver is a much-needed component for using modern HPE servers.
Best regards.
Maurilio.
[-- Attachment #2: Type: text/html, Size: 821 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-07 10:31 panic inside lmrc driver maurilio.longo
@ 2024-02-07 11:14 ` Peter Tribble
2024-02-07 11:16 ` Hans Rosenfeld
1 sibling, 0 replies; 37+ messages in thread
From: Peter Tribble @ 2024-02-07 11:14 UTC (permalink / raw)
To: illumos-developer
On Wed, Feb 7, 2024 at 10:31 AM maurilio.longo via illumos-developer <
developer@lists.illumos.org> wrote:
> Hi,
> I'm using omnios bloody 20240111 to see if my HPE MR216i-p controller is
> recognized.
>
There's a newer version - 20240206 - which I think has at least one lmrc
fix.
You may need to scroll to the bottom of the download page to see it
https://downloads.omnios.org/media/bloody/
> It is, its pci id, pciex1000,10e2, is between the ones assigned to lmrc,
> but when booting I get a panic inside the driver which can be seen here
>
> https://imgur.com/a/qhRuYJ2
>
> Sorry for the low quality.
>
> This is on a HPE ML30 Gen9 unit, I can give more info if needed.
>
> I'd also like to thank all involved in its development, because this
> driver is a much needed component to be able to use modern HPE servers.
>
> Best regards.
> Maurilio.
>
--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
* Re: [developer] panic inside lmrc driver
2024-02-07 10:31 panic inside lmrc driver maurilio.longo
2024-02-07 11:14 ` [developer] " Peter Tribble
@ 2024-02-07 11:16 ` Hans Rosenfeld
2024-02-07 11:46 ` maurilio.longo
1 sibling, 1 reply; 37+ messages in thread
From: Hans Rosenfeld @ 2024-02-07 11:16 UTC (permalink / raw)
To: illumos-developer
Hi Maurilio,
On Wed, Feb 07, 2024 at 05:31:28AM -0500, maurilio.longo via illumos-developer wrote:
> I'm using omnios bloody 20240111 to see if my HPE MR216i-p controller
> is recognized.
> It is, its pci id, pciex1000,10e2, is between the ones assigned to
> lmrc, but when booting I get a panic inside the driver which can be
> seen here
>
> https://imgur.com/a/qhRuYJ2
This is a panic stack I haven't seen before. Obviously it's running into
the default case at the end of lmrc_process_mpt_pkt(), meaning lmrc got
an unknown status from a command that completed and doesn't know how to
proceed from there.
Can you please send me the panic message? That should be included near
the end of the output of ::msgbuf, just before the panic stack. It
should look like this:
command failed, status = %x, ex_status = %x, cdb[0] = %x
(That being said, I realize starting a panic message with a ! is a bug
in itself. Sorry.)
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [developer] panic inside lmrc driver
2024-02-07 11:16 ` Hans Rosenfeld
@ 2024-02-07 11:46 ` maurilio.longo
2024-02-07 12:04 ` maurilio.longo
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: maurilio.longo @ 2024-02-07 11:46 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
here it is
https://imgur.com/a/ZaSogO7
Maurilio
* Re: [developer] panic inside lmrc driver
2024-02-07 11:46 ` maurilio.longo
@ 2024-02-07 12:04 ` maurilio.longo
2024-02-07 12:29 ` Hans Rosenfeld
2024-02-07 13:01 ` maurilio.longo
2 siblings, 0 replies; 37+ messages in thread
From: maurilio.longo @ 2024-02-07 12:04 UTC (permalink / raw)
To: illumos-developer
Hi Peter,
> There's a newer version - 20240206 - which I think has at least one lmrc fix.
same stack trace using the 20240206 iso image.
Regards.
Maurilio
* Re: [developer] panic inside lmrc driver
2024-02-07 11:46 ` maurilio.longo
2024-02-07 12:04 ` maurilio.longo
@ 2024-02-07 12:29 ` Hans Rosenfeld
2024-02-07 13:01 ` maurilio.longo
2 siblings, 0 replies; 37+ messages in thread
From: Hans Rosenfeld @ 2024-02-07 12:29 UTC (permalink / raw)
To: illumos-developer
On Wed, Feb 07, 2024 at 06:46:58AM -0500, maurilio.longo via illumos-developer wrote:
> Hi Hans,
> here it is
> https://imgur.com/a/ZaSogO7
Thanks. Apparently no one knows what status 0x76 is, but we probably
still shouldn't panic. I've filed a bug for this:
https://www.illumos.org/issues/16241
Do you have the means to build an OmniOS ISO with the patch included to
test it? If not, I can build one for you.
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [developer] panic inside lmrc driver
2024-02-07 11:46 ` maurilio.longo
2024-02-07 12:04 ` maurilio.longo
2024-02-07 12:29 ` Hans Rosenfeld
@ 2024-02-07 13:01 ` maurilio.longo
2024-02-07 16:47 ` maurilio.longo
2 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-07 13:01 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
I'm sorry, I don't know how to build it.
In the meantime, I've tried to boot the machine with FreeBSD; it boots and can see the disks, but I'd say it uses mr_sas as the driver.
Could it be that FreeBSD's mr_sas handles this status?
In any case, if you can build me an ISO that would be great.
Thanks.
Maurilio.
* Re: [developer] panic inside lmrc driver
2024-02-07 13:01 ` maurilio.longo
@ 2024-02-07 16:47 ` maurilio.longo
2024-02-07 17:01 ` Hans Rosenfeld
2024-02-07 21:35 ` Hans Rosenfeld
0 siblings, 2 replies; 37+ messages in thread
From: maurilio.longo @ 2024-02-07 16:47 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
no need for a full ISO: I can install the latest OmniOS on a SATA disk using the onboard AHCI controller after removing the MR216i-p, upgrade the driver, then reinstall the HBA and see whether it works.
Regards.
Maurilio
* Re: [developer] panic inside lmrc driver
2024-02-07 16:47 ` maurilio.longo
@ 2024-02-07 17:01 ` Hans Rosenfeld
2024-02-07 21:35 ` Hans Rosenfeld
1 sibling, 0 replies; 37+ messages in thread
From: Hans Rosenfeld @ 2024-02-07 17:01 UTC (permalink / raw)
To: illumos-developer
On Wed, Feb 07, 2024 at 11:47:02AM -0500, maurilio.longo via illumos-developer wrote:
> Hi Hans,
> no need for a full ISO, I can install latest omniOS on a sata disk using the onboard AHCI controller after the removal of the MR216i-p, upgrade the driver and then reinstall the HBA and see if it works or not.
> Regards.
> Maurilio
Let me know when you have the install ready. I need to know the exact
'uname -v' output to build a matching lmrc module.
Or you can wait another hour or so until my OmniOS ISO build finishes...
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [developer] panic inside lmrc driver
2024-02-07 16:47 ` maurilio.longo
2024-02-07 17:01 ` Hans Rosenfeld
@ 2024-02-07 21:35 ` Hans Rosenfeld
2024-02-08 7:59 ` maurilio.longo
1 sibling, 1 reply; 37+ messages in thread
From: Hans Rosenfeld @ 2024-02-07 21:35 UTC (permalink / raw)
To: illumos-developer
Hi Maurilio,
try this iso, please: https://grumpf.hope-2000.org/r151049.iso
Hans
On Wed, Feb 07, 2024 at 11:47:02AM -0500, maurilio.longo via illumos-developer wrote:
> Hi Hans,
> no need for a full ISO, I can install latest omniOS on a sata disk using the onboard AHCI controller after the removal of the MR216i-p, upgrade the driver and then reinstall the HBA and see if it works or not.
> Regards.
> Maurilio
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [developer] panic inside lmrc driver
2024-02-07 21:35 ` Hans Rosenfeld
@ 2024-02-08 7:59 ` maurilio.longo
2024-02-08 10:42 ` Hans Rosenfeld
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-08 7:59 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
thanks a lot for the ISO; I've just booted my PC with it and it does not panic anymore.
Disks are recognized and working.
I'll be running tests in the coming days to be sure everything works as expected.
Is this driver dependent on a particular kernel, or can I use it with a non-bloody build or different distros, like hipster?
Thanks again and best regards.
Maurilio.
* Re: [developer] panic inside lmrc driver
2024-02-08 7:59 ` maurilio.longo
@ 2024-02-08 10:42 ` Hans Rosenfeld
2024-02-08 11:01 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: Hans Rosenfeld @ 2024-02-08 10:42 UTC (permalink / raw)
To: illumos-developer
On Thu, Feb 08, 2024 at 02:59:32AM -0500, maurilio.longo via illumos-developer wrote:
> thank a lot for the ISO, I've just bootstrapped my PC with it and it does not panic anymore.
> Disks are recognized and working.
> I'll be making tests in the coming days to be sure everything works as
> expected.
Thanks!
> Is this driver dependant on a particular kernel or can I use it with a
> non bloody build or different distros, like hipster?
It may work. Or it may fail in interesting ways. I'd recommend waiting
for this fix to integrate, which should happen within the next few days.
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [developer] panic inside lmrc driver
2024-02-08 10:42 ` Hans Rosenfeld
@ 2024-02-08 11:01 ` maurilio.longo
2024-02-09 8:06 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-08 11:01 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
I've moved the controller to a newer PC, an ML30 Gen10 Plus; here is the boot log, if you're interested:
https://pastebin.com/Z9gPsLwD
There are a few status 76 messages without apparent consequences.
Regards.
Maurilio.
* Re: [developer] panic inside lmrc driver
2024-02-08 11:01 ` maurilio.longo
@ 2024-02-09 8:06 ` maurilio.longo
2024-02-09 19:55 ` Hans Rosenfeld
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-09 8:06 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
I got a new panic this morning (for which I have the dump) and one yesterday, but my dump space was limited and so I lost that one.
This is the last page of ::msgbuf, and I see lmrc in the stack, so I presume it is related.
IP Filter: v4.1.9, running.
WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
NOTICE: lmrc0: Drive 00(e252/Port 1I Box 0 Bay 0) Path 300062b20dde3640 reset (Type 03)
NOTICE: bge0: bge_check_copper: link now up speed 1000 duplex 2
NOTICE: bge0 link up, 1000 Mbps, full duplex
panic[cpu1]/thread=fffffe00f4f1fc20:
BAD TRAP: type=e (#pf Page fault) rp=fffffe00f4f1f7e0 addr=0 occurred in module "scsi" due to a NULL pointer dereference
sched:
#pf Page fault
Bad kernel fault at addr=0x0
pid=0, pc=0xfffffffff3908e2d, sp=0xfffffe00f4f1f8d0, eflags=0x10246
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 3626f8<smap,smep,osxsav,pcide,vmxe,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 0
cr3: 8000000
cr8: 0
rdi: 0 rsi: 1 rdx: fffffe00f4f1fc20
rcx: 2 r8: 4c1224e968 r9: 0
rax: 0 rbx: 0 rbp: fffffe00f4f1f920
r10: d5895e41ab r11: fffffe00f4f1fc20 r12: 0
r13: fffffeb1dab77000 r14: 1 r15: fffffeb1daacc058
fsb: 0 gsb: fffffeb1d4daf000 ds: 4b
es: 4b fs: 0 gs: 1c3
trp: e err: 0 rip: fffffffff3908e2d
cs: 30 rfl: 10246 rsp: fffffe00f4f1f8d0
ss: 38
fffffe00f4f1f6f0 unix:die+c0 ()
fffffe00f4f1f7d0 unix:trap+999 ()
fffffe00f4f1f7e0 unix:cmntrap+e9 ()
fffffe00f4f1f920 scsi:scsi_tgtmap_beginf+2d ()
fffffe00f4f1f940 scsi:scsi_hba_tgtmap_set_begin+16 ()
fffffe00f4f1fa90 lmrc:lmrc_phys_update_tgtmap+40 ()
fffffe00f4f1fad0 lmrc:lmrc_get_pd_list+5a ()
fffffe00f4f1faf0 lmrc:lmrc_phys_aen_handler+4d ()
fffffe00f4f1fb50 lmrc:lmrc_aen_handler+1eb ()
fffffe00f4f1fc00 genunix:taskq_thread+2a6 ()
fffffe00f4f1fc10 unix:thread_start+b ()
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel + curproc
NOTICE: ahci0: ahci_tran_reset_dport port 5 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 6 reset port
>
> ::quit
Regards
Maurilio.
* Re: [developer] panic inside lmrc driver
2024-02-09 8:06 ` maurilio.longo
@ 2024-02-09 19:55 ` Hans Rosenfeld
2024-02-09 20:36 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: Hans Rosenfeld @ 2024-02-09 19:55 UTC (permalink / raw)
To: illumos-developer
On Fri, Feb 09, 2024 at 03:06:50AM -0500, maurilio.longo via illumos-developer wrote:
> Hi Hans,
> I got a new panic this morning (for which I have the dump) and one
> yesterday, but my dump space was limited and so I've lost it.
Can you get me the dump, provided you still have one or can get one?
Was this system running the OmniOS that I prepared for you earlier this
week?
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [developer] panic inside lmrc driver
2024-02-09 19:55 ` Hans Rosenfeld
@ 2024-02-09 20:36 ` maurilio.longo
2024-02-12 8:11 ` maurilio.longo
2024-02-13 19:23 ` Hans Rosenfeld
0 siblings, 2 replies; 37+ messages in thread
From: maurilio.longo @ 2024-02-09 20:36 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
yes, I'm running your ISO while doing tests, and here you can find the dump file:
https://mega.nz/file/cmkxBATJ#89P3wguYNEBxeeib4AZxwIKVEXR-AmTBl56C4zX8nxE
Regards.
Maurilio.
* Re: [developer] panic inside lmrc driver
2024-02-09 20:36 ` maurilio.longo
@ 2024-02-12 8:11 ` maurilio.longo
2024-02-13 19:32 ` Hans Rosenfeld
2024-02-13 19:23 ` Hans Rosenfeld
1 sibling, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-12 8:11 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
I'm sorry to report a new problem. I was running a zfs recv of a few GBs of data while, at the same time, running a dd if=/dev/zero of=/... onto my test pool (which right now has 4 disks in two mirror vdevs) when the pool stopped responding.
This is what I have in /var/adm/messages:
Feb 9 19:38:54 pg-1 ipmi: [ID 183295 kern.info] SMBIOS type 0x1, addr 0xca2
Feb 9 19:38:54 pg-1 ipmi: [ID 306142 kern.info] device rev. 3, firmware rev. 2.65, version 2.0
Feb 9 19:38:54 pg-1 ipmi: [ID 935091 kern.info] number of channels 2
Feb 9 19:38:54 pg-1 ipmi: [ID 699450 kern.info] watchdog supported
Feb 9 19:38:55 pg-1 scsi: [ID 583861 kern.info] ses0 at lmrc2: target-port w300162b20dde3640 lun 0
Feb 9 19:38:55 pg-1 genunix: [ID 936769 kern.info] ses0 is /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@p0/enclosure@w300162b20dde3>
Feb 9 19:42:05 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Fri>
Feb 9 19:51:46 pg-1 rootnex: [ID 349649 kern.info] xsvc0 at root: space 0 offset 0
Feb 9 19:51:46 pg-1 genunix: [ID 936769 kern.info] xsvc0 is /xsvc@0,0
Feb 10 08:24:08 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 08:24:16 pg-1 lmrc: [ID 998901 kern.warning] WARNING: lmrc0: AEN failed, status = 255
Feb 10 08:24:16 pg-1 lmrc: [ID 831201 kern.warning] WARNING: lmrc0: PD map sync failed, status = 255
Feb 10 08:24:17 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been su>
Feb 10 08:24:22 pg-1 lmrc: [ID 864919 kern.notice] NOTICE: lmrc0: FW is in fault state!
Feb 10 08:24:22 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 08:28:02 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Sat>
Feb 10 08:28:02 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Sat>
Feb 10 08:28:03 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Sat>
Feb 10 08:28:03 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Sat>
Feb 10 08:28:10 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 08:28:18 pg-1 lmrc: [ID 380853 kern.warning] WARNING: lmrc0: LD target map sync failed, status = 255
Feb 10 08:34:49 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 08:41:27 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 08:48:06 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 08:54:45 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 09:01:24 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 09:08:03 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 09:14:42 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 10 09:21:21 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
It was still trying to reset it this morning; iostat -indexC shows this:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device
0.0 123.3 0.0 651.6 0.0 0.0 0.0 0.3 0 4 0 0 0 0 c2
0.0 61.1 0.0 325.8 0.0 0.0 0.0 0.4 0 3 0 0 0 0 c2t3d0
0.0 62.1 0.0 325.8 0.0 0.0 0.0 0.2 0 1 0 0 0 0 c2t4d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c3
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c3t001B448B4A7140BFd0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 116 116 c4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 24 24 c4t50014EE6B2513C38d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 24 24 c4t5000CCA85EE5ECB0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 18 18 c4t5000C500AAF9B0C3d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 50 50 c4t5000C500AAF9BF2Fd0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c5
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c5tACE42E0005CFF480d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 dati
0.0 119.3 0.0 651.6 0.4 0.0 3.3 0.3 3 3 0 0 0 0 rpool
Here is the configuration of the pool where the problem occurred:
  pool: dati
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:53:43 with 0 errors on Fri Feb  9 19:03:47 2024
config:

	NAME                       STATE     READ WRITE CKSUM
	dati                       ONLINE       0     0     0
	  mirror-0                 ONLINE       0     0     0
	    c4t5000CCA85EE5ECB0d0  ONLINE       0     0     0
	    c4t50014EE6B2513C38d0  ONLINE       0     0     0
	  mirror-2                 ONLINE       0     0     0
	    c4t5000C500AAF9B0C3d0  ONLINE       0     0     0
	    c4t5000C500AAF9BF2Fd0  ONLINE       0     0     0
	logs
	  c3t001B448B4A7140BFd0s0  ONLINE       0     0     0

errors: No known data errors
So I forced a reboot with a dump, which you can find here:
https://mega.nz/file/YqczlQbS#XJ7q0-NIDezq3czIu3qyqVdEs8JA7aM2uAicEUEjX0E
Best regards.
Maurilio
* Re: [developer] panic inside lmrc driver
2024-02-09 20:36 ` maurilio.longo
2024-02-12 8:11 ` maurilio.longo
@ 2024-02-13 19:23 ` Hans Rosenfeld
1 sibling, 0 replies; 37+ messages in thread
From: Hans Rosenfeld @ 2024-02-13 19:23 UTC (permalink / raw)
To: illumos-developer
On Fri, Feb 09, 2024 at 03:36:56PM -0500, maurilio.longo via illumos-developer wrote:
> Hi Hans,
> yes I'm running your ISO while doing tests and here you can find the dump file
>
> https://mega.nz/file/cmkxBATJ#89P3wguYNEBxeeib4AZxwIKVEXR-AmTBl56C4zX8nxE
Thanks! I've filed a bug for this: https://www.illumos.org/issues/16277
Can you reproduce this easily? I can build another ISO for testing if
you like.
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [developer] panic inside lmrc driver
2024-02-12 8:11 ` maurilio.longo
@ 2024-02-13 19:32 ` Hans Rosenfeld
2024-02-13 20:55 ` maurilio.longo
2024-02-13 21:03 ` maurilio.longo
0 siblings, 2 replies; 37+ messages in thread
From: Hans Rosenfeld @ 2024-02-13 19:32 UTC (permalink / raw)
To: developer
On Mon, Feb 12, 2024 at 03:11:52AM -0500, maurilio.longo via illumos-developer wrote:
> Hi Hans,
> I'm sorry to report that I got a new problem, I was executing a zfs recv of a few GBs of data while, at the same time, executing a dd if=/dev/zero of=/... onto my test pool, which right now has 4 disks in two mirror vdevs, when the pool stopped responding.
>
> In /var/adm/messages this is what I have
>
> Feb 9 19:38:54 pg-1 ipmi: [ID 183295 kern.info] SMBIOS type 0x1, addr 0xca2
> Feb 9 19:38:54 pg-1 ipmi: [ID 306142 kern.info] device rev. 3, firmware rev. 2.65, version 2.0
> Feb 9 19:38:54 pg-1 ipmi: [ID 935091 kern.info] number of channels 2
> Feb 9 19:38:54 pg-1 ipmi: [ID 699450 kern.info] watchdog supported
> Feb 9 19:38:55 pg-1 scsi: [ID 583861 kern.info] ses0 at lmrc2: target-port w300162b20dde3640 lun 0
> Feb 9 19:38:55 pg-1 genunix: [ID 936769 kern.info] ses0 is /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@p0/enclosure@w300162b20dde3>
> Feb 9 19:42:05 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Fri>
> Feb 9 19:51:46 pg-1 rootnex: [ID 349649 kern.info] xsvc0 at root: space 0 offset 0
> Feb 9 19:51:46 pg-1 genunix: [ID 936769 kern.info] xsvc0 is /xsvc@0,0
> Feb 10 08:24:08 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 08:24:16 pg-1 lmrc: [ID 998901 kern.warning] WARNING: lmrc0: AEN failed, status = 255
> Feb 10 08:24:16 pg-1 lmrc: [ID 831201 kern.warning] WARNING: lmrc0: PD map sync failed, status = 255
> Feb 10 08:24:17 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been su>
> Feb 10 08:24:22 pg-1 lmrc: [ID 864919 kern.notice] NOTICE: lmrc0: FW is in fault state!
> Feb 10 08:24:22 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 08:28:02 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Sat>
> Feb 10 08:28:02 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Sat>
> Feb 10 08:28:03 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Sat>
> Feb 10 08:28:03 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Sat>
> Feb 10 08:28:10 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 08:28:18 pg-1 lmrc: [ID 380853 kern.warning] WARNING: lmrc0: LD target map sync failed, status = 255
> Feb 10 08:34:49 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 08:41:27 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 08:48:06 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 08:54:45 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 09:01:24 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 09:08:03 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 09:14:42 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
> Feb 10 09:21:21 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
This looks vaguely similar to what we've seen on DELL H755 controllers,
where the controller just drops dead after a while, needing a full power
cycle to come back to life. (See https://www.illumos.org/issues/15935)
If you just reset the system (not power cycling), does it come up
correctly again after this?
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [developer] panic inside lmrc driver
2024-02-13 19:32 ` Hans Rosenfeld
@ 2024-02-13 20:55 ` maurilio.longo
2024-02-13 21:03 ` maurilio.longo
1 sibling, 0 replies; 37+ messages in thread
From: maurilio.longo @ 2024-02-13 20:55 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
> Thanks! I've filed bug for this: https://www.illumos.org/issues/16277
regarding your bug report: I had just plugged in a new disk before expanding my pool from two disks to four, and I did this as soon as the system reached the login: prompt after a reboot.
When I inserted the second one, nothing happened.
Thursday I'll be able to try to add a new disk right after a reboot to see if I can cause the problem again.
Btw,
> char [128] evt_descr = [ "Inserted: Drive 02(e252/Port 1I Box 0 Bay 0)" ]
My unit has a single SFF disk cage which can hold 8 disks. I inserted mine into slot 5 or 6, but as you can see it shows Bay 0, which is wrong; the cage has disks numbered from 1 to 8, left to right.
This is the cage: https://www.servershop24.de/en/hpe-sff-gen9-gen10-cage/a-117573/
Thanks, I'll let you know if I can break it again ;-)
Maurilio.
* Re: [developer] panic inside lmrc driver
2024-02-13 19:32 ` Hans Rosenfeld
2024-02-13 20:55 ` maurilio.longo
@ 2024-02-13 21:03 ` maurilio.longo
2024-02-15 6:50 ` maurilio.longo
1 sibling, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-13 21:03 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
> This looks vaguely similar to what we've seen on DELL H755 controllers,
I think I just issued a reboot from the remote session, and it rebooted without problems.
Tomorrow I'm away, so the system, which is powered on, will be idle most of the time; let's see if it happens again, in which case I'll just reset it, meaning [Ctrl][Alt][Del] from the console.
I've also changed a couple of things since the other day: disabled iLO and changed the power profile; see my other thread on ACPI parsing errors.
Best regards and thanks for your help.
Maurilio.
* Re: [developer] panic inside lmrc driver
2024-02-13 21:03 ` maurilio.longo
@ 2024-02-15 6:50 ` maurilio.longo
2024-02-15 7:38 ` Carsten Grzemba
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-15 6:50 UTC (permalink / raw)
To: illumos-developer
Hi Hans,
this morning I found the unit completely frozen: no keyboard input, no network access.
I had to power cycle it.
There is nothing in /var/adm/messages apart from a reboot yesterday, and no new crash dump in /var/crash.
Regards.
Maurilio
* Re: [developer] panic inside lmrc driver
2024-02-15 6:50 ` maurilio.longo
@ 2024-02-15 7:38 ` Carsten Grzemba
2024-02-15 8:11 ` Toomas Soome
0 siblings, 1 reply; 37+ messages in thread
From: Carsten Grzemba @ 2024-02-15 7:38 UTC (permalink / raw)
To: illumos-developer
If it is again the case, try to force a crash dump on shutdown. E.g., add to /etc/system:
set pcplusmp:apic_panic_on_nmi = 1
Then, if the system is frozen:
$ ipmitool -Ilanplus -U idrac-user -P password -H idrac-ip power diag
Note that you probably won't have any luck with a crash dump if the dump device is controlled by lmrc (if lmrc is the reason for the frozen system).
long text:
https://illumos.org/docs/user-guide/debug-systems/#gathering-information-from-a-running-system-using-only-nmi-x86
* Re: [developer] panic inside lmrc driver
2024-02-15 7:38 ` Carsten Grzemba
@ 2024-02-15 8:11 ` Toomas Soome
2024-02-15 9:44 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: Toomas Soome @ 2024-02-15 8:11 UTC (permalink / raw)
To: illumos-developer
> On 15. Feb 2024, at 09:38, Carsten Grzemba via illumos-developer <developer@lists.illumos.org> wrote:
>
> If it again the case try to force a crash dump on shutdown, e.g.
> add /etc/system:
>
> set pcplusmp:apic_panic_on_nmi = 1
>
> then if the system is frozen
> $ ipmitool -Ilanplus -U idrac-user -P password -H idrac-ip power diag
>
> Note that you probably won't have any luck with a crash dump if the dump device is controlled by lmrc (if lmrc is the reason for frozen system)
>
> long text:
> https://illumos.org/docs/user-guide/debug-systems/#gathering-information-from-a-running-system-using-only-nmi-x86
In such a case it is a good idea to boot as:
ok set nmi=kmdb
ok boot -k
or
ok boot -kd
This way your NMI will get you into kmdb.
And yes, we should document the ‘nmi’ property; it can have the values ‘ignore’, ‘panic’ and ‘kmdb’, see https://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/mlsetup.c?r=d32f26ee#159
rgds,
toomas
* Re: [developer] panic inside lmrc driver
2024-02-15 8:11 ` Toomas Soome
@ 2024-02-15 9:44 ` maurilio.longo
2024-02-16 14:36 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-15 9:44 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 389 bytes --]
Hi Carsten and Toomas,
the
set pcplusmp:apic_panic_on_nmi = 1
is already present in /etc/system.d/_omnios:system:defaults, and there I've added
set snooping=1
which should enable the deadman timer.
If this is not enough I'll re-enable ILO (iDrac on HPE systems) and try with the ipmitool as per your suggestion.
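Put together, the debugging-related settings discussed in this thread would look like this in /etc/system; a sketch of what was described above, not a recommendation (the deadman timer will deliberately panic the box if the clock stops advancing):

```
* Sketch of /etc/system settings discussed in this thread.
* Panic (and take a crash dump) when an NMI arrives:
set pcplusmp:apic_panic_on_nmi = 1
* Enable the deadman timer:
set snooping = 1
```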
Thanks to both, I'll keep you posted.
Maurilio.
[-- Attachment #2: Type: text/html, Size: 648 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-15 9:44 ` maurilio.longo
@ 2024-02-16 14:36 ` maurilio.longo
2024-02-16 15:20 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-16 14:36 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 5175 bytes --]
Hi Hans,
the system just rebooted by itself; no crash dump was created, but now I think I'm in this situation
> This looks vaguely similar to what we've seen on DELL H755 controllers,
> where the controller just drops dead after a while, needing a full power
> cycle to come back to life. (See https://www.illumos.org/issues/15935)
because of this:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 16 13:38:31 ba948ca4-681c-4d79-981c-6da3ca6e6b05 PCIEX-8000-DJ Major
Host : pg-1
Platform : ProLiant-ML30-Gen10-Plus Chassis_id : CZJ2360PLT
Product_sn :
Fault class : fault.io.pciex.device-noresp 40%
fault.io.pciex.device-interr 40%
fault.io.pciex.bus-noresp 20%
Affects : dev:////pci@0,0/pci8086,43b8@1c/pci1590,32b@0
faulted and taken out of service
FRU : "PCI-E Slot 4" (hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0)
faulty
Description : A problem has been detected on one of the specified devices or on
one of the specified connecting buses.
Refer to http://illumos.org/msg/PCIEX-8000-DJ for more
information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : If a plug-in card is involved check for badly-seated cards or
bent pins. Otherwise schedule a repair procedure to replace the
affected device(s). Use fmadm faulty to identify the devices or
contact your illumos distribution team for support.
PCI-E Slot 4 is where the controller is located.
zpool status shows the pool as ONLINE, but it is not:
zpool status
pool: dati
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://illumos.org/msg/ZFS-8000-HC
scan: scrub repaired 0 in 0 days 01:07:00 with 0 errors on Fri Feb 16 12:30:01 2024
config:
NAME STATE READ WRITE CKSUM
dati ONLINE 0 39 0
mirror-0 ONLINE 0 45 0
c4t5000CCA85EE5ECB0d0 ONLINE 0 49 0
c4t50014EE6B2513C38d0 ONLINE 0 49 0
mirror-2 ONLINE 0 73 0
c4t5000C500AAF9B0C3d0 ONLINE 0 83 0
c4t5000C500AAF9BF2Fd0 ONLINE 0 83 0
logs
c3t001B448B4A7140BFd0s0 ONLINE 0 0 0
cache
c3t001B448B4A7140BFd0s1 ONLINE 0 0 0
errors: 20 data errors, use '-v' for a list
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
dati 928G 80.1G 848G - - 12% 8% 1.00x ONLINE -
rpool 222G 138G 83.5G - - 0% 62% 1.00x ONLINE -
With format I don't see the vdevs the 'dati' pool is built upon.
format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c2t3d0 <SanDisk-SSD PLUS 240GB-UF8704RL-223.57GB>
/pci@0,0/pci1590,28d@17/disk@3,0
1. c2t4d0 <WDC- WDS100T1R0A-68A4W0-411010WR-931.51GB>
/pci@0,0/pci1590,28d@17/disk@4,0
2. c3t001B448B4A7140BFd0 <WD_BLACK-SN770 500GB-731100WD cyl 38932 alt 0 hd 224 sec 112>
/pci@0,0/pci8086,43c4@1b,4/pci15b7,5017@0/blkdev@w001B448B4A7140BF,0
3. c5tACE42E0005CFF480d0 <NVMe-VS000480KXALB-85030G00-447.13GB>
/pci@0,0/pci8086,43b0@1d/pci1c5c,2f3@0/blkdev@wACE42E0005CFF480,0
Specify disk (enter its number): ^C
zpool status -v dati gives this error, probably because it can't read from the pool
errors: List of errors unavailable (insufficient privileges)
The last line of dmesg reads
Feb 16 13:38:43 pg-1 genunix: [ID 390243 kern.info] Creating /etc/devices/retire_store
which contains the ID for the controller.
I'll power it off to see if upon restart it goes back to a working state.
Regards.
Maurilio
[-- Attachment #2: Type: text/html, Size: 22699 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-16 14:36 ` maurilio.longo
@ 2024-02-16 15:20 ` maurilio.longo
2024-02-16 15:27 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-16 15:20 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 4614 bytes --]
The system reboots; I can see all the disks spinning, and on the console appears
reading ZFS configuration
mounting filesystems (n/m)
Then a new
Creating /etc/devices/retire_store
appears in /var/adm/messages, the pool is marked as unavailable, and the disks become invisible to format.
I've power-cycled the system several times now without any change; it seems the controller cannot be used anymore.
modinfo | grep lmrc
194 fffffffff3fb1000 a388 123 1 lmrc (Broadcom MegaRAID 12G SAS RAID)
During boot the disks seem to be OK:
dmesg | grep lmrc
Feb 16 15:57:20 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:21 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:21 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:21 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:21 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:21 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:21 pg-1 scsi: [ID 583861 kern.info] sd3 at lmrc1: target-port 5000cca85ee5ecb0 lun 0
Feb 16 15:57:23 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:23 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:23 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:23 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:23 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:23 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:24 pg-1 scsi: [ID 583861 kern.info] sd2 at lmrc1: target-port 50014ee6b2513c38 lun 0
Feb 16 15:57:25 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:25 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:25 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:25 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:25 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:26 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:26 pg-1 scsi: [ID 583861 kern.info] sd4 at lmrc1: target-port 5000c500aaf9b0c3 lun 0
Feb 16 15:57:27 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:27 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:27 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:27 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:28 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:28 pg-1 lmrc: [ID 934365 kern.warning] WARNING: lmrc0: command failed, status = 76, ex_status = 0, cdb[0] = 1b
Feb 16 15:57:28 pg-1 scsi: [ID 583861 kern.info] sd5 at lmrc1: target-port 5000c500aaf9bf2f lun 0
Feb 16 15:57:33 pg-1 genunix: [ID 408114 kern.info] /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@p0 (lmrc2) online
Feb 16 15:57:35 pg-1 scsi: [ID 583861 kern.info] ses0 at lmrc2: target-port w300162b20dde3640 lun 0
and
dmesg | grep sd4
Feb 16 15:57:26 pg-1 scsi: [ID 583861 kern.info] sd4 at lmrc1: target-port 5000c500aaf9b0c3 lun 0
Feb 16 15:57:26 pg-1 genunix: [ID 936769 kern.info] sd4 is /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@5000c500aaf9b0c3,0
Feb 16 15:57:27 pg-1 genunix: [ID 408114 kern.info] /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@5000c500aaf9b0c3,0 (sd4) online
Maurilio.
[-- Attachment #2: Type: text/html, Size: 11872 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-16 15:20 ` maurilio.longo
@ 2024-02-16 15:27 ` maurilio.longo
2024-02-16 15:42 ` Robert Mustacchi
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-16 15:27 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 135 bytes --]
fmadm faulty -f
fmadm repaired "PCI-E Slot 4"
Fixed the issue with invisible disks... sorry for the noise.
Regards.
Maurilio
[-- Attachment #2: Type: text/html, Size: 273 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-16 15:27 ` maurilio.longo
@ 2024-02-16 15:42 ` Robert Mustacchi
2024-02-16 20:41 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: Robert Mustacchi @ 2024-02-16 15:42 UTC (permalink / raw)
To: illumos-developer
Hi Maurilio,
On 2/16/24 07:27, maurilio.longo via illumos-developer wrote:
> fmadm faulty -f
> fmadm repaired "PCI-E Slot 4"
>
> Fixed the issue with invisible disks... sorry for the noise.
Having this happen suggests that the PCIe controller or its
corresponding root port observed AERs (PCIe's error reporting
mechanism). If you look at fmdump -e, do you have entries from around
the time the controller was originally retired?
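(For a bounded look, fmdump(8) accepts a time window via -t/-T; a sketch, where the timestamps are placeholders for the incident window and the exact accepted time formats are described in the man page:)

```
# List error-log entries within a time window (placeholder timestamps):
fmdump -e -t "02/16/24 13:30:00" -T "02/16/24 13:45:00"
```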
Robert
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-16 15:42 ` Robert Mustacchi
@ 2024-02-16 20:41 ` maurilio.longo
2024-02-19 21:14 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-16 20:41 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 4042 bytes --]
Hi Robert,
yes, here is fmdump -e from around 1:30 PM:
Feb 16 13:31:24.4290 ereport.io.pci.fabric
Feb 16 13:31:24.4290 ereport.io.pciex.a-nonfatal
Feb 16 13:31:24.4290 ereport.io.pciex.tl.cto
Feb 16 13:31:24.4290 ereport.io.pciex.rc.ce-msg
Feb 16 13:38:31.5763 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5764 ereport.fs.zfs.data
Feb 16 13:38:31.5763 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5765 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5766 ereport.fs.zfs.data
Feb 16 13:38:31.5771 ereport.fs.zfs.io_failure
Feb 16 15:11:59.2352 ereport.fs.zfs.io
Feb 16 15:11:59.2352 ereport.fs.zfs.io
Feb 16 15:11:59.2352 ereport.fs.zfs.io
Feb 16 15:11:59.2352 ereport.fs.zfs.io
and fmdump -v
Feb 16 13:38:31.5102 ba948ca4-681c-4d79-981c-6da3ca6e6b05 PCIEX-8000-DJ Diagnosed
40% fault.io.pciex.device-noresp
Problem in: hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0/pciexfn=0
Affects: dev:////pci@0,0/pci8086,43b8@1c/pci1590,32b@0
FRU: hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0
Location: PCI-E Slot 4
40% fault.io.pciex.device-interr
Problem in: hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0/pciexfn=0
Affects: dev:////pci@0,0/pci8086,43b8@1c/pci1590,32b@0
FRU: hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0
Location: PCI-E Slot 4
20% fault.io.pciex.bus-noresp
Problem in: hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0/pciexfn=0
Affects: dev:////pci@0,0/pci8086,43b8@1c/pci1590,32b@0
FRU: hc://:product-id=ProLiant-ML30-Gen10-Plus:server-id=pg-1:chassis-id=CZJ2360PLT/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=3/pciexdev=0
Location: PCI-E Slot 4
Feb 16 13:38:32.3326 447a2843-776e-4e39-a509-25817d74bf2d ZFS-8000-HC Diagnosed
100% fault.fs.zfs.io_failure_wait
Problem in: zfs://pool=dati
Affects: zfs://pool=dati
FRU: -
Location: -
Regards.
Maurilio.
[-- Attachment #2: Type: text/html, Size: 22867 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-16 20:41 ` maurilio.longo
@ 2024-02-19 21:14 ` maurilio.longo
2024-02-20 8:58 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-19 21:14 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 4015 bytes --]
Hi all,
a new problem today, similar to the previous one, with loss of access to the disks in the 'dati' pool.
Feb 19 15:39:41 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 19 15:39:42 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 19 15:39:42 pg-1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Feb 19 15:39:42 pg-1 lmrc: [ID 383856 kern.warning] WARNING: lmrc0: reset failed
Feb 19 15:39:42 pg-1 lmrc: [ID 998901 kern.warning] WARNING: lmrc0: AEN failed, status = 255
Feb 19 15:39:42 pg-1 lmrc: [ID 380853 kern.warning] WARNING: lmrc0: LD target map sync failed, status = 255
Feb 19 15:39:42 pg-1 lmrc: [ID 831201 kern.warning] WARNING: lmrc0: PD map sync failed, status = 255
Feb 19 15:39:42 pg-1 lmrc: [ID 383856 kern.warning] WARNING: lmrc0: reset failed
Feb 19 15:39:43 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been sus>
Feb 19 15:39:43 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been sus>
Feb 19 15:39:43 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been sus>
Feb 19 15:39:43 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been sus>
Feb 19 15:39:43 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been sus>
Feb 19 15:39:43 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been sus>
Feb 19 15:39:43 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been sus>
Feb 19 15:39:43 pg-1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dati' has encountered an uncorrectable I/O failure and has been sus>
Feb 19 15:40:10 pg-1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@5000cca85ee5ecb0,0 >
Feb 19 15:40:10 pg-1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@50014ee6b2513c38,0 >
Feb 19 15:40:10 pg-1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@5000c500aaf9b0c3,0 >
Feb 19 15:40:10 pg-1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@5000c500aaf9bf2f,0 >
Feb 19 15:40:10 pg-1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@5000cca85ee5ecb0,0 >
Feb 19 15:40:10 pg-1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@50014ee6b2513c38,0 >
Feb 19 15:40:10 pg-1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@5000c500aaf9b0c3,0 >
Feb 19 15:40:10 pg-1 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,43b8@1c/pci1590,32b@0/iport@v0/disk@5000c500aaf9bf2f,0 >
Feb 19 15:40:46 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Mon >
Feb 19 15:40:46 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Mon >
Feb 19 15:40:47 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Mon >
Feb 19 15:40:47 pg-1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major#012EVENT-TIME: Mon >
Feb 19 15:42:44 pg-1 unix: [ID 836849 kern.notice] #012#015panic[cpu0]/thread=fffffeb1f5b1e100:
Feb 19 15:42:44 pg-1 genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Feb 19 15:42:44 pg-1 unix: [ID 100000 kern.notice] #012
I've forced a kernel dump which can be seen here.
https://mega.nz/file/ojVHjQyI#63qlDThAL3FvM4pB04QmkjraMh4hoKXmaN5aHyC3-UU
Today, when the problem arose, I had just iozone running "alone".
Regards.
Maurilio.
[-- Attachment #2: Type: text/html, Size: 12578 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-19 21:14 ` maurilio.longo
@ 2024-02-20 8:58 ` maurilio.longo
2024-02-22 17:44 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-20 8:58 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 1692 bytes --]
After googling around about my problems I made two changes: the first was to upgrade the controller firmware from 52.16.3-3913, dated April 2021, to the latest one, 52.26.3-5250_A, dated December 2023.
Then, given several reports of HPE Gen10 units rebooting unexpectedly, all suggesting changing the workload profile to max performance, I made that change in the BIOS, losing SpeedStep, as can be seen in /var/adm/messages.
Feb 20 09:20:33 pg-1 unix: [ID 950921 kern.info] cpu1: x86 (chipid 0x0 GenuineIntel A0671 family 6 model 167 step 1 clock 2807 MHz)
Feb 20 09:20:33 pg-1 unix: [ID 950921 kern.info] cpu1: Intel(r) Xeon(r) E-2314 CPU @ 2.80GHz
Feb 20 09:20:33 pg-1 unix: [ID 557947 kern.info] cpu1 initialization complete - online
Feb 20 09:20:33 pg-1 unix: [ID 977644 kern.info] NOTICE: cpu_acpi: _TSS package bad count 1 for CPU 2.
Feb 20 09:20:33 pg-1 unix: [ID 340435 kern.info] NOTICE: Support for CPU throttling is being disabled due to errors parsing ACPI T-state objects exported by BIOS.
Feb 20 09:20:33 pg-1 unix: [ID 950921 kern.info] cpu2: x86 (chipid 0x0 GenuineIntel A0671 family 6 model 167 step 1 clock 2807 MHz)
Feb 20 09:20:33 pg-1 unix: [ID 950921 kern.info] cpu2: Intel(r) Xeon(r) E-2314 CPU @ 2.80GHz
Feb 20 09:20:33 pg-1 unix: [ID 557947 kern.info] cpu2 initialization complete - online
Feb 20 09:20:33 pg-1 unix: [ID 977644 kern.info] NOTICE: cpu_acpi: _TSS package bad count 1 for CPU 3.
Feb 20 09:20:33 pg-1 unix: [ID 340435 kern.info] NOTICE: Support for CPU throttling is being disabled due to errors parsing ACPI T-state objects exported by BIOS.
I've restarted iozone; let's see whether something improves.
Regards.
Maurilio.
[-- Attachment #2: Type: text/html, Size: 4554 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-20 8:58 ` maurilio.longo
@ 2024-02-22 17:44 ` maurilio.longo
2024-02-26 8:59 ` maurilio.longo
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-22 17:44 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 5446 bytes --]
So, after upgrading the controller's firmware I rebooted the computer and restarted my tests, which ended a few hours later with the disks in a state similar to this:
extended device statistics ---- errors ---
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device
0.0 3.0 0.0 0.1 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c2
0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c2t3d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c2t4d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c3
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c3t001B448B4A7140BFd0
0.0 0.0 0.0 0.0 0.0 13.0 0.0 0.0 0 400 0 0 0 0 c4
0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0 100 0 0 0 0 c4t50014EE6B2513C38d0
0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0 100 0 0 0 0 c4t5000CCA85EE5ECB0d0
0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0 100 0 0 0 0 c4t5000C500AAF9B0C3d0
0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0 100 0 0 0 0 c4t5000C500AAF9BF2Fd0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c5
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c5tACE42E0005CFF480d0
0.0 0.0 0.0 0.0 246.0 13.0 0.0 0.0 100 100 0 0 0 0 dati
0.0 0.0 0.0 0.0 0.2 0.1 0.0 0.0 2 2 0 0 0 0 rpool
extended device statistics ---- errors ---
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device
0.0 15.0 0.0 0.4 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c2
0.0 7.0 0.0 0.2 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c2t3d0
0.0 8.0 0.0 0.2 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c2t4d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c3
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c3t001B448B4A7140BFd0
0.0 0.0 0.0 0.0 0.0 13.0 0.0 0.0 0 400 0 0 0 0 c4
0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0 100 0 0 0 0 c4t50014EE6B2513C38d0
0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0 100 0 0 0 0 c4t5000CCA85EE5ECB0d0
0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0 100 0 0 0 0 c4t5000C500AAF9B0C3d0
0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0 100 0 0 0 0 c4t5000C500AAF9BF2Fd0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c5
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c5tACE42E0005CFF480d0
0.0 0.0 0.0 0.0 246.0 13.0 0.0 0.0 100 100 0 0 0 0 dati
0.0 0.0 0.0 0.0 0.3 0.2 0.0 0.0 4 5 0 0 0 0 rpool
So I powered off the unit and restarted it, but this time I executed "mdb -K" followed by ":c" on the console, to have the kernel debugger ready for the next lockup.
Instead my unit spent the next two days running my tests (iozone + zfs recv + zpool scrub) without any problem.
Since I needed a crash dump during the lockup, I re-enabled the onboard iLO 5 to be able to generate an NMI, started my tests again (without the mdb -K part), and went on with my other chores.
Well, it took less than three hours for the disks to become stuck again, just like above, but this time I generated an NMI and I have a crash dump here:
https://mega.nz/file/16kShQ6Y#nAv0tLbIvydBy6uaEX87d1VXMD-NIeLkg4GgvDnMatY
::msgbuf ends with this lmrc warning/error
NOTICE: lmrc0: Drive 00(e252/Port 1I Box 0 Bay 0) Path 300062b20dde3640 reset (Type 03)
NOTICE: lmrc0: unknown AEN received, seqnum = 19954, timestamp = 761927783, code = 27f, locale = 2, class = 0, argtype = 10
NOTICE: lmrc0: Drive 00(e252/Port 1I Box 0 Bay 0) link speed changed
panic[cpu0]/thread=fffffe00f3e05c20:
NMI received
I hope this can shed some light on the problem, because it seems that running with the kernel debugger active alters something (timings?) just enough to make the system (a lot more) solid.
Only very seldom in the past two weeks have I been able to run my tests for 48 straight hours without issues.
Best regards
Maurilio.
[-- Attachment #2: Type: text/html, Size: 16823 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-22 17:44 ` maurilio.longo
@ 2024-02-26 8:59 ` maurilio.longo
2024-02-26 9:30 ` Carsten Grzemba
0 siblings, 1 reply; 37+ messages in thread
From: maurilio.longo @ 2024-02-26 8:59 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 548 bytes --]
Just for the record: after nearly a month of tests, last Friday I decided I'd had enough and replaced the MR216i-p with an LSI controller:
prtdiag -v | grep -i lsi
4 in use PCI Exp. Gen 3 x8 PCI-E Slot 4, Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (mpt_sas)
The system has been working ok since then, with all the tests running, no lockups, no timeouts etc.
Sadly, the lmrc driver, while promising, is not ready yet.
Thanks to all who helped me and gave advice and to Hans and all those who wrote lmrc.
Regards.
Maurilio.
[-- Attachment #2: Type: text/html, Size: 1312 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-26 8:59 ` maurilio.longo
@ 2024-02-26 9:30 ` Carsten Grzemba
2024-08-26 12:30 ` manipm
0 siblings, 1 reply; 37+ messages in thread
From: Carsten Grzemba @ 2024-02-26 9:30 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 267 bytes --]
I hope you're not generally right. We have been using the driver for 3 months (Dell R650xs with PERC H355, rpool on 2 NVMe disks). We only had a problem once: the HBA was changed and the firmware was updated. Since then there has been no more trouble.
[-- Attachment #2: Type: text/html, Size: 1083 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-02-26 9:30 ` Carsten Grzemba
@ 2024-08-26 12:30 ` manipm
2024-08-26 15:23 ` Hans Rosenfeld
0 siblings, 1 reply; 37+ messages in thread
From: manipm @ 2024-08-26 12:30 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 1832 bytes --]
I am also facing an lmrc driver issue. I recently purchased a Dell R750xs with a PERC H755 (in passthrough mode) and 5 disks.
I am running the latest Dell BIOS. Can someone help me with this?
Aug 26 05:17:43 server1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Aug 26 05:17:43 server1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Aug 26 05:17:44 server1 lmrc: [ID 408335 kern.warning] WARNING: lmrc0: resetting...
Aug 26 05:17:44 server1 lmrc: [ID 383856 kern.warning] WARNING: lmrc0: reset failed
Aug 26 05:17:44 server1 lmrc: [ID 998901 kern.warning] WARNING: lmrc0: AEN failed, status = 255
Aug 26 05:17:44 server1 lmrc: [ID 380853 kern.warning] WARNING: lmrc0: LD target map sync failed, status = 255
Aug 26 05:17:44 server1 lmrc: [ID 831201 kern.warning] WARNING: lmrc0: PD map sync failed, status = 255
Aug 26 05:17:44 server1 lmrc: [ID 383856 kern.warning] WARNING: lmrc0: reset failed
Aug 26 05:17:44 server1 zfs: [ID 961531 kern.warning] WARNING: Pool 'dpool' has encountered an uncorrectable I/O failure and has been suspended; `zpool clear` will be required before the pool can be written to.
Aug 26 05:17:59 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@bc,0/pci8086,347c@4/pci1028,1ae1@0/iport@v0/disk@5000c500f8162e4f,0 (sd2): drive offline
Aug 26 05:18:00 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@bc,0/pci8086,347c@4/pci1028,1ae1@0/iport@v0/disk@5000c500f815d253,0 (sd6): drive offline
Aug 26 05:18:00 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@bc,0/pci8086,347c@4/pci1028,1ae1@0/iport@v0/disk@5000c500f815ba63,0 (sd5): drive offline
Aug 26 05:18:00 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@bc,0/pci8086,347c@4/pci1028,1ae1@0/iport@v0/disk@5000c500f815e003,0 (sd3): drive offline
[-- Attachment #2: Type: text/html, Size: 2087 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [developer] panic inside lmrc driver
2024-08-26 12:30 ` manipm
@ 2024-08-26 15:23 ` Hans Rosenfeld
0 siblings, 0 replies; 37+ messages in thread
From: Hans Rosenfeld @ 2024-08-26 15:23 UTC (permalink / raw)
To: developer
On Mon, Aug 26, 2024 at 08:30:47AM -0400, manipm via illumos-developer wrote:
> I am also facing lmrc driver issue. I recently purchased DELL R750xs with PERC H755(in passthrough mode) consisting of 5disk.
>
> I am running latest DELL BIOS. Can someone help me on this.
I'm sorry that you ran into that problem.
https://www.illumos.org/issues/15935
This is a known problem with the DELL PERC H755 which already occurred
during the late stages of the driver development. The HBA firmware
pretty much drops dead after a while for no apparent reason, whether the
HBA is under load or completely idle. After a warm reset, the UEFI firmware
complains that it doesn't even see the device on the bus, and a full
power cycle is required to get the controller going again.
We've tried to root-cause this with the help from DELL and Broadcom, but
they didn't ever tell us what really happened to their firmware, nor
what our driver did to cause this. (That is not to say that our driver
did anything wrong in particular, nor that anything a driver could
possibly do should ever be considered an excuse for the HBA firmware to
behave that way.)
That being said, the PERC H355 worked flawlessly with lmrc, and so did
HBAs from other vendors such as Intel.
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2024-08-26 15:23 UTC | newest]
Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-07 10:31 panic inside lmrc driver maurilio.longo
2024-02-07 11:14 ` [developer] " Peter Tribble
2024-02-07 11:16 ` Hans Rosenfeld
2024-02-07 11:46 ` maurilio.longo
2024-02-07 12:04 ` maurilio.longo
2024-02-07 12:29 ` Hans Rosenfeld
2024-02-07 13:01 ` maurilio.longo
2024-02-07 16:47 ` maurilio.longo
2024-02-07 17:01 ` Hans Rosenfeld
2024-02-07 21:35 ` Hans Rosenfeld
2024-02-08 7:59 ` maurilio.longo
2024-02-08 10:42 ` Hans Rosenfeld
2024-02-08 11:01 ` maurilio.longo
2024-02-09 8:06 ` maurilio.longo
2024-02-09 19:55 ` Hans Rosenfeld
2024-02-09 20:36 ` maurilio.longo
2024-02-12 8:11 ` maurilio.longo
2024-02-13 19:32 ` Hans Rosenfeld
2024-02-13 20:55 ` maurilio.longo
2024-02-13 21:03 ` maurilio.longo
2024-02-15 6:50 ` maurilio.longo
2024-02-15 7:38 ` Carsten Grzemba
2024-02-15 8:11 ` Toomas Soome
2024-02-15 9:44 ` maurilio.longo
2024-02-16 14:36 ` maurilio.longo
2024-02-16 15:20 ` maurilio.longo
2024-02-16 15:27 ` maurilio.longo
2024-02-16 15:42 ` Robert Mustacchi
2024-02-16 20:41 ` maurilio.longo
2024-02-19 21:14 ` maurilio.longo
2024-02-20 8:58 ` maurilio.longo
2024-02-22 17:44 ` maurilio.longo
2024-02-26 8:59 ` maurilio.longo
2024-02-26 9:30 ` Carsten Grzemba
2024-08-26 12:30 ` manipm
2024-08-26 15:23 ` Hans Rosenfeld
2024-02-13 19:23 ` Hans Rosenfeld
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).