* Intel NUC (Gen 11) panic during pcieadm illumos testing
@ 2024-02-22 16:08 Dan McDonald
2024-02-22 18:16 ` [developer] " Dan McDonald
2024-02-22 18:47 ` Robert Mustacchi
0 siblings, 2 replies; 14+ messages in thread
From: Dan McDonald @ 2024-02-22 16:08 UTC (permalink / raw)
To: illumos-developer
TL;DR: I can panic my NUC 11 seemingly at will IF AND ONLY IF I use `pcieadm save-cfgspace` to a directory in /tmp.
...
To help out in recent igc(4D) bringups, I acquired a NUC 11.
For giggles, I ran the non-ZFS tests on it with this week's SmartOS release (with the addition of igc(4D) of course). I got this crash:
> ::status
debugging crash dump vmcore.0 (64-bit) from nuc
operating system: 5.11 joyent_20240221T162850Z (i86pc)
git branch: igc
git rev: 01394a6ab5cbb41d518fdc5af9ab0948844923d6
image uuid: (not set)
panic message: pcieb-2: PCI(-X) Express Fatal Error. (0x43)
dump content: kernel pages only
> $C
fffffe0079d0cae0 vpanic()
fffffe0079d0cb80 pcieb_intr_handler+0x2aa(fffffe58c8663120, 0)
fffffe0079d0cbd0 apix_dispatch_by_vector+0x8c(20)
fffffe0079d0cc00 apix_dispatch_lowlevel+0x29(20, 0)
fffffe0079cb5a50 switch_sp_and_call+0x15()
fffffe0079cb5ab0 apix_do_interrupt+0xf3(fffffe0079cb5ac0, 0)
fffffe0079cb5ac0 _interrupt+0xc3()
fffffe0079cb5bb0 i86_mwait+0x12()
fffffe0079cb5be0 cpu_idle_mwait+0x14b()
fffffe0079cb5c00 idle+0xa8()
fffffe0079cb5c10 thread_start+0xb()
>
AND I had `fmadm faulty -v` data:
[root@nuc ~]# fmadm faulty -v
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 22 05:49:07 61a2503a-bc29-4bd4-bc0c-5cf6b915f084 PCIEX-8000-1P Major
Host : nuc
Platform : NUC11PAHi5 Chassis_id : G6PA13900DV9
Product_sn :
Fault class : fault.io.pciex.device-interr 67%
fault.io.pciex.device-invreq 33%
Affects : dev:////pci@0,0/pci8086,a0bc@1c/pci8086,3004@0
dev:////pci@0,0/pci8086,a0bc@1c
faulted and taken out of service
Problem in : "MB" (hc://:product-id=NUC11PAHi5:server-id=nuc:chassis-id=G6PA13900DV9/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=87/pciexdev=0/pciexfn=0)
"MB" (hc://:product-id=NUC11PAHi5:server-id=nuc:chassis-id=G6PA13900DV9/motherboard=0/hostbridge=2/pciexrc=2)
faulted and taken out of service
FRU : "MB" (hc://:product-id=NUC11PAHi5:server-id=nuc:chassis-id=G6PA13900DV9/motherboard=0)
faulty
Description : Either the transmitting device sent an invalid request or the
receiving device is reporting an internal fault.
Refer to http://illumos.org/msg/PCIEX-8000-1P for more
information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected
device(s). Use fmadm faulty to identify the devices or contact
your illumos distribution team for support.
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 22 05:49:07 783364d5-38f0-4fca-9f40-9460ac76a025 SUNOS-8000-J0 Major
Host : nuc
Platform : NUC11PAHi5 Chassis_id : G6PA13900DV9
Product_sn :
Fault class : defect.sunos.eft.unexpected_telemetry 50%
fault.sunos.eft.unexpected_telemetry 50%
Problem in : dev:////pci@0,0
faulted and taken out of service
Description : The diagnosis engine encountered telemetry from the listed
devices for which it was unable to perform a diagnosis - Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
information. Refer to http://illumos.org/msg/SUNOS-8000-J0 for
more information.
Response : Error reports have been logged for examination by your illumos
distribution team.
Impact : Automated diagnosis and response for these events will not occur.
Action : Ensure that the latest illumos Kernel and Predictive Self-Healing
(PSH) updates are installed.
A quick look at `pcieadm show-devs` gave me this device which matches up with the first fault report:
57/0/0 PCIe Gen 2x1 sdhost0 GL9755 SD Host Controller
I have no SD card in this machine.
BUT the process that was running:
/usr/lib/pci/pcieadm save-cfgspace -a /tmp/pcieadm-priv.41234
might be important.
I tried it again w/o clearing FMA, and I induced another panic.
When I tried it a third time, after clearing FMA and saving to a different directory on ZFS, I couldn't induce the panic. When I pointed it back at the /tmp directory (exactly as above), I did induce the panic.
What I've figured out:
- Can induce if and only if the dirname in /tmp is sufficiently long (TBD)
- Same device faults:
87 aka 0x57 aka
57/0/0 PCIe Gen 2x1 sdhost0 GL9755 SD Host Controller
And I have multiple coredumps for examination.
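For anyone cross-checking the fault FMRIs against pcieadm's b/d/f notation, here is a quick sketch of the standard PCI BDF packing (helper names are mine, not from any illumos API):

```python
def bdf_pack(bus, dev, fn):
    """Pack bus/device/function into the 16-bit BDF form seen in ereports."""
    return (bus << 8) | (dev << 3) | fn

def bdf_unpack(bdf):
    """Split a packed BDF back into (bus, dev, fn)."""
    return (bdf >> 8) & 0xff, (bdf >> 3) & 0x1f, bdf & 0x7

# The faulted SD controller: pcieadm's 57/0/0 is hex, i.e. bus 0x57 == 87,
# which matches pciexbus=87 in the fault FMRI and bdf=5700 in the ereports.
assert bdf_pack(0x57, 0, 0) == 0x5700
assert bdf_unpack(0x5700) == (0x57, 0, 0)
```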
Thanks,
Dan
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-02-22 16:08 Intel NUC (Gen 11) panic during pcieadm illumos testing Dan McDonald
@ 2024-02-22 18:16 ` Dan McDonald
2024-02-22 18:47 ` Robert Mustacchi
1 sibling, 0 replies; 14+ messages in thread
From: Dan McDonald @ 2024-02-22 18:16 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/plain, Size: 324 bytes --]
For the PCIe inclined:
- I managed to get `pcieadm save-cfgspace` to output something about the faulty device (57-00-00.pci).
- I used `pcieadm show-cfgspace -f path/57-00-00.pci` and found some (to my naive reading) questionable things.
I'm attaching both the binary and the show-cfgspace output as files.
Dan
[-- Attachment #2: 57-00-00.pci --]
[-- Type: application/octet-stream, Size: 4096 bytes --]
[-- Attachment #3: show-cfg-57-00-00.txt --]
[-- Type: text/plain, Size: 18263 bytes --]
Device pcieadm-cfgspace-2/57-00-00.pci -- Type 0 Header
Vendor ID: 0x17a0 -- Genesys Logic, Inc
Device ID: 0x9755 -- GL9755 SD Host Controller
Command: 0x46
|--> I/O Space: disabled (0x0)
|--> Memory Space: enabled (0x2)
|--> Bus Master: enabled (0x4)
|--> Special Cycle: disabled (0x0)
|--> Memory Write and Invalidate: disabled (0x0)
|--> VGA Palette Snoop: disabled (0x0)
|--> Parity Error Response: enabled (0x40)
|--> IDSEL Stepping/Wait Cycle Control: unsupported (0x0)
|--> SERR# Enable: disabled (0x0)
|--> Fast Back-to-Back Transactions: disabled (0x0)
|--> Interrupt X: enabled (0x0)
Status: 0x10
|--> Immediate Readiness: unsupported (0x0)
|--> Interrupt Status: not pending (0x0)
|--> Capabilities List: supported (0x10)
|--> 66 MHz Capable: unsupported (0x0)
|--> Fast Back-to-Back Capable: unsupported (0x0)
|--> Master Data Parity Error: no error (0x0)
|--> DEVSEL# Timing: fast (0x0)
|--> Signaled Target Abort: no (0x0)
|--> Received Target Abort: no (0x0)
|--> Received Master Abort: no (0x0)
|--> Signaled System Error: no (0x0)
|--> Detected Parity Error: no (0x0)
Revision ID: 0x0
Class Code: 0x80501
|--> Class Code: 0x8
|--> Sub-Class Code: 0xa
|--> Programming Interface: 0x1
Cache Line Size: 0x40 bytes
Latency Timer: 0x0 cycles
Header Type: 0x0
|--> Header Layout: Device (0x0)
|--> Multi-Function Device: no (0x0)
BIST: 0x0
|--> Completion Code: 0x0
|--> Start BIST: 0x0
|--> BIST Capable: unsupported (0x0)
Base Address Register 0
|--> Space: Memory Space (0x0)
|--> Address: 0x6a400000
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 1
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 2
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 3
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 4
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 5
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Cardbus CIS Pointer: 0x0
Subsystem Vendor ID: 0x8086 -- Intel Corporation
Subsystem Device ID: 0x3004
Expansion ROM: 0x0
|--> Enable: disabled (0x0)
|--> Validation Status: not supported (0x0)
|--> Validation Details: 0x0
|--> Base Address: 0x0
Capabilities Pointer: 0x80
Interrupt Line: 0xff
Interrupt Pin: 0x1 -- INTA
Min_Gnt: 0x0
Min_Lat: 0x0
PCI Express Capability (0x10)
Capability Register: 0x2
|--> Version: 0x2
|--> Device/Port Type: PCIe Endpoint (0x0)
|--> Slot Implemented: No (0x0)
|--> Interrupt Message Number: 0x0
Device Capabilities: 0x5908da1
|--> Max Payload Size Supported: 256 bytes (0x1)
|--> Phantom Functions Supported: No (0x0)
|--> Extended Tag Field: 8-bit (0x20)
|--> L0s Acceptable Latency: 4 us (0x180)
|--> L1 Acceptable Latency: 64 us (0xc00)
|--> Role Based Error Reporting: supported (0x8000)
|--> ERR_COR Subclass: unsupported (0x0)
|--> Captured Slot Power Limit: 0x64
|--> Captured Slot Power Limit Scale: 0.1x (0x4000000)
|--> Function Level Reset: unsupported (0x0)
Device Control: 0x2037
|--> Correctable Error Reporting: enabled (0x1)
|--> Non-Fatal Error Reporting: enabled (0x2)
|--> Fatal Error Reporting: enabled (0x4)
|--> Unsupported Request Reporting: disabled (0x0)
|--> Relaxed Ordering: enabled (0x10)
|--> Max Payload Size: 256 bytes (0x20)
|--> Extended Tag Field: disabled (0x0)
|--> Phantom Functions: disabled (0x0)
|--> Aux Power PM: disabled (0x0)
|--> No Snoop: disabled (0x0)
|--> Max Read Request Size: 512 bytes (0x2000)
|--> Bridge Configuration Retry / Function Level Reset: 0x0
Device Status: 0x11
|--> Correctable Error Detected: yes (0x1)
|--> Non-Fatal Error Detected: no (0x0)
|--> Fatal Error Detected: no (0x0)
|--> Unsupported Request Detected: no (0x0)
|--> AUX Power Detected: yes (0x10)
|--> Transactions Pending: no (0x0)
|--> Emergency Power Reduction Detected: no (0x0)
Link Capabilities: 0x55477812
|--> Maximum Link Speed: 5.0 GT/s (0x2)
|--> Maximum Link Width: 0x1
|--> ASPM Support: L1 (0x800)
|--> L0s Exit Latency: >4us (0x7000)
|--> L1 Exit Latency: >64us (0x30000)
|--> Clock Power Management: supported (0x40000)
|--> Surprise Down Error Reporting: unsupported (0x0)
|--> Data Link Layer Active Reporting: unsupported (0x0)
|--> Link Bandwidth Notification Capability: unsupported (0x0)
|--> ASPM Optionality Compliance: compliant (0x400000)
|--> Port Number: 0x55
Link Control: 0x142
|--> ASPM Control: L1 (0x2)
|--> Read Completion Boundary: 64 byte (0x0)
|--> Link Disable: not force disabled (0x0)
|--> Retrain Link: 0x0
|--> Common Clock Configuration: common (0x40)
|--> Extended Sync: disabled (0x0)
|--> Clock Power Management: enabled (0x100)
|--> Hardware Autonomous Width: enabled (0x0)
|--> Link Bandwidth Management Interrupt: disabled (0x0)
|--> Link Autonomous Bandwidth Interrupt: disabled (0x0)
|--> DRS Signaling Control: not reported (0x0)
Link Status: 0x1012
|--> Link Speed: 5.0 GT/s (0x2)
|--> Link Width: 0x1
|--> Link Training: no (0x0)
|--> Slot Clock Configuration: common (0x1000)
|--> Data Link Layer Link Active: no (0x0)
|--> Link Bandwidth Management Status: no change (0x0)
|--> Link Autonomous Bandwidth Status: no change (0x0)
Slot Capabilities: 0x0
|--> Attention Button Present: no (0x0)
|--> Power Controller Present: no (0x0)
|--> MRL Sensor Present: no (0x0)
|--> Attention Indicator Present: no (0x0)
|--> Power Indicator Present: no (0x0)
|--> Hot-Plug Surprise: unsupported (0x0)
|--> Hot-Plug Capable : unsupported (0x0)
|--> Slot Power Limit Value: 0x0
|--> Slot Power Limit Scale: 0x0
|--> Electromechanical Interlock Present: no (0x0)
|--> No Command Completed: unsupported (0x0)
|--> Physical Slot Number: 0x0
Slot Control: 0x0
|--> Attention Button Pressed Reporting: disabled (0x0)
|--> Power Fault Detected Reporting: disabled (0x0)
|--> MRL Sensor Changed Reporting: disabled (0x0)
|--> Presence Detect Changed Reporting: disabled (0x0)
|--> Command Complete Interrupt: disabled (0x0)
|--> Hot Plug Interrupt Enable: disabled (0x0)
|--> Attention Indicator Control: reserved (0x0)
|--> Power Indicator Control: reserved (0x0)
|--> Power Controller Control: power on (0x0)
|--> Electromechanical Interlock Control: 0x0
|--> Data Link Layer State Changed: disabled (0x0)
|--> Auto Slot Power Limit: enabled (0x0)
|--> In-Band PD: enabled (0x0)
Slot Status: 0x0
|--> Attention Button Pressed: no (0x0)
|--> Power Fault Detected: no (0x0)
|--> MRL Sensor Changed: no (0x0)
|--> Presence Detect Changed: no (0x0)
|--> Command Complete: no (0x0)
|--> MRL Sensor State: closed (0x0)
|--> Presence Detect State: not present (0x0)
|--> Electromechanical Interlock: disengaged (0x0)
|--> Data Link Layer State Changed: no (0x0)
Root Control: 0x0
|--> CRS Software Visibility: disabled (0x0)
Root Capabilities: 0x0
|--> System Error on Correctable Error: disabled (0x0)
|--> System Error on Non-Fatal Error: disabled (0x0)
|--> System Error on Fatal Error: disabled (0x0)
|--> PME Interrupt: disabled (0x0)
|--> CRS Software Visibility: disabled (0x0)
Root Status: 0x0
|--> PME Requester ID: 0x0
|--> PME Status: deasserted (0x0)
|--> PME Pending: no (0x0)
Device Capabilities 2: 0xc081f
|--> Completion Timeout Ranges Supported: 0xf
|--> 50us-10ms (0x1)
|--> 10ms-250ms (0x2)
|--> 250ms-4s (0x4)
|--> 4s-64s (0x8)
|--> Completion Timeout Disable: supported (0x10)
|--> ARI Forwarding: unsupported (0x0)
|--> AtomicOp Routing: unsupported (0x0)
|--> 32-bit AtomicOp Completer: unsupported (0x0)
|--> 64-bit AtomicOp Completer: unsupported (0x0)
|--> 128-bit CAS Completer: unsupported (0x0)
|--> No Ro-enabld PR-PR Passing: unsupported (0x0)
|--> LTR Mechanism: supported (0x800)
|--> TPH Completer: unsupported (0x0)
|--> LN System CLS: unsupported (0x0)
|--> 10-bit Tag Completer: unsupported (0x0)
|--> 10-bit Tag Requester: unsupported (0x0)
|--> OBFF: WAKE# and Message Signaling (0xc0000)
|--> Extended Fmt Field Supported: unsupported (0x0)
|--> End-End TLP Prefix Supported: unsupported (0x0)
|--> Max End-End TLP Prefixes: 4 (0x0)
|--> Emergency Power Reduction: unsupported (0x0)
|--> Emergency Power Reduction Initialization Required: no (0x0)
|--> Function Readiness Status: unsupported (0x0)
Device Control 2: 0x400
|--> Completion Timeout: 50us-50ms (0x0)
|--> Completion Timeout Disabled: not disabled (0x0)
|--> ARI Forwarding: disabled (0x0)
|--> AtomicOp Requester: disabled (0x0)
|--> AtomicOp Egress Blocking: unblocked (0x0)
|--> ID-Based Ordering Request: disabled (0x0)
|--> ID-Based Ordering Completion: disabled (0x0)
|--> LTR Mechanism: enabled (0x400)
|--> Emergency Power Reduction: not requested (0x0)
|--> 10-bit Tag Requester: disabled (0x0)
|--> OBFF: disabled (0x0)
|--> End-End TLP Prefix Blocking: unblocked (0x0)
Device Status 2: 0x0
Link Capabilities 2: 0x6
|--> Supported Link Speeds: 0x6
|--> 2.5 GT/s (0x2)
|--> 5.0 GT/s (0x4)
|--> Crosslink: unsupported (0x0)
|--> Lower SKP OS Generation Supported Speeds Vector: 0x0
|--> Lower SKP OS Reception Supported Speeds Vector: 0x0
|--> Retimer Presence Detect Supported: unsupported (0x0)
|--> Two Retimers Presence Detect Supported: unsupported (0x0)
|--> Device Readiness Status: unsupported (0x0)
Link Control 2: 0x2
|--> Target Link Speed: 5.0 GT/s (0x2)
|--> Enter Compliance: no (0x0)
|--> Hardware Autonomous Speed Disable: not disabled (0x0)
|--> Selectable De-emphasis: -6 dB (0x0)
|--> TX Margin: 0x0
|--> Enter Modified Compliance: no (0x0)
|--> Compliance SOS: disabled (0x0)
|--> Compliance Preset/De-emphasis: 0x0
Link Status 2: 0x0
|--> Current De-emphasis Level: -6 dB (0x0)
|--> Equalization 8.0 GT/s Complete: no (0x0)
|--> Equalization 8.0 GT/s Phase 1: unsuccessful (0x0)
|--> Equalization 8.0 GT/s Phase 2: unsuccessful (0x0)
|--> Equalization 8.0 GT/s Phase 3: unsuccessful (0x0)
|--> Link Equalization Request 8.0 GT/s: not requested (0x0)
|--> Retimer Presence Detected: no (0x0)
|--> Two Retimers Presence Detected: no (0x0)
|--> Crosslink Resolution: unsupported (0x0)
|--> Downstream Component Presence: link down - undetermined (0x0)
|--> DRS Message Received: no (0x0)
Slot Capabilities 2: 0x0
|--> In-Band PD Disable: unsupported (0x0)
Slot Control 2: 0x0
Slot Status 2: 0x0
Message Signaled Interrupts Capability (0x5)
Message Control: 0x80
|--> MSI Enable: disabled (0x0)
|--> Multiple Message Capable: 1 vector (0x0)
|--> Multiple Message Enabled: 1 vector (0x0)
|--> 64-bit Address Capable: supported (0x80)
|--> Per-Vector Masking Capable: unsupported (0x0)
|--> Extended Message Data Capable: unsupported (0x0)
|--> extended Message Data Enable: unsupported (0x0)
Message Address: 0x0
Upper Message Address: 0x0
Message Data: 0x0
PCI Power Management Capability (0x1)
Power Management Capabilities: 0xf7c3
|--> Version: 0x3
|--> PME Clock: not required (0x0)
|--> Immediate Readiness on Return to D0: no (0x0)
|--> Device Specific Initialization: no (0x0)
|--> Auxiliary Current: 375 mA (0x1c0)
|--> D1: supported (0x200)
|--> D2: supported (0x400)
|--> PME Support: 0xf000
|--> D1 (0x1000)
|--> D2 (0x2000)
|--> D3hot (0x4000)
|--> D3cold (0x8000)
Vendor Specific Extended Capability Capability (0xb)
Capability Header: 0x1081000b
|--> Capability ID: 0xb
|--> Capability Version: 0x1
|--> Next Capability Offset: 0x108
Vendor-Specific Header: 0x8117a0
|--> ID: 0x17a0
|--> Revision: 0x1
|--> Length: 0x8
Latency Tolerance Reporting Capability (0x18)
Capability Header: 0x11010018
|--> Capability ID: 0x18
|--> Capability Version: 0x1
|--> Next Capability Offset: 0x110
Max Snoop Latency: 0x1003
|--> Latency Value: 0x3
|--> Latency Scale: 1,048,576 ns (0x1000)
Max No-Snoop Latency: 0x1003
|--> Latency Value: 0x3
|--> Latency Scale: 1,048,576 ns (0x1000)
L1 PM Substates Capability (0x1e)
Capability Header: 0x2001001e
|--> Capability ID: 0x1e
|--> Capability Version: 0x1
|--> Next Capability Offset: 0x200
L1 PM Substates Capabilities: 0xfaff9f
|--> PCI-PM L1.2: supported (0x1)
|--> PCI-PM L1.1: supported (0x2)
|--> ASPM L1.2: supported (0x4)
|--> ASPM L1.1: supported (0x8)
|--> L1 PM Substates: supported (0x10)
|--> Link Activation: unsupported (0x0)
|--> Port Common_Mode_Restore_Time: 0xff
|--> Port T_POWER_ON Scale: 100 us (0x20000)
|--> Port T_POWER_ON Value: 0x1f
L1 PM Substates Control 1: 0x4125000f
|--> PCI-PM L1.2: enabled (0x1)
|--> PCI-PM L1.1: enabled (0x2)
|--> ASPM L1.2: enabled (0x4)
|--> ASPM L1.1: enabled (0x8)
|--> Link Activation Interrupt Enable: disabled (0x0)
|--> Link Activation Control: disabled (0x0)
|--> Common_Mode_Restore_Time: 0x0
|--> LTR L1.2 Threshold Value: 0x125
|--> LTR L1.2 Threshold Scale: 1024 ns (0x40000000)
L1 PM Substates Control 2: 0xfa
|--> T_POWER_ON Scale: 100 us (0x2)
|--> T_POWER_ON Value: 0x1f
Advanced Error Reporting Capability (0x1)
Capability Header: 0x10001
|--> Capability ID: 0x1
|--> Capability Version: 0x1
|--> Next Capability Offset: 0x0
Uncorrectable Error Status: 0x8000
|--> Data Link Protocol Error: 0x0
|--> Surprise Down Error: 0x0
|--> Poisoned TLP Received: 0x0
|--> Flow Control Protocol Error: 0x0
|--> Completion Timeout: 0x0
|--> Completion Abort: 0x1
|--> Unexpected Completion: 0x0
|--> Receiver Overflow: 0x0
|--> Malformed TLP: 0x0
|--> ECRC Error: 0x0
|--> Unsupported Request Error: 0x0
|--> ACS Violation: 0x0
|--> Uncorrectable Internal Error: 0x0
|--> MC Blocked TLP: 0x0
|--> AtomicOp Egress Blocked: 0x0
|--> TLP Prefix Blocked Error: 0x0
|--> Poisoned TLP Egress Blocked: 0x0
Uncorrectable Error Mask: 0x180000
|--> Data Link Protocol Error: 0x0
|--> Surprise Down Error: 0x0
|--> Poisoned TLP Received: 0x0
|--> Flow Control Protocol Error: 0x0
|--> Completion Timeout: 0x0
|--> Completion Abort: 0x0
|--> Unexpected Completion: 0x0
|--> Receiver Overflow: 0x0
|--> Malformed TLP: 0x0
|--> ECRC Error: 0x1
|--> Unsupported Request Error: 0x1
|--> ACS Violation: 0x0
|--> Uncorrectable Internal Error: 0x0
|--> MC Blocked TLP: 0x0
|--> AtomicOp Egress Blocked: 0x0
|--> TLP Prefix Blocked Error: 0x0
|--> Poisoned TLP Egress Blocked: 0x0
Uncorrectable Error Severity: 0x62030
|--> Data Link Protocol Error: 0x1
|--> Surprise Down Error: 0x1
|--> Poisoned TLP Received: 0x0
|--> Flow Control Protocol Error: 0x1
|--> Completion Timeout: 0x0
|--> Completion Abort: 0x0
|--> Unexpected Completion: 0x0
|--> Receiver Overflow: 0x1
|--> Malformed TLP: 0x1
|--> ECRC Error: 0x0
|--> Unsupported Request Error: 0x0
|--> ACS Violation: 0x0
|--> Uncorrectable Internal Error: 0x0
|--> MC Blocked TLP: 0x0
|--> AtomicOp Egress Blocked: 0x0
|--> TLP Prefix Blocked Error: 0x0
|--> Poisoned TLP Egress Blocked: 0x0
Correctable Error Status: 0x0
|--> Receiver Error: 0x0
|--> Bad TLP: 0x0
|--> Bad DLLP: 0x0
|--> REPLAY_NUM Rollover: 0x0
|--> Replay timer Timeout: 0x0
|--> Advisory Non-Fatal Error: 0x0
|--> Correctable Internal Error: 0x0
|--> Header Log Overflow: 0x0
Correctable Error Mask: 0x0
|--> Receiver Error: 0x0
|--> Bad TLP: 0x0
|--> Bad DLLP: 0x0
|--> REPLAY_NUM Rollover: 0x0
|--> Replay timer Timeout: 0x0
|--> Advisory Non-Fatal Error: 0x0
|--> Correctable Internal Error: 0x0
|--> Header Log Overflow: 0x0
Advanced Error Capabilities and Control: 0xaf
|--> First Error Pointer: 0xf
|--> ECRC Generation Capable: supported (0x20)
|--> ECRC Generation Enable: disabled (0x0)
|--> ECRC Check Capable: supported (0x80)
|--> ECRC Check Enable: disabled (0x0)
Header Log 0: 0x0
Header Log 1: 0x0
Header Log 2: 0x0
Header Log 3: 0x0
Root Error Command: 0x0
|--> Correctable Error Reporting: disabled (0x0)
|--> Non-Fatal Error Reporting: disabled (0x0)
|--> Fatal Error Reporting: disabled (0x0)
Root Error Status: 0x0
|--> ERR_COR Received: 0x0
|--> Multiple ERR_COR Received: 0x0
|--> ERR_FATAL/NONFATAL Received: 0x0
|--> Multiple ERR_FATAL/NONFATAL Received: 0x0
|--> First Uncorrectable Fatal: 0x0
|--> Non-Fatal Error Messages Received: 0x0
|--> Fatal Error Messages Received: 0x0
|--> ERR_COR Subclass: ECS Legacy (0x0)
|--> Advanced Error Interrupt Message: 0x0
Error Source Identification: 0x0
|--> ERR_COR Source: 0x0
|--> ERR_FATAL/NONFATAL Source: 0x0
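As a sanity check on the dump above, the Device Control value (0x2037) can be decoded by hand from the standard PCIe capability field layout. A minimal sketch (field and key names are mine; only a few fields are extracted):

```python
def decode_dev_ctl(val):
    """Decode selected fields of the PCIe Device Control register."""
    return {
        "corr_err_reporting": bool(val & 0x1),        # bit 0
        "nonfatal_err_reporting": bool(val & 0x2),    # bit 1
        "fatal_err_reporting": bool(val & 0x4),       # bit 2
        "unsup_req_reporting": bool(val & 0x8),       # bit 3
        "max_payload_bytes": 128 << ((val >> 5) & 0x7),   # bits 7:5
        "max_read_req_bytes": 128 << ((val >> 12) & 0x7), # bits 14:12
    }

d = decode_dev_ctl(0x2037)
assert d["max_payload_bytes"] == 256     # matches "Max Payload Size: 256 bytes"
assert d["max_read_req_bytes"] == 512    # matches "Max Read Request Size: 512 bytes"
assert d["fatal_err_reporting"]          # fatal error reporting is enabled
assert not d["unsup_req_reporting"]      # UR reporting is disabled, as shown
```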
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-02-22 16:08 Intel NUC (Gen 11) panic during pcieadm illumos testing Dan McDonald
2024-02-22 18:16 ` [developer] " Dan McDonald
@ 2024-02-22 18:47 ` Robert Mustacchi
2024-02-22 18:58 ` Dan McDonald
1 sibling, 1 reply; 14+ messages in thread
From: Robert Mustacchi @ 2024-02-22 18:47 UTC (permalink / raw)
To: illumos-developer, Dan McDonald
Hey Dan,
On 2/22/24 08:08, Dan McDonald wrote:
> TL;DR: I can panic my NUC 11 seemingly at will IF AND ONLY IF I use `pcieadm save-cfgspace` to a directory in /tmp.
>
> ...
>
> To help out in recent igc(4D) bringups, I acquired a NUC 11.
>
> For giggles, I ran the non-ZFS tests on it with this week's SmartOS release (with the addition of igc(4D) of course). I got this crash:
>
>> ::status
> debugging crash dump vmcore.0 (64-bit) from nuc
> operating system: 5.11 joyent_20240221T162850Z (i86pc)
> git branch: igc
> git rev: 01394a6ab5cbb41d518fdc5af9ab0948844923d6
> image uuid: (not set)
> panic message: pcieb-2: PCI(-X) Express Fatal Error. (0x43)
> dump content: kernel pages only
>> $C
> fffffe0079d0cae0 vpanic()
> fffffe0079d0cb80 pcieb_intr_handler+0x2aa(fffffe58c8663120, 0)
> fffffe0079d0cbd0 apix_dispatch_by_vector+0x8c(20)
> fffffe0079d0cc00 apix_dispatch_lowlevel+0x29(20, 0)
> fffffe0079cb5a50 switch_sp_and_call+0x15()
> fffffe0079cb5ab0 apix_do_interrupt+0xf3(fffffe0079cb5ac0, 0)
> fffffe0079cb5ac0 _interrupt+0xc3()
> fffffe0079cb5bb0 i86_mwait+0x12()
> fffffe0079cb5be0 cpu_idle_mwait+0x14b()
> fffffe0079cb5c00 idle+0xa8()
> fffffe0079cb5c10 thread_start+0xb()
So this generally means that we've received a PCIe AER (Advanced Error
Reporting) event on the bridge. Incoming errors are classified into
different types, roughly following the spec. In my opinion we should
probably end up handling more of these more gracefully, though some of
that depends on what's going on. Given that pcieadm is triggering this,
we're unlikely to be performing DMA, so at least we won't have poisoned
memory, but this is something we can improve upon.
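For reference, the Uncorrectable Error Status value in the attached cfgspace dump (0x8000) decodes to a single bit under the standard AER bit layout. A sketch (bit table transcribed from the PCIe spec; names are mine):

```python
# Standard AER Uncorrectable Error Status bit positions (PCIe Base Spec).
AER_UE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}

def decode_ue(status):
    """Return the names of the uncorrectable-error bits set in an AER value."""
    return [name for bit, name in AER_UE_BITS.items() if status & (1 << bit)]

# The GL9755's pcie_ue_status of 0x8000 is exactly one bit:
assert decode_ue(0x8000) == ["Completer Abort"]
# and its pcie_ue_mask of 0x180000 covers ECRC and Unsupported Request:
assert decode_ue(0x180000) == ["ECRC Error", "Unsupported Request Error"]
```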
> AND I had `fmadm faulty -v` data:
>
> [root@nuc ~]# fmadm faulty -v
> --------------- ------------------------------------ -------------- ---------
> TIME EVENT-ID MSG-ID SEVERITY
> --------------- ------------------------------------ -------------- ---------
> Feb 22 05:49:07 61a2503a-bc29-4bd4-bc0c-5cf6b915f084 PCIEX-8000-1P Major
> Host : nuc
> Platform : NUC11PAHi5 Chassis_id : G6PA13900DV9
> Product_sn :
> Fault class : fault.io.pciex.device-interr 67%
> fault.io.pciex.device-invreq 33%
> Affects : dev:////pci@0,0/pci8086,a0bc@1c/pci8086,3004@0
> dev:////pci@0,0/pci8086,a0bc@1c
> faulted and taken out of service
> Problem in : "MB" (hc://:product-id=NUC11PAHi5:server-id=nuc:chassis-id=G6PA13900DV9/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=87/pciexdev=0/pciexfn=0)
> "MB" (hc://:product-id=NUC11PAHi5:server-id=nuc:chassis-id=G6PA13900DV9/motherboard=0/hostbridge=2/pciexrc=2)
> faulted and taken out of service
> FRU : "MB" (hc://:product-id=NUC11PAHi5:server-id=nuc:chassis-id=G6PA13900DV9/motherboard=0)
> faulty
>
> Description : Either the transmitting device sent an invalid request or the
> receiving device is reporting an internal fault.
> Refer to http://illumos.org/msg/PCIEX-8000-1P for more
> information.
>
> Response : One or more device instances may be disabled
>
> Impact : Loss of services provided by the device instances associated with
> this fault
>
> Action : Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected
> device(s). Use fmadm faulty to identify the devices or contact
> your illumos distribution team for support.
>
> --------------- ------------------------------------ -------------- ---------
> TIME EVENT-ID MSG-ID SEVERITY
> --------------- ------------------------------------ -------------- ---------
> Feb 22 05:49:07 783364d5-38f0-4fca-9f40-9460ac76a025 SUNOS-8000-J0 Major
> Host : nuc
> Platform : NUC11PAHi5 Chassis_id : G6PA13900DV9
> Product_sn :
> Fault class : defect.sunos.eft.unexpected_telemetry 50%
> fault.sunos.eft.unexpected_telemetry 50%
> Problem in : dev:////pci@0,0
> faulted and taken out of service
>
> Description : The diagnosis engine encountered telemetry from the listed
> devices for which it was unable to perform a diagnosis - Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
> information. Refer to http://illumos.org/msg/SUNOS-8000-J0 for
> more information.
>
> Response : Error reports have been logged for examination by your illumos
> distribution team.
>
> Impact : Automated diagnosis and response for these events will not occur.
>
> Action : Ensure that the latest illumos Kernel and Predictive Self-Healing
> (PSH) updates are installed.
>
>
> A quick look at `pcieadm show-devs` gave me this device which matches up with the first fault report:
>
> 57/0/0 PCIe Gen 2x1 sdhost0 GL9755 SD Host Controller
>
> I have no SD card in this machine.
>
> BUT the process that was running:
>
> /usr/lib/pci/pcieadm save-cfgspace -a /tmp/pcieadm-priv.41234
>
> might be important.
So I think the first thing we'll want to look at here are the
specifics of the ereports. Could you check whether there is any
information available from fmdump -e, and then, in each dump, run the
following please:
> ::walk ereportq_pend | ::ereport -v
> ::walk ereportd_pend | ::ereport -v
In addition, if we can get the kernel stack of where the pcieadm process
is, that might be helpful, just for related information.
Sorry for the trouble.
Robert
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-02-22 18:47 ` Robert Mustacchi
@ 2024-02-22 18:58 ` Dan McDonald
2024-02-22 19:17 ` Robert Mustacchi
0 siblings, 1 reply; 14+ messages in thread
From: Dan McDonald @ 2024-02-22 18:58 UTC (permalink / raw)
To: illumos-developer
You got the walker names slightly wrong, but I figured it out, and here's everything you wanted:
1.) ereportq_pend
> ::walk ereportq_pend | ::ereport -v
>
2.) ereportq_dump
> ::walk ereportq_dump | ::ereport -v
class='ereport.io.pci.fabric'
ena=50aa0eefac301801
detector
version=00
scheme='dev'
device-path='/pci@0,0/pci8086,a0bc@1c'
bdf=00e0
device_id=a0bc
vendor_id=8086
rev_id=20
dev_type=0040
pcie_off=0040
pcix_off=0000
aer_off=0100
ecc_ver=0000
pci_status=0010
pci_command=0047
pci_bdg_sec_status=3000
pci_bdg_ctrl=0003
pcie_status=0010
pcie_command=0027
pcie_dev_cap=00008001
pcie_adv_ctl=00000000
pcie_ue_status=00000000 pcie_ue_mask=00100000
pcie_ue_sev=00060011
pcie_ue_hdr0=00000000
pcie_ue_hdr1=00000000
pcie_ue_hdr2=00000000
pcie_ue_hdr3=00000000
pcie_ce_status=00000000
pcie_ce_mask=00000000
pcie_rp_status=00000000
pcie_rp_control=0000
pcie_adv_rp_status=00000001
pcie_adv_rp_command=00000007
pcie_adv_rp_ce_src_id=5700
pcie_adv_rp_ue_src_id=0000
remainder=00000001
severity=00000001
class='ereport.io.pci.fabric'
ena=50aa0ef638701801
detector
version=00
scheme='dev'
device-path='/pci@0,0/pci8086,a0bc@1c/pci8086,3004@0'
bdf=5700
device_id=9755
vendor_id=17a0
rev_id=00
dev_type=0000
pcie_off=0080
pcix_off=0000
aer_off=0200
ecc_ver=0000
pci_status=0010
pci_command=0046
pcie_status=0011
pcie_command=2037
pcie_dev_cap=05908da1
pcie_adv_ctl=000000af
pcie_ue_status=00008000
pcie_ue_mask=00180000
pcie_ue_sev=00062030
pcie_ue_hdr0=00000000
pcie_ue_hdr1=00000000
pcie_ue_hdr2=00000000
pcie_ue_hdr3=00000000
pcie_ce_status=00001000
pcie_ce_mask=00000000
pcie_ue_tgt_trans=00000000
pcie_ue_tgt_addr=0000000000000000
pcie_ue_tgt_bdf=ffff
remainder=00000000
severity=00000042
>
3.) Stack of pcieadm process:
> ::pgrep pcieadm |::ps -t
S PID PPID PGID SID UID FLAGS ADDR NAME
R 4926 4762 4926 4762 0 0x4a004000 fffffe58f9698078 pcieadm
T 0xfffffe58dd6ec080 <TS_ONPROC>
> 0xfffffe58dd6ec080::findstack -v
stack pointer for thread fffffe58dd6ec080 (pcieadm/1): fffffe0079e3a4e0
fffffe0079e3a510 apix_setspl+0x22(fffffffff38ba231)
fffffe0079e3a550 do_splx+0x84(b)
fffffe0079e3a5e0 xc_common+0x249(fffffffffb831670, fffffe58b99cb3f8, fffffe0079e3a648, 0, fffffe0079e3a680, 2)
fffffe0079e3a630 xc_call+0x3d(fffffe58b99cb3f8, fffffe0079e3a648, 0, fffffe0079e3a680, fffffffffb831670)
fffffe0079e3a6f0 hat_tlb_inval_range+0x28d(ff17a2, fffffe0079e3a708)
fffffe0079e3a740 x86pte_mapin+0x42(ff17a2, 14c, fffffe58b9bfa5c0)
fffffe0079e3a760 x86pte_release_pagetable+0x1d(fffffe58b9bfa5c0)
fffffe0079e3a7e0 x86pte_set+0x108(fffffe58b9bfa5c0, 14c, 80000000c5700473, 0)
fffffe0079e3a880 hati_pte_map+0x449(f1, f, b, fffffe0079e3a910, fffffe58cba27000, fffffe58b9bfa5c0)
fffffe0079e3a910 0xb()
fffffe0079e3aa20 pci_cfgacc_mmio+0xba(fffffe0079e3aa78)
fffffe0079e3aa50 pci_cfgacc_acc+0x69(fffffe0079e3aa78)
fffffe0079e3aae0 pcitool_cfg_access+0x180(fffffe0079e3ab10, 0, 0)
fffffe0079e3abc0 pcitool_dev_reg_ops+0x212(fffffe58b9b84688, fffffc7fffdfe5c0, 50435401, 202001)
fffffe0079e3ac30 pci_common_ioctl+0xc5(fffffe58b9b84688, 7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
fffffe0079e3acc0 npe_ioctl+0xa5(7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
fffffe0079e3ad00 cdev_ioctl+0x3f(7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
fffffe0079e3ad50 spec_ioctl+0x55(fffffe58f99d8d00, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18, 0)
fffffe0079e3ade0 fop_ioctl+0x40(fffffe58f99d8d00, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18, 0)
fffffe0079e3af00 ioctl+0x144(5, 50435401, fffffc7fffdfe5c0)
fffffe0079e3af10 sys_syscall+0x1a8()
>
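As an aside, the ioctl command 0x50435401 that appears throughout that stack looks like an ASCII-tagged pcitool command; this decode is my interpretation, not taken from the illumos headers:

```python
cmd = 0x50435401
# The top three bytes appear to spell an ASCII tag; the low byte would be
# the subcommand number.
tag = bytes((cmd >> shift) & 0xff for shift in (24, 16, 8))
assert tag == b"PCT"
assert cmd & 0xff == 1
```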
Combine that with the attachments I sent, and maybe you'll see something I don't?
Dan
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-02-22 18:58 ` Dan McDonald
@ 2024-02-22 19:17 ` Robert Mustacchi
2024-02-22 19:38 ` Dan McDonald
0 siblings, 1 reply; 14+ messages in thread
From: Robert Mustacchi @ 2024-02-22 19:17 UTC (permalink / raw)
To: illumos-developer, Dan McDonald
On 2/22/24 10:58, Dan McDonald wrote:
> You got the walker names slightly wrong, but I figured it out, and here's everything you wanted:
>
> 1.) ereportq_pend
>
>> ::walk ereportq_pend | ::ereport -v
>
> 2.) ereportq_dump
>
>> ::walk ereportq_dump | ::ereport -v
> class='ereport.io.pci.fabric'
> ena=50aa0eefac301801
> detector
> version=00
> scheme='dev'
> device-path='/pci@0,0/pci8086,a0bc@1c'
> bdf=00e0
> device_id=a0bc
> vendor_id=8086
> rev_id=20
> dev_type=0040
> pcie_off=0040
> pcix_off=0000
> aer_off=0100
> ecc_ver=0000
> pci_status=0010
> pci_command=0047
> pci_bdg_sec_status=3000
> pci_bdg_ctrl=0003
> pcie_status=0010
> pcie_command=0027
> pcie_dev_cap=00008001
> pcie_adv_ctl=00000000
> pcie_ue_status=00000000 pcie_ue_mask=00100000
> pcie_ue_sev=00060011
> pcie_ue_hdr0=00000000
> pcie_ue_hdr1=00000000
> pcie_ue_hdr2=00000000
> pcie_ue_hdr3=00000000
> pcie_ce_status=00000000
> pcie_ce_mask=00000000
> pcie_rp_status=00000000
> pcie_rp_control=0000
> pcie_adv_rp_status=00000001
> pcie_adv_rp_command=00000007
> pcie_adv_rp_ce_src_id=5700
> pcie_adv_rp_ue_src_id=0000
> remainder=00000001
> severity=00000001
The ue_status and ce_status are zero here, but the adv_rp_status is
likely what we were seeing here and what we'll need to go through. The
adv_rp_status shows an ERR_COR received (correctable error). The ce_src_id
is the requester ID that was having trouble, which corresponds to 57/0/0.
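The decoding above can be sketched mechanically; the bit positions follow the AER Root Error Status register layout in the PCIe base spec, and the requester-ID split is the standard bus/device/function packing. This is an illustrative sketch, not illumos code:

```python
# Decode a PCIe AER Root Error Status value and a requester ID.
# Bit layout per the PCIe base spec's AER capability; illustrative only.

ROOT_ERROR_STATUS_BITS = {
    0: "ERR_COR Received",
    1: "Multiple ERR_COR Received",
    2: "ERR_FATAL/NONFATAL Received",
    3: "Multiple ERR_FATAL/NONFATAL Received",
    4: "First Uncorrectable Fatal",
    5: "Non-Fatal Error Messages Received",
    6: "Fatal Error Messages Received",
}

def decode_root_error_status(val):
    """Return the names of the bits set in the Root Error Status register."""
    return [name for bit, name in ROOT_ERROR_STATUS_BITS.items()
            if val & (1 << bit)]

def split_requester_id(rid):
    """Split a 16-bit requester ID into (bus, device, function)."""
    return (rid >> 8) & 0xff, (rid >> 3) & 0x1f, rid & 0x7

# Values from the ereport above:
print(decode_root_error_status(0x00000001))  # pcie_adv_rp_status -> ERR_COR Received
print(split_requester_id(0x5700))            # pcie_adv_rp_ce_src_id -> bus 0x57, dev 0, fn 0
```

With the ereport's values, adv_rp_status=0x1 decodes to only "ERR_COR Received", and ce_src_id=0x5700 splits to bus 0x57, device 0, function 0 -- the 57/0/0 device.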
> class='ereport.io.pci.fabric'
> ena=50aa0ef638701801
> detector
> version=00
> scheme='dev'
> device-path='/pci@0,0/pci8086,a0bc@1c/pci8086,3004@0'
> bdf=5700
> device_id=9755
> vendor_id=17a0
> rev_id=00
> dev_type=0000
> pcie_off=0080
> pcix_off=0000
> aer_off=0200
> ecc_ver=0000
> pci_status=0010
> pci_command=0046
> pcie_status=0011
> pcie_command=2037
> pcie_dev_cap=05908da1
> pcie_adv_ctl=000000af
> pcie_ue_status=00008000
> pcie_ue_mask=00180000
> pcie_ue_sev=00062030
> pcie_ue_hdr0=00000000
> pcie_ue_hdr1=00000000
> pcie_ue_hdr2=00000000
> pcie_ue_hdr3=00000000
> pcie_ce_status=00001000
> pcie_ce_mask=00000000
> pcie_ue_tgt_trans=00000000
> pcie_ue_tgt_addr=0000000000000000
> pcie_ue_tgt_bdf=ffff
> remainder=00000000
> severity=00000042
So in these, the ce_status / ue_status are what we want to look at. The
ue_status has bit 15 set, which is Completer Abort Status. The ce_status
has bit 12 set, which is Replay Timer Timeout Status. The severity here
is 0x42, which we'll have to go translate to understand why that turned
into a panic there. I've not had time to dig into that yet.
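For reference, the Uncorrectable and Correctable Error Status bits named above can be checked the same way; bit numbers are per the PCIe base spec's AER capability (a sketch, not illumos's PCIe fabric code, and only a subset of the defined bits):

```python
# AER Uncorrectable Error Status bits (subset; per the PCIe base spec).
UE_STATUS_BITS = {
    4:  "Data Link Protocol Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}

# AER Correctable Error Status bits (subset).
CE_STATUS_BITS = {
    0:  "Receiver Error",
    6:  "Bad TLP",
    7:  "Bad DLLP",
    8:  "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}

def decode(val, table):
    """Return the names of the bits set in an AER status register value."""
    return [name for bit, name in table.items() if val & (1 << bit)]

# Values from the 57/0/0 ereport above:
ue = decode(0x00008000, UE_STATUS_BITS)  # bit 15 -> Completer Abort
ce = decode(0x00001000, CE_STATUS_BITS)  # bit 12 -> Replay Timer Timeout
```

The 0x42 severity is an illumos-internal classification of the fabric scan result, not a register value, so it isn't decodable from the spec tables above.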
> 3.) Stack of pcieadm process:
>
>> ::pgrep pcieadm |::ps -t
> S PID PPID PGID SID UID FLAGS ADDR NAME
> R 4926 4762 4926 4762 0 0x4a004000 fffffe58f9698078 pcieadm
> T 0xfffffe58dd6ec080 <TS_ONPROC>
>> 0xfffffe58dd6ec080::findstack -v
> stack pointer for thread fffffe58dd6ec080 (pcieadm/1): fffffe0079e3a4e0
> fffffe0079e3a510 apix_setspl+0x22(fffffffff38ba231)
> fffffe0079e3a550 do_splx+0x84(b)
> fffffe0079e3a5e0 xc_common+0x249(fffffffffb831670, fffffe58b99cb3f8, fffffe0079e3a648, 0, fffffe0079e3a680, 2)
> fffffe0079e3a630 xc_call+0x3d(fffffe58b99cb3f8, fffffe0079e3a648, 0, fffffe0079e3a680, fffffffffb831670)
> fffffe0079e3a6f0 hat_tlb_inval_range+0x28d(ff17a2, fffffe0079e3a708)
> fffffe0079e3a740 x86pte_mapin+0x42(ff17a2, 14c, fffffe58b9bfa5c0)
> fffffe0079e3a760 x86pte_release_pagetable+0x1d(fffffe58b9bfa5c0)
> fffffe0079e3a7e0 x86pte_set+0x108(fffffe58b9bfa5c0, 14c, 80000000c5700473, 0)
> fffffe0079e3a880 hati_pte_map+0x449(f1, f, b, fffffe0079e3a910, fffffe58cba27000, fffffe58b9bfa5c0)
> fffffe0079e3a910 0xb()
> fffffe0079e3aa20 pci_cfgacc_mmio+0xba(fffffe0079e3aa78)
> fffffe0079e3aa50 pci_cfgacc_acc+0x69(fffffe0079e3aa78)
> fffffe0079e3aae0 pcitool_cfg_access+0x180(fffffe0079e3ab10, 0, 0)
> fffffe0079e3abc0 pcitool_dev_reg_ops+0x212(fffffe58b9b84688, fffffc7fffdfe5c0, 50435401, 202001)
> fffffe0079e3ac30 pci_common_ioctl+0xc5(fffffe58b9b84688, 7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
> fffffe0079e3acc0 npe_ioctl+0xa5(7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
> fffffe0079e3ad00 cdev_ioctl+0x3f(7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
> fffffe0079e3ad50 spec_ioctl+0x55(fffffe58f99d8d00, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18, 0)
> fffffe0079e3ade0 fop_ioctl+0x40(fffffe58f99d8d00, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18, 0)
> fffffe0079e3af00 ioctl+0x144(5, 50435401, fffffc7fffdfe5c0)
> fffffe0079e3af10 sys_syscall+0x1a8()
So the thing I'd like to ask here is whether we can take the first
argument to pcitool_cfg_access() and print it as a pcitool_reg_t. Can
you do that in all the dumps, given you have multiple? That'll tell us
whether it's always trying to get at a given register or not.
Also, another question: are all the dumps similar, or not?
Robert
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-02-22 19:17 ` Robert Mustacchi
@ 2024-02-22 19:38 ` Dan McDonald
2024-03-07 7:25 ` Dan McDonald
0 siblings, 1 reply; 14+ messages in thread
From: Dan McDonald @ 2024-02-22 19:38 UTC (permalink / raw)
To: illumos-developer
On Feb 22, 2024, at 2:17 PM, Robert Mustacchi <rm@fingolfin.org> wrote:
>
> The ue_status and ce_status are zero here, but the adv_rp_status is
> likely what we were seeing here and what we'll need to go through. The
> adv_rp_status shows an ERR_COR received (correctable error). The ce_src_id
> is the requester ID that was having trouble, which corresponds to 57/0/0.
Cool!
>> 3.) Stack of pcieadm process:
>>
>>> ::pgrep pcieadm |::ps -t
>> S PID PPID PGID SID UID FLAGS ADDR NAME
>> R 4926 4762 4926 4762 0 0x4a004000 fffffe58f9698078 pcieadm
>> T 0xfffffe58dd6ec080 <TS_ONPROC>
>>> 0xfffffe58dd6ec080::findstack -v
>> stack pointer for thread fffffe58dd6ec080 (pcieadm/1): fffffe0079e3a4e0
>> fffffe0079e3a510 apix_setspl+0x22(fffffffff38ba231)
>> fffffe0079e3a550 do_splx+0x84(b)
>> fffffe0079e3a5e0 xc_common+0x249(fffffffffb831670, fffffe58b99cb3f8, fffffe0079e3a648, 0, fffffe0079e3a680, 2)
>> fffffe0079e3a630 xc_call+0x3d(fffffe58b99cb3f8, fffffe0079e3a648, 0, fffffe0079e3a680, fffffffffb831670)
>> fffffe0079e3a6f0 hat_tlb_inval_range+0x28d(ff17a2, fffffe0079e3a708)
>> fffffe0079e3a740 x86pte_mapin+0x42(ff17a2, 14c, fffffe58b9bfa5c0)
>> fffffe0079e3a760 x86pte_release_pagetable+0x1d(fffffe58b9bfa5c0)
>> fffffe0079e3a7e0 x86pte_set+0x108(fffffe58b9bfa5c0, 14c, 80000000c5700473, 0)
>> fffffe0079e3a880 hati_pte_map+0x449(f1, f, b, fffffe0079e3a910, fffffe58cba27000, fffffe58b9bfa5c0)
>> fffffe0079e3a910 0xb()
>> fffffe0079e3aa20 pci_cfgacc_mmio+0xba(fffffe0079e3aa78)
>> fffffe0079e3aa50 pci_cfgacc_acc+0x69(fffffe0079e3aa78)
>> fffffe0079e3aae0 pcitool_cfg_access+0x180(fffffe0079e3ab10, 0, 0)
>> fffffe0079e3abc0 pcitool_dev_reg_ops+0x212(fffffe58b9b84688, fffffc7fffdfe5c0, 50435401, 202001)
>> fffffe0079e3ac30 pci_common_ioctl+0xc5(fffffe58b9b84688, 7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
>> fffffe0079e3acc0 npe_ioctl+0xa5(7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
>> fffffe0079e3ad00 cdev_ioctl+0x3f(7d000000fd, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18)
>> fffffe0079e3ad50 spec_ioctl+0x55(fffffe58f99d8d00, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18, 0)
>> fffffe0079e3ade0 fop_ioctl+0x40(fffffe58f99d8d00, 50435401, fffffc7fffdfe5c0, 202001, fffffe58f96ee3f8, fffffe0079e3ae18, 0)
>> fffffe0079e3af00 ioctl+0x144(5, 50435401, fffffc7fffdfe5c0)
>> fffffe0079e3af10 sys_syscall+0x1a8()
>
> So the thing I'd like to ask here is if we can take the first argument
> to pcitool_cfg_access() and print it as a pcitool_reg_t. Can you do that
> in all the dumps given you have multiple? That'll tell us if it's always
> trying to get at a given register or not.
Some shell pipeline arcana, and:
[root@nuc /var/crash/volatile]# for a in 0 1 2 3 4 ; do echo "::pgrep pcieadm |::print proc_t p_tlist |::findstack -v" | mdb $a | grep pcitool_cfg_access | awk -F\( '{print $2}' | awk -F, '{print $1 "::print pcitool_reg_t"}' | mdb $a; done
{
user_version = 0x2
drvr_version = 0
bus_no = 0x57
dev_no = 0
func_no = 0
barnum = 0
offset = 0x50
acc_attr = 0x2
padding1 = 0
data = 0
status = 0
padding2 = 0
phys_addr = 0
}
{
user_version = 0x2
drvr_version = 0
bus_no = 0x57
dev_no = 0
func_no = 0
barnum = 0
offset = 0x18
acc_attr = 0x2
padding1 = 0
data = 0
status = 0
padding2 = 0
phys_addr = 0
}
{
user_version = 0x2
drvr_version = 0
bus_no = 0x57
dev_no = 0
func_no = 0
barnum = 0
offset = 0x4c
acc_attr = 0x2
padding1 = 0
data = 0
status = 0
padding2 = 0
phys_addr = 0
}
{
user_version = 0x2
drvr_version = 0
bus_no = 0x57
dev_no = 0
func_no = 0
barnum = 0
offset = 0x50
acc_attr = 0x2
padding1 = 0
data = 0
status = 0
padding2 = 0
phys_addr = 0
}
{
user_version = 0x2
drvr_version = 0
bus_no = 0x57
dev_no = 0
func_no = 0
barnum = 0
offset = 0xe58
acc_attr = 0x2
padding1 = 0
data = 0
status = 0
padding2 = 0
phys_addr = 0
}
Hope this helps?
Dan
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-02-22 19:38 ` Dan McDonald
@ 2024-03-07 7:25 ` Dan McDonald
2024-03-07 7:58 ` Pramod Batni
0 siblings, 1 reply; 14+ messages in thread
From: Dan McDonald @ 2024-03-07 7:25 UTC (permalink / raw)
To: illumos-developer
UPDATE...
I ran smartos-test again on the NUC (because igc(4D) is now in, and its actual hardware lets all of the BHYVE tests pass). It still panics when it saves the config space to /tmp/.
If I save the config space to /var/tmp (i.e. not tmpfs) it works without issue (or panic).
I wonder what about tmpfs makes bad things happen? I have multiple kernel dumps and can send them along to anyone who's interested. They pair up nicely with this week's SmartOS release (20240307). I also have the successful save-cfgspace directory in /var/tmp.
Here's a *kernel* stack from the pcieadm command. Looks like VM and cross-calls are in its stack:
> fffffe58fa48f180::ps -tf
S PID PPID PGID SID UID FLAGS ADDR NAME
R 41904 41882 6717 5533 0 0x4a004000 fffffe58fa48f180 /usr/lib/pci/pcieadm save-cfgspace -a /tmp/pcieadm-priv.41882
T 0xfffffe5953674040 <TS_ONPROC>
> 0xfffffe5953674040::findstack -v
stack pointer for thread fffffe5953674040 (pcieadm/1): fffffe007c50b4e0
fffffe007c50b510 apix_setspl+0x22(fffffffff38bd231)
fffffe007c50b550 do_splx+0x84(b)
fffffe007c50b5e0 xc_common+0x249(fffffffffb831950, fffffe58b99cb3f8, fffffe007c50b648, 0, fffffe007c50b680, 2)
fffffe007c50b630 xc_call+0x3d(fffffe58b99cb3f8, fffffe007c50b648, 0, fffffe007c50b680, fffffffffb831950)
fffffe007c50b6f0 hat_tlb_inval_range+0x28d(ff17a2, fffffe007c50b708)
fffffe007c50b740 x86pte_mapin+0x42(ff17a2, 14c, fffffe58b9bfa5c0)
fffffe007c50b760 x86pte_release_pagetable+0x1d(fffffe58b9bfa5c0)
fffffe007c50b7e0 x86pte_set+0x108(fffffe58b9bfa5c0, 14c, 80000000c5700473, 0)
fffffe007c50b880 hati_pte_map+0x449(f1, f, b, fffffe007c50b910, fffffe58cc600000, fffffe58b9bfa5c0)
fffffe007c50b910 0xb()
fffffe007c50ba20 pci_cfgacc_mmio+0xba(fffffe007c50ba78)
fffffe007c50ba50 pci_cfgacc_acc+0x69(fffffe007c50ba78)
fffffe007c50bae0 pcitool_cfg_access+0x180(fffffe007c50bb10, 0, 0)
fffffe007c50bbc0 pcitool_dev_reg_ops+0x212(fffffe58b9b84688, fffffc7fffdfe4f0, 50435401, 202001)
fffffe007c50bc30 pci_common_ioctl+0xc5(fffffe58b9b84688, 7d000000fd, 50435401, fffffc7fffdfe4f0, 202001, fffffe5953707af8, fffffe007c50be18)
fffffe007c50bcc0 npe_ioctl+0xa5(7d000000fd, 50435401, fffffc7fffdfe4f0, 202001, fffffe5953707af8, fffffe007c50be18)
fffffe007c50bd00 cdev_ioctl+0x3f(7d000000fd, 50435401, fffffc7fffdfe4f0, 202001, fffffe5953707af8, fffffe007c50be18)
fffffe007c50bd50 spec_ioctl+0x55(fffffe58f9758640, 50435401, fffffc7fffdfe4f0, 202001, fffffe5953707af8, fffffe007c50be18, 0)
fffffe007c50bde0 fop_ioctl+0x40(fffffe58f9758640, 50435401, fffffc7fffdfe4f0, 202001, fffffe5953707af8, fffffe007c50be18, 0)
fffffe007c50bf00 ioctl+0x144(5, 50435401, fffffc7fffdfe4f0)
fffffe007c50bf10 sys_syscall+0x1a8()
> fffffe58fa48f180::pfiles
FD TYPE VNODE INFO
0 FIFO fffffe58fa073e40
1 CHR fffffe58d8e20d80 /devices/pseudo/mm@0:null
2 FIFO fffffe58fab4a800
3 DIR fffffe590be22d00 /tmp/pcieadm-priv.41882
4 REG fffffe5953702240 /tmp/pcieadm-priv.41882/57-00-00.pci
5 CHR fffffe58f9758640 /devices/pci@0,0:reg
>
Dan
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-03-07 7:25 ` Dan McDonald
@ 2024-03-07 7:58 ` Pramod Batni
2024-03-07 16:04 ` Dan McDonald
0 siblings, 1 reply; 14+ messages in thread
From: Pramod Batni @ 2024-03-07 7:58 UTC (permalink / raw)
To: illumos-developer
Since the panics are seen when tmpfs is in use, is there a possibility that
there is not much physical memory (RAM) available in the system?
What does the `::memstat` dcmd spit out when executed on the crash dump?
Thanks
Pramod
On Thu, 7 Mar 2024 at 12:58, Dan McDonald <danmcd@mnx.io> wrote:
> UPDATE...
>
> I ran smartos-test again on the NUC (because igc(4D) is now in, and its
> actual hardware lets all of the BHYVE tests pass). It still panics when it
> saves the config space to /tmp/.
>
> If I save the config space to /var/tmp (i.e. not tmpfs) it works without
> issue (or panic).
>
> I wonder what about tmpfs makes bad things happen? I have multiple kernel
> dumps and can send them along to anyone who's interested. They pair up
> nicely with this week's SmartOS release (20240307). I also have the
> successful save-cfgspace directory in /var/tmp.
>
> Here's a *kernel* stack from the pcieadm command. Looks like VM and
> cross-calls are in its stack:
>
> > fffffe58fa48f180::ps -tf
> S PID PPID PGID SID UID FLAGS ADDR NAME
> R 41904 41882 6717 5533 0 0x4a004000 fffffe58fa48f180
> /usr/lib/pci/pcieadm save-cfgspace -a /tmp/pcieadm-priv.41882
> T 0xfffffe5953674040 <TS_ONPROC>
> > 0xfffffe5953674040::findstack -v
> stack pointer for thread fffffe5953674040 (pcieadm/1): fffffe007c50b4e0
> fffffe007c50b510 apix_setspl+0x22(fffffffff38bd231)
> fffffe007c50b550 do_splx+0x84(b)
> fffffe007c50b5e0 xc_common+0x249(fffffffffb831950, fffffe58b99cb3f8,
> fffffe007c50b648, 0, fffffe007c50b680, 2)
> fffffe007c50b630 xc_call+0x3d(fffffe58b99cb3f8, fffffe007c50b648, 0,
> fffffe007c50b680, fffffffffb831950)
> fffffe007c50b6f0 hat_tlb_inval_range+0x28d(ff17a2, fffffe007c50b708)
> fffffe007c50b740 x86pte_mapin+0x42(ff17a2, 14c, fffffe58b9bfa5c0)
> fffffe007c50b760 x86pte_release_pagetable+0x1d(fffffe58b9bfa5c0)
> fffffe007c50b7e0 x86pte_set+0x108(fffffe58b9bfa5c0, 14c,
> 80000000c5700473, 0)
> fffffe007c50b880 hati_pte_map+0x449(f1, f, b, fffffe007c50b910,
> fffffe58cc600000, fffffe58b9bfa5c0)
> fffffe007c50b910 0xb()
> fffffe007c50ba20 pci_cfgacc_mmio+0xba(fffffe007c50ba78)
> fffffe007c50ba50 pci_cfgacc_acc+0x69(fffffe007c50ba78)
> fffffe007c50bae0 pcitool_cfg_access+0x180(fffffe007c50bb10, 0, 0)
> fffffe007c50bbc0 pcitool_dev_reg_ops+0x212(fffffe58b9b84688,
> fffffc7fffdfe4f0, 50435401, 202001)
> fffffe007c50bc30 pci_common_ioctl+0xc5(fffffe58b9b84688, 7d000000fd,
> 50435401, fffffc7fffdfe4f0, 202001, fffffe5953707af8, fffffe007c50be18)
> fffffe007c50bcc0 npe_ioctl+0xa5(7d000000fd, 50435401, fffffc7fffdfe4f0,
> 202001, fffffe5953707af8, fffffe007c50be18)
> fffffe007c50bd00 cdev_ioctl+0x3f(7d000000fd, 50435401, fffffc7fffdfe4f0,
> 202001, fffffe5953707af8, fffffe007c50be18)
> fffffe007c50bd50 spec_ioctl+0x55(fffffe58f9758640, 50435401,
> fffffc7fffdfe4f0, 202001, fffffe5953707af8, fffffe007c50be18, 0)
> fffffe007c50bde0 fop_ioctl+0x40(fffffe58f9758640, 50435401,
> fffffc7fffdfe4f0, 202001, fffffe5953707af8, fffffe007c50be18, 0)
> fffffe007c50bf00 ioctl+0x144(5, 50435401, fffffc7fffdfe4f0)
> fffffe007c50bf10 sys_syscall+0x1a8()
> > fffffe58fa48f180::pfiles
> FD TYPE VNODE INFO
> 0 FIFO fffffe58fa073e40
> 1 CHR fffffe58d8e20d80 /devices/pseudo/mm@0:null
> 2 FIFO fffffe58fab4a800
> 3 DIR fffffe590be22d00 /tmp/pcieadm-priv.41882
> 4 REG fffffe5953702240 /tmp/pcieadm-priv.41882/57-00-00.pci
> 5 CHR fffffe58f9758640 /devices/pci@0,0:reg
>
> Dan
>
>
> ------------------------------------------
> illumos: illumos-developer
> Permalink:
> https://illumos.topicbox.com/groups/developer/T32516b4f237e0abc-M31bf958348fbe3101fe29726
> Delivery options:
> https://illumos.topicbox.com/groups/developer/subscription
>
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-03-07 7:58 ` Pramod Batni
@ 2024-03-07 16:04 ` Dan McDonald
2024-03-08 2:35 ` Pramod Batni
0 siblings, 1 reply; 14+ messages in thread
From: Dan McDonald @ 2024-03-07 16:04 UTC (permalink / raw)
To: illumos-developer
On Mar 7, 2024, at 2:58 AM, Pramod Batni <pramod.batni@gmail.com> wrote:
>
> Since the panics are seen when tmpfs is in use, is there a possibility that there is not much physical memory (RAM) available in the system?
Nope. It's got 64GB RAM.
> Is there a system into which I can
> login and look at these crash dumps?
> (If so, can you please share the system
> access details)
Remote login, I can't. But I *can* share a vmdump.N file if you wish for download. The smallest one I have is 556MiB, which will expand to 3-ish GiB after `savecore -f`. Best paired with this week's SmartOS release.
Dan
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-03-07 16:04 ` Dan McDonald
@ 2024-03-08 2:35 ` Pramod Batni
2024-03-08 4:40 ` Dan McDonald
0 siblings, 1 reply; 14+ messages in thread
From: Pramod Batni @ 2024-03-08 2:35 UTC (permalink / raw)
To: illumos-developer
On Thu, 7 Mar 2024 at 21:37, Dan McDonald <danmcd@mnx.io> wrote:
> On Mar 7, 2024, at 2:58 AM, Pramod Batni <pramod.batni@gmail.com> wrote:
> >
> > Since the panics are seen when tmpfs is in use, is there a possibility
> that there is not much physical memory (RAM) available in the system?
>
> Nope. It's got 64GB RAM.
Ok.
Sorry, I meant whether, at the time of the panic, physical memory was
low and that led to the failure. Probably not, but I just wanted to check.
BTW, what is the panic message?
Your first email on this thread mentions a panic due to PCI error
panic message: pcieb-2: PCI(-X) Express Fatal Error. (0x43)
Is that resolved?
What is the panic message of the subsequent (latest) crashes?
The stack of the "/usr/lib/pci/pcieadm save-cfgspace -a
/tmp/pcieadm-priv.41882" process shows a call to pci_cfgacc_mmio, which
causes a page to be mapped in the HAT via the hati_pte_map() function,
which subsequently leads to the x-calls.
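For context on why a config-space read touches the HAT at all: MMIO ("ECAM") config access turns the bus/device/function and register offset into an offset within the ECAM region, and mapping the resulting physical page into the kernel is what drags in hati_pte_map() and the TLB-invalidation x-calls. The address math is fixed by the PCIe spec (sketch only):

```python
def ecam_offset(bus, dev, fn, reg):
    """Byte offset of a config register within the ECAM region (PCIe spec layout)."""
    return (bus << 20) | (dev << 15) | (fn << 12) | (reg & 0xfff)

# 57/0/0 offset 0x50, as in the pcitool_reg_t dumps above:
off = ecam_offset(0x57, 0, 0, 0x50)  # 0x5700050 within the ECAM region
```

As an aside: if this platform's ECAM base were 0xc0000000, bus 0x57's config pages would sit at physical 0xc5700000, which would line up with the 80000000c5700473 PTE value visible in the x86pte_set() frames above. That base address is an assumption for illustration, not something read from the dump.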
>
> > Is there a system into which I can
> > login and look at these crash dumps?
> > (If so, can you please share the system
> > access details)
>
> Remote login, I can't. But I *can* share a vmdump.N file if you wish for
> download. The smallest one I have is 556MiB, which will expand to 3-ish
> GiB after `savecore -f`. Best paired with this week's SmartOS release.
>
Ok. Thanks. I shall try to setup a vm running the latest SmartOS release to
analyse the crash files.
Pramod
>
>
>
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-03-08 2:35 ` Pramod Batni
@ 2024-03-08 4:40 ` Dan McDonald
2024-03-11 6:38 ` Pramod Batni
0 siblings, 1 reply; 14+ messages in thread
From: Dan McDonald @ 2024-03-08 4:40 UTC (permalink / raw)
To: illumos-developer
On Mar 7, 2024, at 9:35 PM, Pramod Batni <pramod.batni@gmail.com> wrote:
> BTW, what is the panic message?
> Your first email on this thread mentions a panic due to PCI error
> panic message: pcieb-2: PCI(-X) Express Fatal Error. (0x43)
>
> Is that resolved?
I think? I uttered `fmadm repair` to it and FMA stopped complaining. But that's all, I can reproduce it w/o involving /tmp...
> What is the panic message of the subsequent (latest) crashes?
Same as the one above.
> The stack of the "/usr/lib/pci/pcieadm save-cfgspace -a /tmp/pcieadm-priv.41882" process shows a call to pci_cfgacc_mmio, which
> causes a page to be mapped in the HAT via the hati_pte_map() function, which subsequently leads to the x-calls.
I uttered this (show the cfgspace for the known-iffy component):
/usr/lib/pci/pcieadm show-cfgspace -d 57/0/0
and it printed this:
Device 57/0/0 -- Type 0 Header
Vendor ID: 0x17a0 -- Genesys Logic, Inc
Device ID: 0x9755 -- GL9755 SD Host Controller
Command: 0x46
|--> I/O Space: disabled (0x0)
|--> Memory Space: enabled (0x2)
|--> Bus Master: enabled (0x4)
|--> Special Cycle: disabled (0x0)
|--> Memory Write and Invalidate: disabled (0x0)
|--> VGA Palette Snoop: disabled (0x0)
|--> Parity Error Response: enabled (0x40)
|--> IDSEL Stepping/Wait Cycle Control: unsupported (0x0)
|--> SERR# Enable: disabled (0x0)
|--> Fast Back-to-Back Transactions: disabled (0x0)
|--> Interrupt X: enabled (0x0)
Status: 0x10
|--> Immediate Readiness: unsupported (0x0)
|--> Interrupt Status: not pending (0x0)
|--> Capabilities List: supported (0x10)
|--> 66 MHz Capable: unsupported (0x0)
|--> Fast Back-to-Back Capable: unsupported (0x0)
|--> Master Data Parity Error: no error (0x0)
|--> DEVSEL# Timing: fast (0x0)
|--> Signaled Target Abort: no (0x0)
|--> Received Target Abort: no (0x0)
|--> Received Master Abort: no (0x0)
|--> Signaled System Error: no (0x0)
|--> Detected Parity Error: no (0x0)
Revision ID: 0x0
Class Code: 0x80501
|--> Class Code: 0x8
|--> Sub-Class Code: 0xa
|--> Programming Interface: 0x1
Cache Line Size: 0x40 bytes
Latency Timer: 0x0 cycles
Header Type: 0x0
|--> Header Layout: Device (0x0)
|--> Multi-Function Device: no (0x0)
BIST: 0x0
|--> Completion Code: 0x0
|--> Start BIST: 0x0
|--> BIST Capable: unsupported (0x0)
Base Address Register 0
|--> Space: Memory Space (0x0)
|--> Address: 0x6a400000
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 1
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 2
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 3
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 4
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Base Address Register 5
|--> Space: Memory Space (0x0)
|--> Address: 0x0
|--> Memory Type: 32-bit (0x0)
|--> Prefetchable: no (0x0)
Cardbus CIS Pointer: 0x0
Subsystem Vendor ID: 0x8086 -- Intel Corporation
Subsystem Device ID: 0x3004
Expansion ROM: 0x0
|--> Enable: disabled (0x0)
|--> Validation Status: not supported (0x0)
|--> Validation Details: 0x0
|--> Base Address: 0x0
Capabilities Pointer: 0x80
Interrupt Line: 0xff
Interrupt Pin: 0x1 -- INTA
Min_Gnt: 0x0
Min_Lat: 0x0
But then the kernel panicked again, so I have a fourth dump (vmdump.3).
Here's some goodies from mdb on *.3:
> ::status
debugging crash dump vmcore.3 (64-bit) from nuc
operating system: 5.11 joyent_20240307T000552Z (i86pc)
git branch: release-20240307
git rev: ca5416a0f74ef8b8c072cb3b0e3f30ec5feaa30d
image uuid: (not set)
panic message: pcieb-2: PCI(-X) Express Fatal Error. (0x43)
dump content: kernel pages only
> 0xfffffe58dfb4fba0::findstack -v
stack pointer for thread fffffe58dfb4fba0 (pcieadm/1): fffffe007a0424e0
fffffe007a042510 apix_setspl+0x22(fffffffff38bd231)
fffffe007a042550 do_splx+0x84(b)
fffffe007a0425e0 xc_common+0x249(fffffffffb831950, fffffe58b99cb3f8, fffffe007a042648, 0, fffffe007a042680, 2)
fffffe007a042630 xc_call+0x3d(fffffe58b99cb3f8, fffffe007a042648, 0, fffffe007a042680, fffffffffb831950)
fffffe007a0426f0 hat_tlb_inval_range+0x28d(ff17a2, fffffe007a042708)
fffffe007a042740 x86pte_mapin+0x42(ff17a2, 14c, fffffe58b9bfa5c0)
fffffe007a042760 x86pte_release_pagetable+0x1d(fffffe58b9bfa5c0)
fffffe007a0427e0 x86pte_set+0x108(fffffe58b9bfa5c0, 14c, 80000000c5700473, 0)
fffffe007a042880 hati_pte_map+0x449(f1, f, b, fffffe007a042910, fffffe58cba21000, fffffe58b9bfa5c0)
fffffe007a042910 0xb()
fffffe007a042a20 pci_cfgacc_mmio+0xba(fffffe007a042a78)
fffffe007a042a50 pci_cfgacc_acc+0x69(fffffe007a042a78)
fffffe007a042ae0 pcitool_cfg_access+0x180(fffffe007a042b10, 0, 0)
fffffe007a042bc0 pcitool_dev_reg_ops+0x212(fffffe58b9b84688, fffffc7fffdfe7b0, 50435401, 202001)
fffffe007a042c30 pci_common_ioctl+0xc5(fffffe58b9b84688, 7d000000fd, 50435401, fffffc7fffdfe7b0, 202001, fffffe58c8965af8, fffffe007a042e18)
fffffe007a042cc0 npe_ioctl+0xa5(7d000000fd, 50435401, fffffc7fffdfe7b0, 202001, fffffe58c8965af8, fffffe007a042e18)
fffffe007a042d00 cdev_ioctl+0x3f(7d000000fd, 50435401, fffffc7fffdfe7b0, 202001, fffffe58c8965af8, fffffe007a042e18)
fffffe007a042d50 spec_ioctl+0x55(fffffe58e443f340, 50435401, fffffc7fffdfe7b0, 202001, fffffe58c8965af8, fffffe007a042e18, 0)
fffffe007a042de0 fop_ioctl+0x40(fffffe58e443f340, 50435401, fffffc7fffdfe7b0, 202001, fffffe58c8965af8, fffffe007a042e18, 0)
fffffe007a042f00 ioctl+0x144(3, 50435401, fffffc7fffdfe7b0)
fffffe007a042f10 sys_syscall+0x1a8()
> $C
fffffe0079d01ae0 vpanic()
fffffe0079d01b80 pcieb_intr_handler+0x2aa(fffffe58c86633c0, 0)
fffffe0079d01bd0 apix_dispatch_by_vector+0x8c(20)
fffffe0079d01c00 apix_dispatch_lowlevel+0x29(20, 0)
fffffe0079caaa50 switch_sp_and_call+0x15()
fffffe0079caaab0 apix_do_interrupt+0xf3(fffffe0079caaac0, 0)
fffffe0079caaac0 _interrupt+0xc3()
fffffe0079caabb0 i86_mwait+0x12()
fffffe0079caabe0 cpu_idle_mwait+0x14b()
fffffe0079caac00 idle+0xa8()
fffffe0079caac10 thread_start+0xb()
>
> Ok. Thanks. I shall try to setup a vm running the latest SmartOS release to analyse the crash files.
Say the word and I'll put the dumps up on kebe.com somewhere.
Thanks,
Dan
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-03-08 4:40 ` Dan McDonald
@ 2024-03-11 6:38 ` Pramod Batni
2024-03-13 17:21 ` Dan McDonald
0 siblings, 1 reply; 14+ messages in thread
From: Pramod Batni @ 2024-03-11 6:38 UTC (permalink / raw)
To: illumos-developer
>
> > Ok. Thanks. I shall try to setup a vm running the latest SmartOS release
> to analyse the crash files.
>
> Say the word and I'll put the dumps up on kebe.com
> somewhere.
>
Please. :-)
>
> Thanks,
> Dan
>
>
> ------------------------------------------
> illumos: illumos-developer
> Permalink:
> https://illumos.topicbox.com/groups/developer/T32516b4f237e0abc-Ma74ff7ee245d231d7a6fa379
> Delivery options:
> https://illumos.topicbox.com/groups/developer/subscription
>
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-03-11 6:38 ` Pramod Batni
@ 2024-03-13 17:21 ` Dan McDonald
2024-03-29 13:40 ` Pramod Batni
0 siblings, 1 reply; 14+ messages in thread
From: Dan McDonald @ 2024-03-13 17:21 UTC (permalink / raw)
To: illumos-developer
Sorry for the latency on this. Exchange or Mail.app put your last email in junk.
https://kebe.com/~danmcd/webrevs/pcie-vmdump.3
MD5 == 1036261a0c22e5f213e07e4b317b9001
Dan
* Re: [developer] Intel NUC (Gen 11) panic during pcieadm illumos testing
2024-03-13 17:21 ` Dan McDonald
@ 2024-03-29 13:40 ` Pramod Batni
0 siblings, 0 replies; 14+ messages in thread
From: Pramod Batni @ 2024-03-29 13:40 UTC (permalink / raw)
To: illumos-developer
I had a look at the crash dump file.
Looking into the invocation of the pcieadm command whose thread initiated
the panic in the dump
> ::ps ! grep pciead
R 7390 7353 7390 7353 0 0x4a004000 fffffe58cc671218 pcieadm
> fffffe58cc671218::print proc_t ! grep psar
u_psargs = [ "/usr/lib/pci/pcieadm show-cfgspace -d 57/0/0" ]
>
>
> fffffe58cc671218::pfiles
FD TYPE VNODE INFO
0 CHR fffffe58fc5aca80 /dev/pts/2
1 CHR fffffe58fc5aca80 /dev/pts/2
2 CHR fffffe58fc5aca80 /dev/pts/2
3 CHR fffffe58e443f340 /devices/pci@0,0:reg
>
Neither the arguments to the "pcieadm" command nor the fd's associated
with the pcieadm process have any reference to /tmp.
As you mentioned in one of the previous emails (about encountering the
panic w/o /tmp), this panic did not involve /tmp.
--
> Your first email on this thread mentions a panic due to PCI error
> panic message: pcieb-2: PCI(-X) Express Fatal Error. (0x43)
>
> Is that resolved?
I think? I uttered `fmadm repair` to it and FMA stopped complaining. But
that's all, I can reproduce it w/o involving
/tmp...
---
I think the panics triggered by pcieb_intr_handler are PCIe-related,
not something specific to tmpfs.
Pramod
On Wed, Mar 13, 2024 at 10:55 PM Dan McDonald <danmcd@mnx.io> wrote:
> Sorry for the latency on this. Exchange or Mail.app put your last email
> in junk.
>
> https://kebe.com/~danmcd/webrevs/pcie-vmdump.3
>
> MD5 == 1036261a0c22e5f213e07e4b317b9001
>
> Dan
>
>
> ------------------------------------------
> illumos: illumos-developer
> Permalink:
> https://illumos.topicbox.com/groups/developer/T32516b4f237e0abc-Mc84ee0001adef12d744a0dce
> Delivery options:
> https://illumos.topicbox.com/groups/developer/subscription
>
end of thread, other threads:[~2024-03-29 13:40 UTC | newest]
Thread overview: 14+ messages
2024-02-22 16:08 Intel NUC (Gen 11) panic during pcieadm illumos testing Dan McDonald
2024-02-22 18:16 ` [developer] " Dan McDonald
2024-02-22 18:47 ` Robert Mustacchi
2024-02-22 18:58 ` Dan McDonald
2024-02-22 19:17 ` Robert Mustacchi
2024-02-22 19:38 ` Dan McDonald
2024-03-07 7:25 ` Dan McDonald
2024-03-07 7:58 ` Pramod Batni
2024-03-07 16:04 ` Dan McDonald
2024-03-08 2:35 ` Pramod Batni
2024-03-08 4:40 ` Dan McDonald
2024-03-11 6:38 ` Pramod Batni
2024-03-13 17:21 ` Dan McDonald
2024-03-29 13:40 ` Pramod Batni