Re: [PATCH RFT RFC] usb: xhci: Kill hosts with HCE or HSE on command timeout

From: Michal Pecio

Date: Mon May 18 2026 - 02:36:41 EST

On Mon, 4 May 2026 09:31:18 +0200, Michal Pecio wrote:
> Never mind, here's the smoking gun:
>
> [...]
> [Fri May 1 09:46:41 2026] xhci_hcd 0000:80:14.0: // Turn on HC, cmd = 0x5.
> [Fri May 1 09:46:41 2026] DMAR: DRHD: handling fault status reg 2
> [Fri May 1 09:46:41 2026] DMAR: [DMA Read NO_PASID] Request device
> [80:14.0] fault addr 0x1001680000 [fault reason 0x39] SM: Present bit
> in Root Entry is clear
>
> The chip IOMMU faults shortly after setting USBCMD.RUN = 1.
> Such fault is expected to cause HSE assertion and usually it does.
> You will probably find that HSE is already set while Enable Slot
> is being queued, even if it was clear in xhci_gen_setup().
>
> 1001680000 is close to valid addresses like 100167e000 or 100167c000.
>
> Possible causes:
> - xHCI or IOMMU driver bug
> - HW corrupted a pointer
> - HW accessed something out of bounds
> - HW dereferenced a stale pointer from the original kernel
>
> Do you happen to have more of those logs saved, are they all like that?
> Any chance that 1001680000 appears somewhere in the main kernel's log?

Hi again,

I see a certain lack of interest in finding the root cause of this.

I ran a simple test on my own HW: writing a bogus CRCR to cause an
IOMMU fault when the first command is submitted. I found that not all
HCs reliably set HSE in this case, but obviously none of them ever
completes the command properly.

It seems that an unconditional hc_died() on Enable Slot timeout may not
be a bad idea. It makes me wonder whether the same should apply to all
commands besides Address Device, since they typically only time out due
to HW issues.
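
To make the options concrete, here is a rough userspace sketch of the
decision being discussed. It is not the patch and not driver code: the
enum and function names are made up for illustration. Only the USBSTS
bit positions (HSE = bit 2, HCE = bit 12) and the TRB type numbers
(Enable Slot = 9, Address Device = 11) are taken from the xHCI spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* USBSTS bits, per the xHCI spec. */
#define STS_HSE (1u << 2)   /* Host System Error */
#define STS_HCE (1u << 12)  /* Host Controller Error */

/* Command TRB types, per the xHCI spec (only the two named here). */
enum cmd_type {
	CMD_ENABLE_SLOT    = 9,
	CMD_ADDRESS_DEVICE = 11,
};

/* Hypothetical policy knob, for illustration only. */
enum timeout_policy {
	POLICY_HSE_HCE_ONLY,    /* kill only if HSE or HCE is set */
	POLICY_ENABLE_SLOT,     /* also kill on any Enable Slot timeout */
	POLICY_ALL_BUT_ADDRESS, /* also kill on any timeout except Address Device */
};

/*
 * Decide whether a command timeout should declare the host dead.
 * HSE/HCE always kills; beyond that the policy decides, because not
 * every HC reliably sets HSE after an IOMMU fault.
 */
static bool kill_host_on_timeout(enum timeout_policy policy,
				 unsigned int cmd, uint32_t usbsts)
{
	if (usbsts & (STS_HSE | STS_HCE))
		return true;

	switch (policy) {
	case POLICY_ENABLE_SLOT:
		return cmd == CMD_ENABLE_SLOT;
	case POLICY_ALL_BUT_ADDRESS:
		return cmd != CMD_ADDRESS_DEVICE;
	default:
		return false;
	}
}
```

With POLICY_ENABLE_SLOT, a timed-out Enable Slot kills the host even
when USBSTS reads 0, which covers the HCs that fault without raising
HSE. POLICY_ALL_BUT_ADDRESS is the broader variant I am wondering about.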

Regards,
Michal