Re: [PATCH RFT RFC] usb: xhci: Kill hosts with HCE or HSE on command timeout
From: Michal Pecio
Date: Mon May 18 2026 - 02:36:41 EST
On Mon, 4 May 2026 09:31:18 +0200, Michal Pecio wrote:
> Never mind, here's the smoking gun:
>
> [...]
> [Fri May 1 09:46:41 2026] xhci_hcd 0000:80:14.0: // Turn on HC, cmd = 0x5.
> [Fri May 1 09:46:41 2026] DMAR: DRHD: handling fault status reg 2
> [Fri May 1 09:46:41 2026] DMAR: [DMA Read NO_PASID] Request device
> [80:14.0] fault addr 0x1001680000 [fault reason 0x39] SM: Present bit
> in Root Entry is clear
>
> The chip IOMMU faults shortly after setting USBCMD.RUN = 1.
> Such fault is expected to cause HSE assertion and usually it does.
> You will probably find that HSE is already set while Enable Slot
> is being queued, even if it was clear in xhci_gen_setup().
>
> 1001680000 is close to valid addresses like 100167e000 or 100167c000.
>
> Possible causes:
> - xHCI or IOMMU driver bug
> - HW corrupted a pointer
> - HW accessed something out of bounds
> - HW dereferenced a stale pointer from the original kernel
>
> Do you happen to have more of those logs saved, are they all like that?
> Any chance that 1001680000 appears somewhere in the main kernel's log?
Hi again,
I see a certain lack of interest in finding the root cause of this.
I have done a simple test on my own HW: writing a bogus CRCR to cause
an IOMMU fault when the first command is submitted. I found that not
all HCs reliably set HSE in this case, but obviously none of them ever
completes the command properly.
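To illustrate why the status bits alone are not enough, here is a minimal
userspace sketch, not driver code. The bit positions of HSE and HCE in
USBSTS are taken from the xHCI specification; the helper names and the
trust_status_only switch are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* USBSTS bit positions as defined by the xHCI specification. */
#define USBSTS_HSE (1u << 2)   /* Host System Error */
#define USBSTS_HCE (1u << 12)  /* Host Controller Error */

/* Hypothetical helper: does the status register report a fatal error? */
static bool usbsts_reports_fatal(uint32_t usbsts)
{
	return (usbsts & (USBSTS_HSE | USBSTS_HCE)) != 0;
}

/*
 * Hypothetical helper: decide whether to give up on the host after a
 * command timeout. If only the status bits are trusted, a controller
 * that faulted without setting HSE is missed. Otherwise the timeout
 * itself is treated as sufficient evidence.
 */
static bool give_up_after_timeout(uint32_t usbsts, bool trust_status_only)
{
	if (usbsts_reports_fatal(usbsts))
		return true;
	return !trust_status_only;
}
```

With USBSTS reading 0 after the timeout, give_up_after_timeout(0, true)
keeps the host alive even though the command will never complete, while
give_up_after_timeout(0, false) does not.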
It seems that an unconditional hc_died() on Enable Slot timeout may not
be a bad idea. Makes me wonder if the same shouldn't apply to all
commands besides Address Device; they typically only time out due to HW
issues.
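The policy being considered could be sketched roughly as follows. This is
a standalone model, not a patch: the command TRB type values come from the
xHCI specification, while the enum and function names are made up for
illustration. Address Device is exempted presumably because it involves a
SET_ADDRESS request to the device, so a misbehaving device can make it
time out on a healthy host.

```c
#include <stdbool.h>

/* Command TRB type IDs as defined by the xHCI specification. */
enum cmd_trb_type {
	CMD_ENABLE_SLOT        = 9,
	CMD_DISABLE_SLOT       = 10,
	CMD_ADDRESS_DEVICE     = 11,
	CMD_CONFIGURE_ENDPOINT = 12,
	CMD_EVALUATE_CONTEXT   = 13,
	CMD_RESET_ENDPOINT     = 14,
	CMD_STOP_ENDPOINT      = 15,
	CMD_SET_TR_DEQUEUE     = 16,
	CMD_RESET_DEVICE       = 17,
};

/*
 * Hypothetical policy: a timeout of any command other than Address
 * Device is treated as a sign of broken host hardware, so the host
 * would be declared dead without looking at HSE or HCE.
 */
static bool timeout_kills_host(enum cmd_trb_type type)
{
	return type != CMD_ADDRESS_DEVICE;
}
```

Under this sketch an Enable Slot timeout kills the host, and an Address
Device timeout is left to the existing handling.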
Regards,
Michal