Re: [PATCH v3 3/3] PCI: dwc: Enable MSI affinity support

From: Brian Norris

Date: Fri May 29 2026 - 20:35:52 EST

Hi Radu,

On Mon, May 25, 2026 at 12:48:09PM -0400, Radu Rendec wrote:
> On Fri, 2026-05-22 at 17:07 -0700, Brian Norris wrote:
> > (Updating Radu's email; dropping another bouncing email)
>
> Thanks for doing that! Obviously, I no longer have access to the email
> address that was used to post the patch, and I was lazy in setting up
> scripts that follow the mailing lists to catch messages that are
> addressed to me directly but using old email addresses.

No worries. The email bounce notified me rather quickly, and in the end,
you saw this within a few hours of the first report anyway :)

> > On Fri, May 22, 2026 at 01:27:43PM -0700, Brian Norris wrote:
> > > I'll see if I can learn anything more here on my own, but I figured I'd
> > > report it in case you have any thoughts or leads I should investigate.
>
> Thanks for reporting it! I do not have any thoughts or leads yet, but I
> do plan to look at it during the next few days and hopefully come up
> with something. I also apologize for the slowness in my replies.

I'm only getting occasional time to spend on this too, so I'm a bit slow
as well.

> > In an hour or two of poking, all I've learned so far is that the problem
> > also seems to go away if I:
> >
> > (a) add a few dump_stack() and other noisy logs to a few key places (for
> >     now, __pci_write_msi_msg(), pci_power_up() failures, and
> >     irq_chip_redirect_set_affinity() -- I think __pci_write_msi_msg()
> >     was the most significant, possibly because it produced the most log
> >     text) and
> >
> > (b) leave a 115200 baud UART kernel console running.
> >
> > (This is on a sample size of 20+ suspend cycles, whereas previous
> > bisection would fail 100%.)
> >
> > It then reappears when I quiet the kernel logging a bit with `dmesg -n3`.
> >
> > I think that simply tells me that there's some timing issue or race
> > condition involved.
>
> That's very useful! Interrupts are migrated on suspend to the main CPU
> and then migrated back on resume, and the ordering and synchronization
> around that is tricky. The stack trace in your previous message tells
> me that the nvme driver is waiting for IO completion, which is normally
> signaled by an interrupt, except that interrupt never arrives.

That's true. But the first failure is:

nvme 0001:01:00.0: Unable to change power state from unknown to D0, device inaccessible

That means the PCI config read of the PM status register is returning
all-0xff, so we're not really able to guarantee the NVMe PCIe device has
powered back up at all. In my experience, that's indicative that the
PCIe link has failed in some way, or the root complex is otherwise
misbehaving. If the link is not functional, we won't receive any NVMe
MSIs.
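
For anyone who wants to poke at the same register from userspace, here is a
rough sketch of the decode. pmcsr_state is a made-up helper name, not an
existing tool; it only assumes the PM control/status register (PMCSR) is a
16-bit value whose low two bits encode the power state, and that an all-ones
read means the config access itself failed.

```shell
#!/bin/sh
# Hypothetical helper: classify a raw 16-bit PMCSR value given as a number
# (e.g. 0x0008). Prints the decoded state; returns non-zero when the value is
# all-ones, which is what a read from an unreachable device looks like.
pmcsr_state() {
    val=$(( $1 & 0xffff ))
    if [ "$val" -eq $(( 0xffff )) ]; then
        echo "inaccessible"
        return 1
    fi
    # PMCSR bits 1:0 hold the current power state.
    case $(( val & 0x3 )) in
        0) echo "D0" ;;
        1) echo "D1" ;;
        2) echo "D2" ;;
        3) echo "D3hot" ;;
    esac
}
```

Something like `pmcsr_state "0x$(setpci -s 0001:01:00.0 CAP_PM+4.w)"` should
feed it the live value, assuming setpci from pciutils is available on the
system under test.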

I still can't explain what about this patch is causing a PCIe link
failure though.

> With my patch included, the demultiplexed interrupt (the nvme interrupt
> in this case) has an opportunity to be migrated during suspend/resume,
> whereas previously it did not. That's one more moving part, and I'll
> have to look closer at the code and think what could go wrong. I agree
> it's likely a race condition or a timing issue because it works with
> that extra logging, which adds small delays as a side effect.

I also tried hot-unplugging all non-boot CPUs before suspending the
system:

for i in /sys/devices/system/cpu/cpu{1,2,3,4,5,6,7}/online; do echo 0 > "$i"; done
echo +10 > /sys/class/rtc/rtc0/wakealarm
echo mem > /sys/power/state

I believe that means all the affinity/migration will occur while the
system is fully online, so we're less likely to run into power
management race conditions. In this test, NVMe is still functional after
the first step (CPU offline), but it still fails after suspend-to-mem.
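
The CPU-offline step can also be made self-checking, so a silently rejected
write doesn't invalidate the test. This is only a sketch:
offline_nonboot_cpus is a name I made up, and the optional directory argument
exists purely so the loop can be dry-run against a scratch tree instead of
the real sysfs.

```shell
#!/bin/sh
# Sketch: offline every CPU except cpu0, then read each "online" file back
# to confirm the write took effect. $1 optionally overrides the sysfs CPU
# directory (an assumption for dry runs; defaults to the real path).
offline_nonboot_cpus() {
    root=${1:-/sys/devices/system/cpu}
    rc=0
    for f in "$root"/cpu[0-9]*/online; do
        # Skip the unexpanded pattern if nothing matched.
        [ -e "$f" ] || continue
        case $f in
            "$root"/cpu0/online) continue ;;   # leave the boot CPU alone
        esac
        echo 0 > "$f" || rc=1
        if [ "$(cat "$f")" != "0" ]; then
            echo "still online: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Run as root with no arguments before arming the wake alarm; a non-zero
return means at least one CPU refused to go offline.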

This seems to tell me the irq_set_affinity()/migration process isn't
really what's killing things, but something else.

Brian