EXTERNAL EMAIL
On Fri, May 02, 2025 at 11:49:07PM +0800, Hans Zhang wrote:
On 2025/5/2 23:00, Bjorn Helgaas wrote:
On Fri, May 02, 2025 at 11:20:51AM +0800, hans.zhang@xxxxxxxxxxx wrote:
From: Hans Zhang <hans.zhang@xxxxxxxxxxx>
When PCIe ASPM L1 is enabled (CONFIG_PCIEASPM_POWERSAVE=y), certain
CONFIG_PCIEASPM_POWERSAVE=y only sets the default. L1 can be enabled
dynamically regardless of the config.
Dear Bjorn,
Thank you very much for your reply.
Yes. To reduce the power consumption of the SOC system, we have enabled ASPM
L1 by default.
NVMe controllers fail to release LPI MSI-X interrupts during system
suspend, leading to a system hang. This occurs because the driver's
existing power management path does not fully disable the device
when ASPM is active.
I have no idea what this has to do with ASPM L1. I do see that
nvme_suspend() tests pcie_aspm_enabled(pdev) (which seems kind of
janky and racy). But this doesn't explain anything about what would
cause a system hang.
[ 92.411265] [pid:322,cpu11,kworker/u24:6]nvme 0000:91:00.0: PM: calling
pci_pm_suspend_noirq+0x0/0x2c0 @ 322, parent: 0000:90:00.0
[ 92.423028] [pid:322,cpu11,kworker/u24:6]nvme 0000:91:00.0: PM:
pci_pm_suspend_noirq+0x0/0x2c0 returned 0 after 1 usecs
[ 92.433894] [pid:324,cpu10,kworker/u24:7]pcieport 0000:90:00.0: PM:
calling pci_pm_suspend_noirq+0x0/0x2c0 @ 324, parent: pci0000:90
[ 92.445880] [pid:324,cpu10,kworker/u24:7]pcieport 0000:90:00.0: PM:
pci_pm_suspend_noirq+0x0/0x2c0 returned 0 after 39 usecs
[ 92.457227] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: PM: calling
sky1_pcie_suspend_noirq+0x0/0x174 @ 916, parent: soc@0
[ 92.479315] [pid:916,cpu7,bash]cix-pcie-phy a080000.pcie_phy:
pcie_phy_common_exit end
[ 92.487389] [pid:916,cpu7,bash]sky1-pcie a070000.pcie:
sky1_pcie_suspend_noirq
[ 92.494604] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: PM:
sky1_pcie_suspend_noirq+0x0/0x174 returned 0 after 26379 usecs
[ 92.505619] [pid:916,cpu7,bash]sky1-audss-clk
7110000.system-controller:clock-controller: PM: calling
genpd_suspend_noirq+0x0/0x80 @ 916, parent: 7110000.system-controller
[ 92.520919] [pid:916,cpu7,bash]sky1-audss-clk
7110000.system-controller:clock-controller: PM: genpd_suspend_noirq+0x0/0x80
returned 0 after 1 usecs
[ 92.534214] [pid:916,cpu7,bash]Disabling non-boot CPUs ...
Hans: Before I added the debugging printk, it hung here.
The log output above was captured after I added the debugging printk.
Sky1 SOC Root Port driver's suspend function: sky1_pcie_suspend_noirq
Our hardware is in STR (suspend to RAM), and the controller and PHY will lose
power.
So in sky1_pcie_suspend_noirq, the AXI and APB clocks, etc. of the PCIe
controller will be turned off. In sky1_pcie_resume_noirq, the PCIe
controller and PHY will be reinitialized. If suspend does not disable the AXI
and APB clocks, and resume enables them again, the enable reference counts
that the kernel clock API keeps for these clocks will accumulate
continuously.
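
To illustrate the reference-count concern, here is a minimal user-space model.
It is only a sketch and not the sky1 driver: struct model_clk,
model_clk_enable(), model_clk_disable() and simulate_cycles() are hypothetical
names standing in for a clock with an enable count and for the probe,
suspend_noirq and resume_noirq paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a clock that tracks an enable reference count. */
struct model_clk {
	int enable_count;
};

static void model_clk_enable(struct model_clk *clk)
{
	clk->enable_count++;
}

static void model_clk_disable(struct model_clk *clk)
{
	/* Disabling a clock that is not enabled would be an imbalance. */
	assert(clk->enable_count > 0);
	clk->enable_count--;
}

/*
 * Model one probe followed by `cycles` suspend/resume cycles and return
 * the final enable count. Resume always re-enables the clock; suspend
 * disables it only when `disable_on_suspend` is true.
 */
int simulate_cycles(int cycles, bool disable_on_suspend)
{
	struct model_clk axi = { .enable_count = 0 };
	int i;

	model_clk_enable(&axi);			/* probe */

	for (i = 0; i < cycles; i++) {
		if (disable_on_suspend)
			model_clk_disable(&axi);	/* suspend_noirq */
		model_clk_enable(&axi);		/* resume_noirq */
	}

	return axi.enable_count;
}
```

Under these assumptions, simulate_cycles(3, true) returns 1, while
simulate_cycles(3, false) returns 4: the count grows by one per cycle when
suspend skips the disable.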
So this is the actual issue (controller losing power during system suspend) and
everything else (ASPM, MSI-X write) are all side effects of it.
Yes, this issue is common across several vendors and we need to come up with
a generic solution instead of hacking up the client drivers. I'm planning to
work on it in the coming days. I will keep you in the loop.