Re: [PATCH] net: wwan: t7xx: fix race between TX thread and system PM suspend

From: Paolo Abeni

Date: Thu May 21 2026 - 06:57:28 EST


On 5/18/26 9:50 AM, Tim JH Chen wrote:
> When system suspend is triggered while the DPMAIF TX kthread
> (t7xx_dpmaif_tx_hw_push_thread) is running, a deadlock can occur
> leading to a CPU soft lockup.
>
> The root cause is two-fold:
>
> 1. t7xx_dpmaif_suspend() calls t7xx_dpmaif_tx_stop() which only stops
> the TX work-queue items (by clearing txq->que_started and waiting on
> txq->tx_processing). It does NOT signal the kthread and does NOT
> update dpmaif_ctrl->state, which stays DPMAIF_STATE_PWRON.
>
> 2. The kthread's state guard (line: "if ... state != DPMAIF_STATE_PWRON")
> is only checked at the top of each loop iteration. If the thread
> already passed this guard, it proceeds unconditionally to call
> pm_runtime_resume_and_get() — which tries to acquire the PM spinlock
> also held (or contended) by the system PM suspend path.
>
> The result is a spinlock deadlock observed as:
>
> watchdog: BUG: soft lockup - CPU#N stuck for 26s! [dpmaif_tx_hw_pu]
> RIP: _raw_spin_unlock_irqrestore
> Call Trace:
> __pm_runtime_resume+0x5b/0x80
> t7xx_dpmaif_tx_hw_push_thread+0xc4 [mtk_t7xx]
>
> The condition requires ASPM L1 enabled on the endpoint (which extends
> the time pm_runtime_resume_and_get() holds the PM lock during L1.2
> link retraining) and hundreds of repeated suspend/resume cycles to
> trigger reliably.
>
> Fix by three coordinated changes:
>
> - In t7xx_dpmaif_suspend(): immediately set state to DPMAIF_STATE_PWROFF
> after stopping the TX queue, then call wake_up() so any sleeping thread
> re-evaluates the wait_event condition and stops.
>
> - In t7xx_dpmaif_resume(): restore state to DPMAIF_STATE_PWRON before
> re-enabling the TX queues, symmetric with the suspend change.
> Without this the kthread would never wake up after resume.
>
> - In t7xx_dpmaif_tx_hw_push_thread(): add a second state check
> immediately before pm_runtime_resume_and_get() to close the TOCTOU
> window between the wait_event guard and the pm call.
>
> Tested: no soft lockup observed over 500+ suspend/resume cycles with
> SIM registered and ASPM L1 enabled (previously triggered in < 300).
>
> Signed-off-by: Tim JH Chen <tim.jh.chen@xxxxxxxxxx>

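For reference, the second state check described in the last bullet above
can be modelled in user space roughly as below. This is only a sketch
under stated assumptions, not the driver code: tx_push_once(),
model_pm_get(), model_suspend() and the race_hook parameter are
hypothetical names, and an atomic integer stands in for
dpmaif_ctrl->state. The hook simulates suspend running in the window
between the loop guard and the PM call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum dpmaif_state { DPMAIF_STATE_PWROFF, DPMAIF_STATE_PWRON };

/* Stand-in for dpmaif_ctrl->state (assumption: a plain atomic int). */
static atomic_int dpmaif_state = DPMAIF_STATE_PWRON;
static int pm_get_calls;

/* Hypothetical stand-in for pm_runtime_resume_and_get(). */
static int model_pm_get(void)
{
	/* In this model the PM call must never run while powered off. */
	assert(atomic_load(&dpmaif_state) == DPMAIF_STATE_PWRON);
	pm_get_calls++;
	return 0;
}

/* Models the suspend path marking the device as powered off. */
static void model_suspend(void)
{
	atomic_store(&dpmaif_state, DPMAIF_STATE_PWROFF);
}

/* One iteration of the TX loop body; race_hook may be NULL. */
static bool tx_push_once(void (*race_hook)(void))
{
	/* First check: the guard at the top of the loop. */
	if (atomic_load(&dpmaif_state) != DPMAIF_STATE_PWRON)
		return false;

	/* Simulated window in which suspend can change the state. */
	if (race_hook)
		race_hook();

	/* Second check: immediately before the PM call. */
	if (atomic_load(&dpmaif_state) != DPMAIF_STATE_PWRON)
		return false;

	return model_pm_get() == 0;
}
```

Calling tx_push_once(NULL) reaches the PM call, while
tx_push_once(model_suspend) backs out at the second check.
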
This is a fix, so it should target the 'net' tree, with that tag included
in the subject prefix, and it should carry a 'Fixes:' tag.

Also, this is v2 of:

https://lore.kernel.org/netdev/TYZPR02MB5232A8C6A2BA56226D97CF4A90062@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

The subject prefix should have included the relevant revision number, and
you should have described what changed since the previous revision after
a '---' separator.
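
As a rough illustration only, the next posting could be laid out along
these lines. Everything in angle brackets is a placeholder to be filled
in by the submitter, and the v3 number assumes the next posting directly
follows this one:

    Subject: [PATCH net v3] net: wwan: t7xx: fix race between TX thread and system PM suspend

    <commit message>

    Fixes: <12+ character commit id> ("<subject of the commit being fixed>")
    Signed-off-by: Tim JH Chen <tim.jh.chen@xxxxxxxxxx>
    ---
    v2 -> v3:
     - <what changed>
    v1 -> v2:
     - <what changed>

    <diffstat and diff>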

Please read the process documentation carefully, and specifically:

Documentation/process/maintainer-netdev.rst

before posting the next revision.

/P