Re: BUG: workqueue lockup - SRCU schedules work on not-online CPUs during size transition
From: Vasily Gorbik
Date: Wed Apr 29 2026 - 13:51:33 EST
On Tue, Apr 14, 2026 at 12:24:12PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 09, 2026 at 09:03:26PM -0700, Paul E. McKenney wrote:
> Please see below for the full patch, including refraining from queueing
> workqueue handlers on not-yet-online CPUs and diverting SRCU callbacks
> from not-yet-fully-online CPUs to the boot CPU's callback queue.
...
> commit ce533a60b2ef29a9b516cc717e77c6b679bc09c0
> Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Date: Thu Apr 9 11:16:02 2026 -0700
>
> srcu: Don't queue workqueue handlers to never-online CPUs
>
> While an srcu_struct structure is in the midst of switching from CPU-0
> to all-CPUs state, it can attempt to invoke callbacks for CPUs that
> have never been online. Worse yet, it can attempt to invoke callbacks
> for CPUs that never will be online due to not being present in the
> cpu_possible_mask. This can cause hangs on s390, which is not set up to
> deal with workqueue handlers being scheduled on such CPUs. This commit
> therefore causes Tree SRCU to refrain from queueing workqueue handlers
> on CPUs that have not yet (and might never) come online.
>
> Because callbacks are not invoked on CPUs that have not been
> online, it is an error to invoke call_srcu(), synchronize_srcu(), or
> synchronize_srcu_expedited() on a CPU that is not yet fully online.
> However, it turns out to be less code to redirect the callbacks
> from too-early invocations of call_srcu() than to warn about such
> invocations. This commit therefore also redirects callbacks queued on
> not-yet-fully-online CPUs to the boot CPU.
>
> Reported-by: Vasily Gorbik <gor@xxxxxxxxxxxxx>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Tested-by: Vasily Gorbik <gor@xxxxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
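
For anyone following along, the two behaviors described in the commit log
can be pictured with a small userspace model. This is only a sketch of my
reading of the commit log, not the kernel code: the names cpu_mask_t,
model_cpu_online(), model_callback_cpu() and model_queue_targets() are
hypothetical, and a plain 64-bit mask stands in for whatever state the
real patch uses to track CPUs that have come fully online.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model, not kernel code.
 * Bit n set means CPU n has come fully online (assumption for this sketch).
 */
typedef uint64_t cpu_mask_t;
#define MODEL_MAX_CPUS 64u

/* True if @cpu is within range and marked fully online in @online. */
bool model_cpu_online(cpu_mask_t online, unsigned int cpu)
{
	return cpu < MODEL_MAX_CPUS && ((online >> cpu) & 1u);
}

/*
 * Callback redirection: a callback posted from a CPU that is not yet
 * fully online lands on the boot CPU's queue instead of its own.
 */
unsigned int model_callback_cpu(cpu_mask_t online, unsigned int cpu,
				unsigned int boot_cpu)
{
	/* The boot CPU is assumed to be online by the time callbacks are posted. */
	assert(model_cpu_online(online, boot_cpu));
	return model_cpu_online(online, cpu) ? cpu : boot_cpu;
}

/*
 * Handler queueing: of the CPUs with pending callbacks, only those that
 * are fully online are eligible to have a workqueue handler queued.
 */
cpu_mask_t model_queue_targets(cpu_mask_t online, cpu_mask_t pending)
{
	return pending & online;
}
```

With CPUs 0 and 1 online (mask 0x3), a callback posted from CPU 5 is
redirected to boot CPU 0, and pending work on CPUs 1 and 2 (mask 0x6)
results in a handler being queued only on CPU 1.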
I retested it on s390 and on x86 KVM with --smp 16,maxcpus=255, and all
looks good to me.
FWIW, again:
Tested-by: Vasily Gorbik <gor@xxxxxxxxxxxxx>
Would you mind adding Cc: stable so it gets picked up for v7.0?
61bbcfb50514 ("srcu: Push srcu_node allocation to GP when
non-preemptible") is what made it reproducible for us.
Thank you!