Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics

From: Thorsten Leemhuis

Date: Wed May 13 2026 - 03:33:36 EST


On 5/11/26 07:21, Jiri Slaby wrote:
> we currently have several reports of this. On s390, ppc64, and x86_64.

I stumbled on this by accident and this is not my area of expertise, so
the following might be bogus:

Is this maybe the same as "Observed Workqueue lockups on offline CPUs.":
https://lore.kernel.org/lkml/97a7d011-d573-4754-9e5d-68b562c64089@xxxxxxxxxxxxx/

Fix is here:
https://lore.kernel.org/lkml/20260508174353.905746-1-paulmck@xxxxxxxxxx/
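
In case it helps to compare the reports, here is a rough sketch that lists
the CPUs named in "workqueue lockup" messages which are absent from a given
online mask. It assumes the "cpus=" field is a plain cpulist such as "144" or
"0-3,8", as in the messages quoted below. The function names are made up for
this mail and not taken from any existing tool:

```python
import re

# Matches the watchdog line quoted in this thread, e.g.
# "BUG: workqueue lockup - pool cpus=144 node=0 flags=0x4 nice=0 stuck for ..."
LOCKUP_RE = re.compile(r"BUG: workqueue lockup - pool cpus=([\d,\-]+)")


def parse_cpulist(text):
    """Expand a cpulist string such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def lockup_cpus(log_text):
    """Collect every CPU mentioned in a workqueue lockup message."""
    cpus = set()
    for match in LOCKUP_RE.finditer(log_text):
        cpus |= parse_cpulist(match.group(1))
    return cpus


def stalled_but_not_online(log_text, online_list):
    """Return the sorted CPUs that show a lockup but are not in online_list."""
    return sorted(lockup_cpus(log_text) - parse_cpulist(online_list))
```

The online mask can be read from /sys/devices/system/cpu/online on the
affected machine and passed in as online_list together with the dmesg text.
A non-empty result would at least be consistent with the "lockups on offline
CPUs" report, though it proves nothing by itself.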

Ciao, Thorsten

> On 07. 05. 26, 15:11, Breno Leitao wrote:
>> Hi Jiri,
>>
>> On Thu, May 07, 2026 at 12:20:33PM +0200, Jiri Slaby wrote:
>>> On 05. 03. 26, 17:15, Breno Leitao wrote:
>>>
>>>    BUG: workqueue lockup - pool cpus=144 node=0 flags=0x4 nice=0 stuck for 168224s!
>>
>> That's an extremely long stall (~1.95 days).
>>
>>> ...
>>>    Showing busy workqueues and worker pools:
>>>    workqueue rcu_gp: flags=0x108
>>>      pwq 578: cpus=144 node=0 flags=0x4 nice=0 active=3 refcnt=4
>>> in:
>>>    https://bugzilla.suse.com/show_bug.cgi?id=1263947
>>> ?
>>>
>>> Can this (or other patch from the series) cause this? Should there be
>>> something like cpu_online() instead of task_is_running() somewhere?
>>
>> This series only affects stall reporting, not detection. The changes run
>> after the watchdog has identified a stall, so the detection logic itself
>> remains unchanged.
>>
>> To help diagnose this issue, could you provide some additional
>> information:
>>
>> 1) Was CPU 144 online at any point? If so, when was it taken offline?
>
> It was not, it's non-present.
>
>> 2) Does this message appear repeatedly? If you bring CPU 144 online, does
>>     the issue resolve?
>
> Yes, look at this new x86_64 report's dmesg (I believe it is related to
> the above report):
>   BUG: workqueue lockup - pool cpus=2 node=0 flags=0x4 nice=0 stuck for 50s!
> in:
>   https://bugzilla.suse.com/attachment.cgi?id=890229
>
> $ grep -c BUG sl.txt
> 504
> $ grep -c pwq sl.txt
> 509
>
> It comes from:
> https://bugzilla.suse.com/show_bug.cgi?id=1264554
>
>> 3) Have you run similar tests on earlier kernel versions without seeing
>>     this behavior, or is this a clear regression?
>
> It's new in 7.0. Going back to 6.19.12 makes it disappear.
>
> thanks,