Re: [REGRESSION] sched/deadline: Hard lockup during CPU offline after commit 14a857056466
From: juri.lelli@xxxxxxxxxx
Date: Mon May 18 2026 - 03:49:09 EST
Hello,
On 16/05/26 03:07, batcain wrote:
> [1.] One line summary of the problem:sched/deadline: Hard lockup
> during CPU offline/migration due to frozen rq_clock loop in
> update_dl_revised_wakeup()
>
> [2.] Full description of the problem/report: A deterministic hard
> lockup occurs during CPU hotplug (offlining a secondary core) on
> stable kernels containing commit
> 14a857056466be9d3d907a94e92a704ac1be149b.
>
> When a CPU core is set offline, tasks are migrated within the
> stop_machine() context where local interrupts are fully disabled
> (irqs_disabled()). During task migration, enqueue_task_dl() calls
> update_dl_entity(). Because of the new dl_defer rule introduced for
> implicit dl_servers, the code is forced into the
> update_dl_revised_wakeup() branch.
>
> Inside update_dl_revised_wakeup(), the logic depends on rq_clock(rq)
> to calculate laxity: u64 laxity = dl_se->deadline - rq_clock(rq);
>
> However, under the stop_machine() noirq phase, the runqueue clock is
> stale/frozen. Since the clock does not progress across iterations
> within the enqueue loop, the mathematical state stalls. Consequently,
> dl_entity_overflow() continuously evaluates to true, trapping the
> processor core in an infinite loop inside the enqueue path, resulting
> in a system-wide hard lockup.
I cannot immediately see how this issue can affect dl-server(s), as they
cannot migrate and are de-activated on CPUs going offline.
> [3.] Keywords (keywords of the affected subsystem): sched, deadline,
> dl_server, cpuhp, hotplug, hard-lockup, regression
>
> [4.] Kernel information (output of "uname -a" or version): Linux
> workstation 7.0.7-hardened2-1-hardened #1 SMP PREEMPT_DYNAMIC Fri, 15
> May 2026 00:03:13 +0000 x86_64 GNU/Linux
>
> [5.] Most recent kernel version which did not have the bug: Any kernel
> release prior to the integration/backport of commit 14a857056466.
>
> [6.] Output of Oops/Panic/Bug/Objdump: No native kernel oops/panic
> stack trace is written to disk/serial because the freeze occurs inside
> stop_machine() with interrupts masked. NMI watchdog triggers a hard
> lockup panic if aggressively armed.
>
> [7.] A small program which triggers the problem: # echo 0 >
> /sys/devices/system/cpu/cpu1/online
>
> [8.] Environment description (Hardware, distribution, etc.): Hardware:
> Confirmed on both AMD Zen 2 (Renoir) and AMD Zen 4 (Phoenix)
> platforms. Distribution: Arch Linux (using official
> extra/linux-hardened kernel package).
Also cannot reproduce at my end.
Thanks,
Juri