Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting

From: Andrea Righi

Date: Fri May 08 2026 - 10:14:54 EST

Hi Christian,

On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
> When aborting, consume_dispatch_q() breaks out of the task iteration
> loop entirely for non-bypass DSQs. This prevents CPUs from consuming
> even their own tasks (where rq == task_rq) from any DSQ.
>
> This causes a deadlock during CPU hotplug:
>
> 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
> setting sch->aborting and queuing the disable_work on the helper
> kthread.
>
> 2. The helper kthread (and other tasks) are stuck on the global or
> user DSQs because bypass mode hasn't been entered yet.
>
> 3. No CPU can consume these tasks due to the aborting break, so the
> helper never runs scx_root_disable() -> scx_bypass().
>
> 4. The cpuhp thread is stuck in balance_hotplug_wait() because the
> dying CPU's rq never drains.
>
> Tasks on user DSQs are equally affected: BPF schedulers can dispatch
> RCU and other critical kthreads to user DSQs, causing RCU stalls when
> those tasks become unconsumable.
>
> The aborting check was added to prevent live-locks from the remote task
> migration path (consume_remote_task() -> goto retry), but also avoid
> holding the dsq->lock for too long.
>
> Change the break to skip only remote tasks via continue, allowing each
> CPU to still consume tasks already on its own rq. This unblocks the
> helper kthread, lets bypass mode activate, and allows both hotplug and
> RCU grace periods to complete.

Have you been able to reproduce this stall condition?

When the kernel forces bypass, scx_bypass() explicitly walks every CPU's
runnable_list and cycles tasks through DEQUEUE_SAVE | DEQUEUE_MOVE so
dispatching stops depending on BPF.

On CPU hotplug, the helper kthread (and all the other critical kthreads)
should also be on the runnable_list, so they should be moved to
SCX_DSQ_BYPASS, and consume_dispatch_q() should be able to consume them.

Maybe the problem is that do_enqueue_task() keeps tasks on the local DSQ
when !scx_rq_online(rq), even while bypassing. We should prioritize the
bypass condition instead.
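
To make the ordering issue concrete, here is a toy userspace model of the
two check orders. This is only a sketch, not kernel code: the enum and the
function names are made up for illustration, and ROUTE_OTHER stands in for
everything do_enqueue_task() does after these two checks.

```c
/* Toy model only: hypothetical names, not taken from kernel/sched/ext.c. */
#include <assert.h>
#include <stdbool.h>

enum route { ROUTE_LOCAL, ROUTE_BYPASS, ROUTE_OTHER };

/* Current order: the offline check wins, so bypass is never considered. */
static enum route route_offline_first(bool rq_online, bool bypassing)
{
	if (!rq_online)
		return ROUTE_LOCAL;
	if (bypassing)
		return ROUTE_BYPASS;
	return ROUTE_OTHER;
}

/* Proposed order: bypass takes precedence over the offline check. */
static enum route route_bypass_first(bool rq_online, bool bypassing)
{
	if (bypassing)
		return ROUTE_BYPASS;
	if (!rq_online)
		return ROUTE_LOCAL;
	return ROUTE_OTHER;
}

/* The two orders disagree only for an offline rq while bypassing. */
static void check_routing(void)
{
	assert(route_offline_first(false, true) == ROUTE_LOCAL);
	assert(route_bypass_first(false, true) == ROUTE_BYPASS);

	/* Every other combination is routed identically. */
	assert(route_offline_first(true, true) == route_bypass_first(true, true));
	assert(route_offline_first(false, false) == route_bypass_first(false, false));
	assert(route_offline_first(true, false) == route_bypass_first(true, false));
}
```

In this model the only input where the two orders differ is an offline rq
with bypass active, which is the hotplug case discussed above.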

Does something like the following make sense to you?

Thanks,
-Andrea

kernel/sched/ext.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7ac7d10a41bef..277110d950c30 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1901,6 +1901,17 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	p->scx.flags &= ~SCX_TASK_IMMED;
 
+	/*
+	 * Check bypass before testing the rq online state: bypass mode stops
+	 * processing local DSQs, so tasks should be routed through
+	 * SCX_DSQ_BYPASS rather than dispatched to the local DSQ during CPU
+	 * hotplug events.
+	 */
+	if (scx_bypassing(sch, cpu_of(rq))) {
+		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
+		goto bypass;
+	}
+
 	/*
 	 * If !scx_rq_online(), we already told the BPF scheduler that the CPU
 	 * is offline and are just running the hotplug path. Don't bother the
@@ -1909,11 +1920,6 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	if (!scx_rq_online(rq))
 		goto local;
 
-	if (scx_bypassing(sch, cpu_of(rq))) {
-		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
-		goto bypass;
-	}
-
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
 		goto direct;