Re: [PATCH v2 sched_ext/for-7.1] sched_ext: Invalidate dispatch decisions on CPU affinity changes

From: Andrea Righi

Date: Thu Mar 19 2026 - 15:04:45 EST

Hi Kuba,

On Thu, Mar 19, 2026 at 03:18:38PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
>
> On Thu Mar 19, 2026 at 8:35 AM UTC, Andrea Righi wrote:
> > A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
> > target CPU. However, task affinity can change between the dispatch
> > decision and its finalization in finish_dispatch(). When this happens,
> > the scheduler may attempt to dispatch a task to a CPU that is no longer
> > allowed, resulting in fatal errors such as:
> >
> > EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
> >
> > This race exists because ops.dispatch() runs without holding the task's
> > run queue lock, allowing a concurrent set_cpus_allowed() to update
> > p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
> > then finalized using stale affinity information.
> >
> > Example timeline:
> >
> >   CPU0                                      CPU1
> >   ----                                      ----
> >                                             task_rq_lock(p)
> >   if (cpumask_test_cpu(cpu, p->cpus_ptr))
> >                                             set_cpus_allowed_scx(p, new_mask)
> >                                             task_rq_unlock(p)
> >   scx_bpf_dsq_insert(p,
> >           SCX_DSQ_LOCAL_ON | cpu, 0)
> >
> > With commit ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics"), BPF
> > schedulers can avoid the affinity race by tracking task state and
> > handling %SCX_DEQ_SCHED_CHANGE in ops.dequeue(): when a task is dequeued
> > due to a property change, the scheduler can update the task state and
> > skip the direct dispatch from ops.dispatch() for non-queued tasks.
> >
> > However, schedulers that do not implement task state tracking and
> > dispatch directly to a local DSQ directly from ops.dispatch() may
> > trigger the scx_error() condition when the kernel validates the
> > destination in dispatch_to_local_dsq().
>
> The two paragraphs above mention "direct dispatch from ops.dispatch()"
> and "dispatch directly to a local DSQ directly from ops.dispatch()".
> My understanding is that a "direct dispatch" can only happen from
> ops.select_cpu() or ops.enqueue(), not from ops.dispatch(). Is this just
> an unfortunate choice of words?
> Would "dispatch to a local DSQ" be a more accurate phrase here?

Oh yes, poor wording on my side. What I mean is
scx_bpf_dsq_insert(SCX_DSQ_LOCAL_ON | cpu) from ops.dispatch(), so
"dispatch to a local DSQ" is definitely better, thanks!
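
To make the sequence concrete, here is a rough userspace model of the
race (not kernel or BPF code). Every model_* name is hypothetical and
exists only for this illustration; a 64-bit mask stands in for
p->cpus_ptr, and model_finalize() only loosely mirrors the destination
check that the kernel performs when the dispatch is finalized:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a task: bit n set means CPU n is allowed. */
struct model_task {
	uint64_t cpus_allowed;
};

/* Hypothetical record of a dispatch decision that is not yet finalized. */
struct model_dispatch {
	int target_cpu;
};

static bool model_cpu_allowed(const struct model_task *p, int cpu)
{
	return cpu >= 0 && cpu < 64 && ((p->cpus_allowed >> cpu) & 1) != 0;
}

/*
 * Decision step: models the cpumask_test_cpu() check done from
 * ops.dispatch(), without any lock protecting the mask.
 */
static bool model_decide(const struct model_task *p, int cpu,
			 struct model_dispatch *d)
{
	if (!model_cpu_allowed(p, cpu))
		return false;
	d->target_cpu = cpu;
	return true;
}

/* Models a concurrent affinity update that lands after the decision. */
static void model_set_cpus_allowed(struct model_task *p, uint64_t new_mask)
{
	p->cpus_allowed = new_mask;
}

/*
 * Finalization step: re-checks the recorded target against the current
 * mask. Returning false models the "target CPU not allowed" condition.
 */
static bool model_finalize(const struct model_task *p,
			   const struct model_dispatch *d)
{
	return model_cpu_allowed(p, d->target_cpu);
}
```

In this model, calling model_decide() for CPU 10 while CPU 10 is still
allowed, then model_set_cpus_allowed() with a mask that drops CPU 10,
makes the following model_finalize() fail, which is the stale-decision
case from the timeline.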

-Andrea