Re: [PATCH v2] sched_ext: Document task ownership state machine

From: Kuba Piecuch

Date: Fri Mar 20 2026 - 09:56:16 EST


Hi Andrea,

Sorry for the late reply, I'm catching up on mail from the past ~month.

On Thu Mar 5, 2026 at 6:29 AM UTC, Andrea Righi wrote:
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index bd26811fea99d..417d3c6f02fe3 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -1042,26 +1042,108 @@ static const char *scx_enable_state_str[] = {
> };
>
> /*
> - * sched_ext_entity->ops_state
> + * Task Ownership State Machine (sched_ext_entity->ops_state)
> *
> - * Used to track the task ownership between the SCX core and the BPF scheduler.
> - * State transitions look as follows:
> + * The sched_ext core uses this state machine to track task ownership
> + * between the SCX core and the BPF scheduler. This allows the BPF
> + * scheduler to dispatch tasks without strict ordering requirements, while
> + * the SCX core safely rejects invalid dispatches.
> *
> - * NONE -> QUEUEING -> QUEUED -> DISPATCHING
> - * ^ | |
> - * | v v
> - * \-------------------------------/
> + * State Transitions
> *
> - * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
> - * sites for explanations on the conditions being waited upon and why they are
> - * safe. Transitions out of them into NONE or QUEUED must store_release and the
> - * waiters should load_acquire.
> + * .------------> NONE (owned by SCX core)
> + * | | ^
> + * | enqueue | | direct dispatch
> + * | v |
> + * | QUEUEING -------'
> + * | |
> + * | enqueue |
> + * | completes |
> + * | v
> + * | QUEUED (owned by BPF scheduler)
> + * | |
> + * | dispatch |
> + * | |
> + * | v
> + * | DISPATCHING
> + * | |
> + * | dispatch |
> + * | completes |
> + * `---------------'

We can also go directly from QUEUED to NONE when a task is dequeued for an
attribute change or picked by core-sched.

> + * State Descriptions
> + *
> + * - %SCX_OPSS_NONE:
> + * Task is owned by the SCX core. It's either on a run queue, running,
> + * or being manipulated by the core scheduler. The BPF scheduler has no
> + * claim on this task.

A blocked task's ops_state is also NONE. Or are we assuming here that the task
is on_rq?
Also, a task waiting to run on a built-in DSQ is in the NONE state as well.

> + *
> + * - %SCX_OPSS_QUEUEING:
> + * Transitional state while transferring a task from the SCX core to
> + * the BPF scheduler. The task's rq lock is held during this state.
> + * Since QUEUEING is both entered and exited under the rq lock, dequeue
> + * can never observe this state (it would be a BUG). When finishing a
> + * dispatch, if the task is still in %SCX_OPSS_QUEUEING the completion
> + * path busy-waits for it to leave this state (via wait_ops_state())
> + * before retrying.
> + *
> + * - %SCX_OPSS_QUEUED:
> + * Task is owned by the BPF scheduler. It's on a DSQ (dispatch queue)
> + * and the BPF scheduler is responsible for dispatching it. A QSEQ

The task doesn't have to be on a DSQ; it can be queued on some BPF data
structure instead. Even if it is on a DSQ, its state depends on whether it's
on a user DSQ (QUEUED) or a built-in DSQ, e.g. local (NONE).

This prompted me to take a look at the logic around SCX_OPSS_QUEUED, and I
can't convince myself that it's correct in the case of direct dispatches to
non-built-in DSQs.

The only place where ops_state is set to QUEUED is at the end of
do_enqueue_task(). Notably, this assignment is skipped in the case of direct
dispatch.

direct_dispatch() will then call dispatch_enqueue() with SCX_ENQ_CLEAR_OPSS,
causing ops_state to be reset to NONE. We end up in a state where the task
is enqueued on a user DSQ, its ops_state is NONE and p->scx.flags has
SCX_TASK_IN_CUSTODY, which doesn't look like a consistent state to me.

Am I missing something here?

> + * (queue sequence number) is embedded in this state to detect
> + * dispatch/dequeue races: if a task is dequeued and re-enqueued, the
> + * QSEQ changes and any in-flight dispatch operations targeting the old
> + * QSEQ are safely ignored.

Technically speaking, the QSEQ is also embedded in QUEUEING, where it serves
the same purpose.

> + *
> + * - %SCX_OPSS_DISPATCHING:
> + * Transitional state while transferring a task from the BPF scheduler
> + * back to the SCX core. This state indicates the BPF scheduler has
> + * selected the task for execution. When dequeue needs to take the task

This description only applies to the case of a task being dispatched from
ops.dispatch().

There are cases where a task is transferred from the BPF scheduler to SCX core
without going through DISPATCHING, e.g. dequeue for attribute change or
core-sched pick.

The state does indicate that the BPF scheduler has selected the task for
execution, but the converse doesn't hold: when the BPF scheduler selects a
task for execution via direct dispatch, the task never enters the
DISPATCHING state.

> + * off a DSQ and it is still in %SCX_OPSS_DISPATCHING, the dequeue path
> + * busy-waits for it to leave this state (via wait_ops_state()) before
> + * proceeding. Exits to %SCX_OPSS_NONE when dispatch completes.
> + *
> + * Memory Ordering
> + *
> + * Transitions out of %SCX_OPSS_QUEUEING and %SCX_OPSS_DISPATCHING into
> + * %SCX_OPSS_NONE or %SCX_OPSS_QUEUED must use atomic_long_set_release()
> + * and waiters must use atomic_long_read_acquire(). This ensures proper
> + * synchronization between concurrent operations.

The transition from QUEUED to NONE in ops_dequeue() isn't covered here.
It uses atomic_long_try_cmpxchg(), which implies full ordering.

> + *
> + * Cross-CPU Task Migration
> + *
> + * When moving a task in the %SCX_OPSS_DISPATCHING state, we can't simply
> + * grab the target CPU's rq lock because a concurrent dequeue might be
> + * waiting on %SCX_OPSS_DISPATCHING while holding the source rq lock
> + * (deadlock).
> + *
> + * The sched_ext core uses a "lock dancing" protocol coordinated by
> + * p->scx.holding_cpu. When moving a task to a different rq:
> + *
> + * 1. Verify task can be moved (CPU affinity, migration_disabled, etc.)
> + * 2. Set p->scx.holding_cpu to the current CPU
> + * 3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
> + * is set, so clearing DISPATCHING first prevents the circular wait
> + * (safe to lock the rq we need)
> + * 4. Unlock the current CPU's rq
> + * 5. Lock src_rq (where the task currently lives)
> + * 6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
> + * race (dequeue clears holding_cpu to -1 when it takes the task), in
> + * this case migration is aborted
> + * 7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
> + * into dst_rq's local DSQ (no lock swap needed)
> + * 8. Otherwise: call move_remote_task_to_local_dsq(), which releases
> + * src_rq, locks dst_rq, and performs the deactivate/activate
> + * migration cycle (dst_rq is held on return)
> + * 9. Unlock dst_rq and re-lock the current CPU's rq to restore
> + * the lock state expected by the caller
> + *
> + * If any verification fails, abort the migration.

Maybe also mention that the same dance happens during direct dispatch, where
the task's state at the beginning of the dance is already NONE
(set in direct_dispatch()), but src_rq is guaranteed to be equal to the current
CPU's rq?

> + *
> + * This state tracking allows the BPF scheduler to try to dispatch any task
> + * at any time regardless of its state. The SCX core can safely
> + * reject/ignore invalid dispatches, simplifying the BPF scheduler
> + * implementation.
> */
> enum scx_ops_state {
> SCX_OPSS_NONE, /* owned by the SCX core */

Thanks,
Kuba