Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
From: Vincent Guittot
Date: Fri Jun 06 2025 - 12:56:01 EST
On Fri, 6 Jun 2025 at 17:38, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jun 06, 2025 at 05:03:36PM +0200, Vincent Guittot wrote:
> > On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > One of the things lost with introduction of DELAY_DEQUEUE is the
> > > ability of TTWU to move those tasks around on wakeup, since they're
> > > on_rq, and as such, need to be woken in-place.
> >
> > I was thinking that you would call select_task_rq() somewhere in the
> > wakeup path of a delayed entity to get a chance to migrate it, which
> > was one reason for the perf regression (and which would also have been
> > useful for the EAS case). But IIUC, the task is still enqueued on the
> > same CPU; the target CPU just does the enqueue itself instead of the
> > local CPU. Or am I missing something?
>
> Correct. I tried to add that migration into the mix, but then things get
> tricky real fast.
Yeah, I can imagine.
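
For anyone following the thread who is less familiar with the wakelist
machinery, my reading of the patch is: the waking CPU neither migrates the
delayed task nor takes the remote rq->lock; it pushes the task onto the
target CPU's lock-free wake list, and that CPU then does the in-place
wakeup itself from sched_ttwu_pending(). A rough user-space analogy of
that producer/consumer pattern (all names and types below are made up for
illustration; this is not the kernel code):

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct wake_node {
	struct wake_node *next;
	int task_id;			/* stand-in for the task being woken */
};

struct cpu_wakelist {
	_Atomic(struct wake_node *) head;
};

/* Producer side: lock-free push, analogous to llist_add(). */
static void wakelist_push(struct cpu_wakelist *wl, struct wake_node *node)
{
	struct wake_node *old = atomic_load_explicit(&wl->head, memory_order_relaxed);

	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&wl->head, &old, node,
							memory_order_release,
							memory_order_relaxed));
	/* In the kernel an IPI to the target CPU would follow here. */
}

/* Consumer side: detach the whole list at once, analogous to llist_del_all(). */
static struct wake_node *wakelist_del_all(struct cpu_wakelist *wl)
{
	return atomic_exchange_explicit(&wl->head, NULL, memory_order_acquire);
}

int main(void)
{
	struct cpu_wakelist wl = { .head = NULL };
	struct wake_node a = { .task_id = 1 }, b = { .task_id = 2 };

	/* Two "remote" wakeups queued for this CPU... */
	wakelist_push(&wl, &a);
	wakelist_push(&wl, &b);

	/* ...then drained locally, sched_ttwu_pending() style. */
	for (struct wake_node *n = wakelist_del_all(&wl); n; n = n->next)
		printf("in-place wakeup of task %d on the local CPU\n", n->task_id);

	return 0;
}

The producer only ever does a CAS on the list head; the target CPU's
rq->lock is only taken by that CPU itself, which is where the contention
win comes from.
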
>
> Just getting rid of the remote rq lock also helped; these dispatch
> threads just need to get on with waking up tasks, any delay hurts.
>
> > >
> > > Doing the in-place thing adds quite a bit of cross-CPU latency; add a
> > > little something that gets remote CPUs to do their own in-place
> > > wakeups, significantly reducing the rq->lock contention.
> > >
> > > Reported-by: Chris Mason <clm@xxxxxxxx>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> > > ---
> > > kernel/sched/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++------
> > > kernel/sched/fair.c | 5 ++-
> > > kernel/sched/features.h | 1
> > > kernel/sched/sched.h | 1
> > > 4 files changed, 72 insertions(+), 9 deletions(-)
> > >
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3784,6 +3784,8 @@ static int __ttwu_runnable(struct rq *rq
> > > return 1;
> > > }
> > >
> > > +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> > > +
> > > /*
> > > * Consider @p being inside a wait loop:
> > > *
> > > @@ -3811,6 +3813,33 @@ static int __ttwu_runnable(struct rq *rq
> > > */
> > > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > > {
> > > +#ifdef CONFIG_SMP
> > > + if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> > > + /*
> > > + * Similar to try_to_block_task():
> > > + *
> > > + * __schedule() ttwu()
> > > + * prev_state = prev->state if (p->sched_delayed)
> > > + * if (prev_state) smp_acquire__after_ctrl_dep()
> > > + * try_to_block_task() p->state = TASK_WAKING
> > > + * ... set_delayed()
> > > + * RELEASE p->sched_delayed = 1
> > > + *
> > > + * __schedule() and ttwu() have matching control dependencies.
> > > + *
> > > + * Notably, once we observe sched_delayed we know the task has
> > > + * passed try_to_block_task() and p->state is ours to modify.
> > > + *
> > > + * TASK_WAKING controls ttwu() concurrency.
> > > + */
> > > + smp_acquire__after_ctrl_dep();
> > > + WRITE_ONCE(p->__state, TASK_WAKING);
> > > +
> > > + if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> > > + return 1;
> > > + }
> > > +#endif
> > > +
> > > CLASS(__task_rq_lock, guard)(p);
> > > return __ttwu_runnable(guard.rq, p, wake_flags);
> > > }
> > > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > > update_rq_clock(rq);
> > >
> > > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > > + struct rq *p_rq = task_rq(p);
> > > + int ret;
> > > +
> > > + /*
> > > + * This is the ttwu_runnable() case. Notably it is possible for
> > > + * on-rq entities to get migrated -- even sched_delayed ones.
> >
> > I haven't found where the sched_delayed task could migrate to another CPU.
>
> Doesn't happen often, but it can happen. Nothing really stops it from
> happening. E.g. weight-based balancing can do it, as can NUMA balancing
> and affinity changes.
Yes, I agree that delayed tasks can migrate because of load balancing,
but not at wakeup.
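
One more side note, on the ordering comment in the ttwu_runnable() hunk
above, since it took me a minute to convince myself: once the waker
observes sched_delayed it knows __schedule() has finished
try_to_block_task(), so p->__state is the waker's to modify (the comment
says TASK_WAKING then takes care of waker-vs-waker concurrency, which I
leave out here). In plain C11 terms the publish/observe part is roughly a
release store paired with an acquire load; the kernel gets the acquire
from the control dependency plus smp_acquire__after_ctrl_dep(). A
stripped-down model with illustrative names only, not the kernel code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { TASK_BLOCKED = 1, TASK_WAKING = 2 };	/* illustrative values only */

struct task {
	atomic_int state;
	atomic_bool sched_delayed;
};

/* __schedule() side: finish the blocking work, then publish the flag. */
static void blocker_side(struct task *p)
{
	atomic_store_explicit(&p->state, TASK_BLOCKED, memory_order_relaxed);
	/* RELEASE: pairs with the waker's acquire below. */
	atomic_store_explicit(&p->sched_delayed, true, memory_order_release);
}

/* ttwu() side: once the flag is observed, p->state is ours to modify. */
static bool waker_side(struct task *p)
{
	if (!atomic_load_explicit(&p->sched_delayed, memory_order_acquire))
		return false;	/* not delayed: fall back to the normal path */

	/* Safe: the blocker is done with p->state once the flag is visible. */
	atomic_store_explicit(&p->state, TASK_WAKING, memory_order_relaxed);
	return true;
}

int main(void)
{
	struct task p = { .state = 0, .sched_delayed = false };

	blocker_side(&p);
	if (waker_side(&p))
		printf("claimed delayed task, state=%d\n", atomic_load(&p.state));
	return 0;
}
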