Re: [RFC][PATCH 8/8] sched/eevdf: Move to a single runqueue
From: Peter Zijlstra
Date: Wed Mar 18 2026 - 05:03:17 EST
On Tue, Mar 17, 2026 at 11:16:52PM +0530, K Prateek Nayak wrote:
> > + /*
> > + * XXX comment on the curr thing
> > + */
> > + curr = (cfs_rq->curr == se);
> > + if (curr)
> > + place_entity(cfs_rq, se, flags);
> >
> > + if (se->on_rq && se->sched_delayed)
> > + requeue_delayed_entity(cfs_rq, se);
> >
> > + weight = enqueue_hierarchy(p, flags);
>
> Here is a question I had when I first saw this on sched/flat and I've
> only looked at the series briefly:
>
> enqueue_hierarchy() would end up updating the averages, and reweighting
> the hierarchical load of the entities in the new task's hierarchy ...
>
> >
> > + if (!curr) {
> > + reweight_eevdf(cfs_rq, se, weight, false);
> > + place_entity(cfs_rq, se, flags | ENQUEUE_QUEUED);
>
> ... and the hierarchical weight of the newly enqueued task would be
> based on this updated hierarchical proportion.
>
> However, the tasks that are already queued have their deadlines
> calculated based on the old hierarchical proportions at the time they
> were enqueued / during the last task_tick_fair() for an entity that
> was put back.
>
> Consider two tasks of equal weight on cgroups with equal weights:
>
> root (weight: 1024)
> / \
> CG0 CG1 (weight(CG0,CG1) = 512)
> | |
> T0 T1 (h_weight(T0,T1) = 256)
>
>
> and a third task of equal weight arrives (for the sake of simplicity,
> also consider that both cgroups have saturated their respective global
> shares on this CPU - similar to UP mode):
>
>
> root (weight: 1024)
> / \
> (weight: 512) CG0 CG1 (weight: 512)
> / / \
> (h_weight(T0) = 256) T0 T1 T2 (h_weight(T2) = 128)
>
> (h_weight(T1) = 256)
>
>
> Logically, once T2 arrives, T1 should also be reweighted, its
> hierarchical proportions adjusted, and its vruntime and deadline
> adjusted accordingly based on the lag, but that doesn't happen.
You are absolutely right.
> Instead, we continue with an approximation of h_load as seen at some
> point in the past. Is that alright with EEVDF, or am I missing
> something?
Strictly speaking it is dodgy as heck ;-) I was hoping that on average
it would all work out. Esp. since PELT is a fairly slow and smooth
function, the reweights will mostly be minor adjustments.
> Can it so happen that on SMP, future enqueues and SMP conditions
> always lead to a larger h_load for the newly enqueued tasks, and as a
> result the older tasks become less favourable for the pick, leading
> to starvation? (Am I being paranoid?)
So typically the most recent enqueue will always have the smaller
fraction of the group weight. This would slightly favour the older
enqueue, so I think this would lead to a FIFO-like bias.
But there is definitely some fun to be had here.
One definite fix is setting cgroup_mode to 'up' :-)
> > + __enqueue_entity(cfs_rq, se);
> > }
> >
> > if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
>
> Anyhow, me goes and sees if any of this makes a difference to the
> benchmarks - I'll throw the biggest one at it first and see how
> that goes.
Thanks, fingers crossed. :-)