Re: [PATCH] sched/fair: Update zero_vruntime after clearing on_rq in dequeue_entity()

From: Peter Zijlstra

Date: Mon Mar 23 2026 - 05:58:23 EST


On Mon, Mar 23, 2026 at 08:52:21AM +0100, Vincent Guittot wrote:
> On Thu, 19 Mar 2026 at 12:43, Zicheng Qu <quzicheng@xxxxxxxxxx> wrote:
> >
> > When dequeuing the current entity (cfs_rq->curr) in dequeue_entity(),
> > the cfs_rq->zero_vruntime is updated via update_entity_lag() ->
> > avg_vruntime() -> update_zero_vruntime() while curr->on_rq is still 1.
> > This means the current entity is still included in the zero_vruntime
> > calculation.
>
> curr is not included in zero_vruntime but added when computing
> avg_vruntime so zero_vruntime is not impacted when curr is dequeued

It is, we explicitly add curr back in.

> > However, immediately after this, curr->on_rq is set to 0, which should
> > change the avg_vruntime() result. Without re-updating zero_vruntime, the
> > stale value may be used in subsequent task selection paths:
> >
> > schedule() -> ... -> pick_task_fair() -> pick_next_entity() ->
> > pick_eevdf() -> vruntime_eligible()
> >
> > If entity_tick() -> avg_vruntime() -> update_zero_vruntime() is not
> > triggered in time between dequeue and the next pick, vruntime_eligible()
> > may use an inaccurate cfs_rq->zero_vruntime. This can potentially cause
> > all tasks to appear ineligible, leading to NULL pointer dereference.

This makes no sense.

One entity worth of vruntime should not affect things to the point of
overflow. Yes, it is true that zero_vruntime != avg_vruntime() right
after a dequeue, but that doesn't matter.

vruntime_eligible() does the same math that avg_vruntime() does and
takes this difference into account.

As long as zero_vruntime is close 'enough' to avg_vruntime, all the
deltas are small and nothing overflows.
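
To make the point concrete, here is a small user-space toy model. It is
only a sketch and not the kernel code: struct toy_se, struct toy_rq,
toy_rq_init(), toy_eligible() and toy_count_mismatches() are names made
up for illustration, and the weights and vruntimes are arbitrary. It
assumes the usual formulation where the sums are kept relative to a
reference point and an entity is eligible when its vruntime is at or
below the weighted average. The eligibility test compares
\Sum w_i * (v_i - z) against (v - z) * \Sum w_i, so moving the reference
point z shifts both sides by the same amount and the answer does not
change, provided the keys stay small enough not to overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy entity: a weight and a virtual runtime; not the kernel's sched_entity. */
struct toy_se {
	int64_t weight;
	uint64_t vruntime;
};

/* Toy runqueue: sums are kept relative to zero_vruntime. */
struct toy_rq {
	uint64_t zero_vruntime;	/* reference point, ideally near the average */
	int64_t sum_w_vruntime;	/* \Sum w_i * (v_i - zero_vruntime) */
	int64_t sum_weight;	/* \Sum w_i */
};

static void toy_rq_init(struct toy_rq *rq, uint64_t zero,
			const struct toy_se *se, int nr)
{
	rq->zero_vruntime = zero;
	rq->sum_w_vruntime = 0;
	rq->sum_weight = 0;

	for (int i = 0; i < nr; i++) {
		/* Small signed delta as long as zero is close to v_i. */
		int64_t key = (int64_t)(se[i].vruntime - zero);

		rq->sum_w_vruntime += key * se[i].weight;
		rq->sum_weight += se[i].weight;
	}
}

/*
 * v is eligible iff v <= weighted average, written without a division:
 *   \Sum w_i * (v_i - z) >= (v - z) * \Sum w_i
 */
static bool toy_eligible(const struct toy_rq *rq, uint64_t vruntime)
{
	int64_t key = (int64_t)(vruntime - rq->zero_vruntime);

	return rq->sum_w_vruntime >= key * rq->sum_weight;
}

/*
 * Build the same three entities against two different reference points
 * and count how many entities get a different eligibility answer.
 */
static int toy_count_mismatches(uint64_t zero_a, uint64_t zero_b)
{
	static const struct toy_se se[] = {
		{ 1024, 1000000 }, { 2048, 1000300 }, { 512, 999800 },
	};
	const int nr = sizeof(se) / sizeof(se[0]);
	struct toy_rq a, b;
	int mismatches = 0;

	toy_rq_init(&a, zero_a, se, nr);
	toy_rq_init(&b, zero_b, se, nr);
	assert(a.sum_weight == b.sum_weight);

	for (int i = 0; i < nr; i++)
		mismatches += toy_eligible(&a, se[i].vruntime) !=
			      toy_eligible(&b, se[i].vruntime);

	return mismatches;
}
```

Calling toy_count_mismatches() with a reference point near the average
and one that is a few hundred units off gives no mismatches in this
model; the only thing a stale reference point costs is somewhat larger
keys.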