Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()

From: K Prateek Nayak

Date: Tue Apr 28 2026 - 12:34:36 EST


(+ scheduler folks)

Hello Zhan,

On 4/28/2026 8:19 PM, Zhan Xusheng wrote:
> After commit 556146ce5e94 ("sched/fair: Avoid overflow in
> enqueue_entity()"), place_entity() can shift cfs_rq->zero_vruntime
> towards a newly enqueued heavy entity. This can make (vruntime -
> zero_vruntime) very large for other entities and cause key * load in
> vruntime_eligible() to overflow s64, flipping the eligibility result.

So the commit in question moves the zero_vruntime only when the
load > sum_weight.

You seem to have found a case where entity_key() is already large
enough that moving zero_vruntime farther away makes the key * load
product in the eligibility check overflow, which we had hoped would
not happen.
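
To make the failure mode concrete, here is a rough user-space sketch.
It is not vruntime_eligible() itself. It assumes only that the check has
the form avg >= key * load. avg, key and load are plain int64_t
stand-ins, and eligible_naive()/eligible_checked() are hypothetical
names. __builtin_mul_overflow() is the GCC/Clang builtin that
check_mul_overflow() is built on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins, not kernel code:
 *   avg  - weighted key sum, sum(key_i * w_i)
 *   key  - vruntime - zero_vruntime of the entity being tested
 *   load - total weight, assumed > 0
 * The entity is treated as eligible when avg >= key * load.
 */

/* Naive form: the 64-bit product silently wraps. The multiply is done on
 * uint64_t to avoid C signed-overflow UB; converting back to int64_t
 * wraps modulo 2^64 on GCC and Clang. */
bool eligible_naive(int64_t avg, int64_t key, int64_t load)
{
	int64_t prod = (int64_t)((uint64_t)key * (uint64_t)load);

	return avg >= prod;
}

/* Checked form: if the exact product does not fit in int64_t, it lies
 * beyond any possible avg, so its sign (the sign of key, since
 * load > 0) decides the comparison. */
bool eligible_checked(int64_t avg, int64_t key, int64_t load)
{
	int64_t prod;

	assert(load > 0);
	if (__builtin_mul_overflow(key, load, &prod))
		return key < 0;	/* negative overflow: product < INT64_MIN <= avg */
	return avg >= prod;
}
```

With key = 1 << 62 and load = 4, the wrapped product is 0, so
eligible_naive(0, key, 4) reports the entity as eligible, while
eligible_checked() correctly does not. With key = -(1 << 62) and
load = 3, the error goes the other way.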

Do you have a reproducer where pick_eevdf() fails after the introduction
of commit 556146ce5e94? Also, do you see any splats in dmesg? We have a
defensive WARN_ON() to catch an overflow.

>
> Use check_mul_overflow() for the multiplication and fall back to a
> sign-based result on overflow.

Don't we have PARANOID_AVG for that? Can you do:

echo PARANOID_AVG > /sys/kernel/debug/sched/features
# run your workload
grep "sum_shift" /sys/kernel/debug/sched/debug

and check whether "sum_shift" becomes non-zero.

>
> Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")

So that optimization lets us avoid a warning, but I did not see a crash
anywhere without it during my testing.

If you have a workload that crashes because of those changes, perhaps we
can look for a cheap enough way to move zero_vruntime closer to the true
average.
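
For reference, here is a rough user-space sketch of what "the true
average" means in this context. It assumes running sums of key * weight
and of weight, kept relative to zero_vruntime. struct rq_sketch,
sketch_avg() and sketch_rebase() are hypothetical names, not kernel
code. The delta * sum_w step can itself overflow, and that is exactly
the case that would need care.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical running sums relative to a reference point "zero". */
struct rq_sketch {
	uint64_t zero;    /* reference vruntime (think zero_vruntime) */
	int64_t  sum_kw;  /* sum of (vruntime_i - zero) * w_i */
	int64_t  sum_w;   /* sum of weights, assumed > 0 */
};

/* True weighted average: zero + floor(sum_kw / sum_w).
 * Assumes sum_kw is not close to INT64_MIN. */
uint64_t sketch_avg(const struct rq_sketch *rq)
{
	int64_t q = rq->sum_kw;

	assert(rq->sum_w > 0);
	if (q < 0)
		q -= rq->sum_w - 1;	/* C '/' truncates; bias it to floor */
	/* Adding a negative value cast to uint64_t wraps, i.e. subtracts. */
	return rq->zero + (uint64_t)(q / rq->sum_w);
}

/* Move "zero" onto the average. Every key shrinks by delta, so the
 * weighted sum shrinks by delta * sum_w: O(1), no per-entity walk.
 * The (int64_t) conversion wraps modulo 2^64 on GCC and Clang. */
void sketch_rebase(struct rq_sketch *rq)
{
	uint64_t avg = sketch_avg(rq);
	int64_t delta = (int64_t)(avg - rq->zero);

	rq->sum_kw -= delta * rq->sum_w;
	rq->zero = avg;
}
```

In the sketch, rebasing only needs to touch the two sums because each
key is derived from vruntime - zero on demand. After the rebase, the
keys are centered on the average, so their magnitudes are bounded by the
spread of vruntimes rather than by how far zero had drifted.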

--
Thanks and Regards,
Prateek