Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
From: John Stultz
Date: Mon Mar 30 2026 - 17:52:18 EST
On Mon, Mar 30, 2026 at 12:43 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 30, 2026 at 12:40:45PM -0700, John Stultz wrote:
> > On Mon, Mar 30, 2026 at 3:10 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > This means, that if the two tasks playing leapfrog can reach the
> > > critical speed to reach the overflow point inside one tick's worth of
> > > time, we're up a creek.
> > >
> > > If this is indeed the case, then the below should cure things.
> > >
> > > This also means that running a HZ=100 config will increase the chances
> > > of hitting this vs HZ=1000.
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 9298f49f842c..c7daaf941b26 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9307,6 +9307,7 @@ static void yield_task_fair(struct rq *rq)
> > >         if (entity_eligible(cfs_rq, se)) {
> > >                 se->vruntime = se->deadline;
> > >                 se->deadline += calc_delta_fair(se->slice, se);
> > > +               avg_vruntime(cfs_rq);
> > >         }
> > >  }
> >
> > I just tested with this and similar to Prateek, I also still tripped the issue.
> >
> > I'll give your new patch a spin here in a second.
>
> Stick both on please :-) AFAICT they're both real, just not convinced
> they're what you're hitting.

Sadly I'm still hitting it with both. This time the stack trace was
different, and it came up through do_nanosleep() from stress-ng-exit
instead of yield.

I'll re-add my debug trace_printks (I dropped them while testing your
patches in case they changed the timing of things) and work to
understand more here.

thanks
-john

[ 6777.071789] BUG: kernel NULL pointer dereference, address: 0000000000000051
[ 6777.076712] #PF: supervisor read access in kernel mode
[ 6777.079767] #PF: error_code(0x0000) - not-present page
[ 6777.082787] PGD 0 P4D 0
[ 6777.084361] Oops: Oops: 0000 [#1] SMP NOPTI
[ 6777.086812] CPU: 37 UID: 0 PID: 531349 Comm: stress-ng-exit- Tainted: G W 7.0.0-rc1-00001-gb3d99f43c72b-dirty #18 PREEMPT(full)
[ 6777.094026] Tainted: [W]=WARN
[ 6777.095771] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[ 6777.100689] RIP: 0010:pick_task_fair+0x6f/0xb0
[ 6777.103239] Code: 85 ff 74 52 48 8b 47 48 48 85 c0 74 d6 80 78 50 00 74 d0 48 89 3c 24 e8 8f e0 ff ff 48 8b 3c 24 be 01 00 00 00 e8 31 77 ff ff <80> 78 51 00 74 c3 ba 21 00 00 00 48 89 c6 48 89 df e8 db f1 ff ff
[ 6777.113447] RSP: 0018:ffffc9000f7dbcf0 EFLAGS: 00010082
[ 6777.116283] RAX: 0000000000000000 RBX: ffff8881b976bbc0 RCX: 0000000000000800
[ 6777.119791] RDX: 000000000a071800 RSI: 000000000b719000 RDI: 00004fc5ab7864c9
[ 6777.123608] RBP: ffffc9000f7dbdf0 R08: 0000000000000400 R09: 0000000000000002
[ 6777.127785] R10: 0000000000000025 R11: 0000000000000000 R12: ffff88810adc4200
[ 6777.131937] R13: ffff88810adc4200 R14: ffffffff82ce5b28 R15: ffff8881b976bbc0
[ 6777.135994] FS: 00007fc1c37866c0(0000) GS:ffff888235c2b000(0000) knlGS:0000000000000000
[ 6777.140449] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6777.143756] CR2: 0000000000000051 CR3: 000000014160f005 CR4: 0000000000370ef0
[ 6777.147819] Call Trace:
[ 6777.149379] <TASK>
[ 6777.150653] pick_next_task_fair+0x3c/0x8c0
[ 6777.153115] __schedule+0x1e8/0x1200
[ 6777.155241] ? do_nanosleep+0x1a/0x170
[ 6777.157336] schedule+0x3d/0x130
[ 6777.159150] do_nanosleep+0x88/0x170
[ 6777.161161] ? find_held_lock+0x2b/0x80
[ 6777.163201] hrtimer_nanosleep+0xba/0x1f0
[ 6777.165481] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 6777.167990] common_nsleep+0x34/0x60
[ 6777.169957] __x64_sys_clock_nanosleep+0xde/0x150
[ 6777.172443] do_syscall_64+0xf3/0x680
[ 6777.174409] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 6777.177176] RIP: 0033:0x7fc1cc92d9ee
[ 6777.179009] Code: 08 0f 85 f5 4b ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 80 00 00 00 00 48 83 ec 08
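
A note for anyone reading the oops above: the faulting bytes marked in the kernel Code: line, <80> 78 51 00, decode as cmpb $0x0,0x51(%rax), and with RAX = 0 that yields the reported fault address 0x51 (CR2). The sketch below only checks that arithmetic. It is an illustration under stated assumptions: decode_cmpb_disp8() and oops_fault_address() are hypothetical helpers, not kernel functions, and the decoder handles just this one instruction form (opcode 80 /7 with an 8-bit displacement, no REX prefix, no SIB byte).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): decode "80 /7 ib" with
 * ModRM mod=01, i.e. cmpb $imm8, disp8(%reg), and compute the
 * effective address for the given base register value.
 * Returns 0 on success, -1 if the bytes are not in that form.
 */
static int decode_cmpb_disp8(const uint8_t *insn, size_t len,
                             uint64_t base, uint64_t *ea)
{
    uint8_t modrm, mod, reg, rm;

    if (len < 4 || insn[0] != 0x80)
        return -1;

    modrm = insn[1];
    mod = modrm >> 6;        /* 01 = register base + disp8 */
    reg = (modrm >> 3) & 7;  /* /7 selects CMP in opcode group 80 */
    rm  = modrm & 7;         /* 0 = RAX; 4 would mean a SIB byte follows */

    if (mod != 1 || reg != 7 || rm == 4)
        return -1;

    /* disp8 is sign-extended before being added to the base register */
    *ea = base + (uint64_t)(int64_t)(int8_t)insn[2];
    return 0;
}

/* Values taken from the oops: faulting bytes and RAX = 0. */
static uint64_t oops_fault_address(void)
{
    static const uint8_t insn[] = { 0x80, 0x78, 0x51, 0x00 };
    uint64_t rax = 0;
    uint64_t ea = 0;

    if (decode_cmpb_disp8(insn, sizeof(insn), rax, &ea))
        return UINT64_MAX;
    return ea;
}
```

Calling oops_fault_address() returns 0x51, matching CR2. That is consistent with a one-byte field at offset 0x51 being read through a NULL pointer held in RAX.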