Re: [PATCH 5/6] sched/proxy: Remove PROXY_WAKING
From: John Stultz
Date: Tue Jun 02 2026 - 03:03:54 EST
On Mon, Jun 1, 2026 at 10:22 PM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
> On 6/2/2026 2:02 AM, John Stultz wrote:
> > On Mon, Jun 1, 2026 at 3:54 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >> On Tue, May 26, 2026 at 01:16:14PM +0200, Peter Zijlstra wrote:
> >>> From: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> >>>
> >>> Now that the proxy path uses ->is_blocked, use the '->is_blocked &&
> >>> !->blocked_on' state instead of PROXY_WAKING. Notably, this is where a
> >>> blocked_on relation is broken but the donor task might still need a return
> >>> migration.
> >>>
> >>> (Not-yet-)Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> >>> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> >>
> >> Prateek, can I make that a normal SoB from you? I'm thinking I should
> >> merge sched/proxy into sched/core so we can get on with other stuff.
> >
> > Just as a heads up: in stress testing[1] over the weekend with your
> > sched/proxy series, I hit the NULL pointer dereference below, which seems
> > to be another case of pick_eevdf() returning NULL.
> >
> > I'm not sure if this is proxy related or not yet, so I'll be working
> > to reproduce (took ~31 hours to trip this one) and narrow it down.
> > But I'm wondering, given this pick_eevdf() returning null symptom has
> > been a regular issue for various bugs over time, do we need some
> > better debug checks to help narrow these down?
>
> I think PARANOID_AVG sched feat allows for some indication if things
> have gone sideways without crashing but there isn't an easy way to get
> the cfs_rq state which led to the crash without a crash kernel.
>
> >
> > This was using your tree at 4d92e41a046d, plus one workaround for
> > binutils on my system:
> > https://lore.kernel.org/lkml/7b45d196-063e-4e76-b08b-ec2bcc111328@xxxxxxxxxxxxx/
>
> Could you also try merging tip:sched/urgent into this branch and
> rerunning?
>
> commit b6eee96843e8 ("sched/fair: Fix overflow in
> vruntime_eligible()") in v7.1-rc3 moved to using a 128-bit data type for
> the eligibility check and it can catch cases where an overflow in the
> multiplication will cause all entities to appear ineligible.
Oh, good point! I was thinking that was already in there, but it landed later.
Many thanks for pointing this out. I'll get the tests restarted.
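
For anyone following along, here is a rough userspace sketch of the failure
mode described above. It is not the kernel code: the function names, the
simplified "avg >= key * load" form of the check, and the sample values are
all assumptions for illustration, and __int128 is a GCC/Clang extension. The
idea is that a 64-bit product can wrap and make an entity look ineligible,
while widening the product to 128 bits keeps the comparison correct.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified eligibility check: an entity is treated as
 * eligible when avg >= key * load. The 64-bit variant multiplies in
 * uint64_t so the wrap-around is well defined for this demo, then
 * reinterprets the result as signed (two's complement assumed).
 */
static bool eligible_64(int64_t avg, int64_t key, int64_t load)
{
	int64_t prod = (int64_t)((uint64_t)key * (uint64_t)load);

	return avg >= prod;
}

/* Same check with the product widened to 128 bits, so it cannot wrap. */
static bool eligible_128(int64_t avg, int64_t key, int64_t load)
{
	__int128 prod = (__int128)key * load;

	return (__int128)avg >= prod;
}
```

With avg = 0, key = -(1 << 40) and load = 3 << 22, the true product is
-3 * 2^62, which does not fit in 64 bits. The wrapped 64-bit product comes
out as 2^62, so eligible_64() reports the entity as ineligible while
eligible_128() reports it as eligible.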
thanks
-john