Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload

From: Dietmar Eggemann

Date: Tue Apr 14 2026 - 11:47:20 EST


On 13.04.26 21:44, John Stultz wrote:
> On Thu, Apr 9, 2026 at 9:01 PM John Stultz <jstultz@xxxxxxxxxx> wrote:
>>
>> On Thu, Apr 9, 2026 at 8:31 PM Xuewen Yan <xuewen.yan94@xxxxxxxxx> wrote:

[...]

>
> So unfortunately, it looks like I'm going to have to eat my words here. :(
>
> In the android16-6.12 tree, the behavior has been isolated down to a
> backport of commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity
> placement bug causing scheduling lag").
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d71a9c6160479899ee744d2c6d6602a191deb1f
>
> Specifically the `place_entity(cfs_rq, se, 0);` addition in `reweight_entity()`
>
> I was assuming android16-6.12 was missing other related changes
> causing the bad behavior, but Xuewen pointed out similar problems
> could be seen on android17-6.18. I aligned that tree to the
> 6.18-stable branch, and could also reproduce it. Additionally, I
> tested with our android-mainline branch (as of 6.19 the base) and it
> showed the same issue. So this does "look" to be upstream related.
>
> Removing the place_entity() line added in commit 6d71a9c61604 from
> reweight_entity() seems to prevent the behavior.
>
> I've been able to trigger this "bad behavior" on devices using the
> rt-app with the configuration[1] Xuewen first provided (putting 10
> spinners per cpu on the bottom 4 cpus in a background v1 cpu cgroup).
> Then I run `cyclictest -m -t -a --policy=SCHED_OTHER -b 1000000 -D
> 120s` in the root cgroup, and (usually well) within two minutes I'll
> see > 1 second delays in cyclictest.
>
> If I remove the cgroup from the rt-app config, the issue doesn't reproduce.
>
> Unfortunately I've only been able to reproduce this on device, which
> requires the android kernel tree. I've installed an old debian11 image
> on x86 QEMU to be able to utilize cpu v1 cgroup support, but I haven't
> reproduced the issue there, which is a bit confounding.
>
> The issue doesn't immediately trigger, but usually after a few seconds
> of normal behavior I'll start to see cyclictest on one or two of the
> cpus start to trip larger 100ms+ latencies until it hits the 1 second
> boundary.
>
> Late last week I went digging into place_entity() to try to understand
> why it was tripping, but wasn't very successful in narrowing down what
> might be going wrong. I did see that the place_entity() call seems to
> always be on a non-task se, and it's almost always exiting at the `if
> (se->rel_deadline)` case. It does go through the lag calculation
> conditional, but not always.
>
> Anyway, I'm going to continue digging into this, but I just wanted to
> give folks a heads up in case there were any ideas to explore.
>
> thanks
> -john
>
>
> [1] Xuewen's rt-app config:
> {
> "tasks" : {
> "t0" : {
> "instance" : 40,
> "priority" : 0,
> "cpus" : [ 0, 1, 2, 3 ],
> "taskgroup" : "/background",
> "loop" : -1,
> "run" : 200,
> "sleep" : 50
> }
> }
> }
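
For reference, the load that the quoted config puts on the four CPUs can be
estimated with a little arithmetic. The sketch below is only a rough model
under two assumptions that are not stated in the thread: that rt-app's "run"
and "sleep" values are durations in microseconds, and that the 40 instances
spread evenly over the 4 allowed CPUs (they are affine to all four, not
pinned one by one). The helper name estimate_load() is made up for
illustration.

```python
import json

# Xuewen's rt-app config, as quoted above.
CONFIG = """
{
  "tasks": {
    "t0": {
      "instance": 40,
      "priority": 0,
      "cpus": [0, 1, 2, 3],
      "taskgroup": "/background",
      "loop": -1,
      "run": 200,
      "sleep": 50
    }
  }
}
"""


def estimate_load(config_text):
    """Return {task: (tasks_per_cpu, duty_cycle, demand_per_cpu)}.

    Hypothetical helper: assumes an even spread of instances over the
    allowed CPUs and that run/sleep are in the same time unit.
    """
    result = {}
    for name, t in json.loads(config_text)["tasks"].items():
        tasks_per_cpu = t["instance"] / len(t["cpus"])
        # Fraction of time each task wants to run when uncontended.
        duty_cycle = t["run"] / (t["run"] + t["sleep"])
        result[name] = (tasks_per_cpu, duty_cycle, tasks_per_cpu * duty_cycle)
    return result


if __name__ == "__main__":
    for name, (per_cpu, duty, demand) in estimate_load(CONFIG).items():
        print(f"{name}: {per_cpu:.0f} tasks/cpu, duty cycle {duty:.0%}, "
              f"~{demand:.1f}x oversubscribed per cpu")
```

Under those assumptions this gives 10 tasks per CPU, each wanting the CPU 80%
of the time, so the background group alone asks for roughly 8x what each of
the four CPUs can deliver. That matches John's description of "10 spinners
per cpu".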

Just guessing ... does Android do more task reweights
(set_load_weight(p, true))?
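
To illustrate why the number of reweights could matter: a reweight changes
an entity's weight, and the stored virtual lag and relative deadline have to
be rescaled so that the weighted lag (weight * vlag) stays the same. The
sketch below is a simplified model of that rescaling only, not the kernel
code. The function names are made up, and the scaling rule
(value * old_weight / new_weight) is an assumption based on the comment in
reweight_entity() that se->vlag needs to be scaled when the weight changes.
It does not model the place_entity() call under discussion.

```python
def div_trunc(a, b):
    """Signed integer division truncating toward zero, like C's '/'."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


def reweight(vlag, rel_deadline, old_weight, new_weight):
    """Rescale virtual lag and relative deadline for a weight change.

    Hypothetical model: both values live in virtual time, which runs
    inversely proportional to weight, so both scale by old/new.
    """
    new_vlag = div_trunc(vlag * old_weight, new_weight)
    new_rel_deadline = div_trunc(rel_deadline * old_weight, new_weight)
    return new_vlag, new_rel_deadline


if __name__ == "__main__":
    # Halving the weight doubles vlag, so weight * vlag is unchanged.
    vlag, rel_dl = reweight(1000, 3000, old_weight=1024, new_weight=512)
    print(vlag, rel_dl, 1024 * 1000 == 512 * vlag)
```

In this model every reweight is also a point where rounding can creep in, so
a workload that reweights far more often gives any placement problem on that
path many more chances to show up.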

on tip sched/core:

[PATCH v2 0/7] sched: Various reweight_entity() fixes

https://lore.kernel.org/r/20260219075840.162631716@xxxxxxxxxxxxx

db4551e2ba34 - sched/fair: Use full weight to __calc_delta() (2026-02-23 Peter Zijlstra)

101f3498b4bd - sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") (2026-02-23 Peter Zijlstra) <-- !!!

4823725d9d1d - sched/fair: Increase weight bits for avg_vruntime (2026-02-23 Peter Zijlstra)
9fe89f022c05 - sched/fair: More complex proportional newidle balance (2026-02-23 Peter Zijlstra)

v7.0

6e3c0a4e1ad1 - sched/fair: Fix lag clamp (2026-02-23 Peter Zijlstra)
bcd74b2ffdd0 - sched/fair: Only set slice protection at pick time (2026-02-23 Peter Zijlstra)
b3d99f43c72b - sched/fair: Fix zero_vruntime tracking (2026-02-23 Peter Zijlstra)