Re: [PATCH v2] sched/fair: Revert boost in cpu_util()

From: Hongyan Xia

Date: Thu Jun 04 2026 - 22:50:30 EST

Hi Vincent,

Should we fix the runnable_avg calculation first? We only take
max(runnable_avg, util_avg) after all calculations are done. Diff:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353a..050bce06efea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8231,25 +8231,31 @@ static int select_idle_sibling(struct
task_struct *p, int prev, int target)
static unsigned long
cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
{
+ bool add_task = p && task_cpu(p) != cpu && dst_cpu == cpu;
+ bool sub_task = p && task_cpu(p) == cpu && dst_cpu != cpu;
struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
unsigned long runnable;

- if (boost) {
- runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
- util = max(util, runnable);
- }
-
/*
* If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
* contribution. If @p migrates from another CPU to @cpu add its
* contribution. In all the other cases @cpu is not impacted by the
* migration so its util_avg is already correct.
*/
- if (p && task_cpu(p) == cpu && dst_cpu != cpu)
- lsub_positive(&util, task_util(p));
- else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
+ if (add_task)
util += task_util(p);
+ else if (sub_task)
+ lsub_positive(&util, task_util(p));
+
+ if (boost) {
+ runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
+ if (add_task)
+ runnable += task_runnable(p);
+ else if (sub_task)
+ lsub_positive(&runnable, task_runnable(p));
+ util = max(util, runnable);
+ }

if (sched_feat(UTIL_EST)) {
unsigned long util_est;

Hi Dietmar,

On 6/4/2026 9:21 PM, Dietmar Eggemann wrote:
> On 04.06.26 11:21, Hongyan Xia wrote:
>> On 6/4/2026 4:48 PM, Vincent Guittot wrote:
>>> On Thu, 4 Jun 2026 at 10:21, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>>>>
>>>> On 6/4/2026 3:42 PM, Vincent Guittot wrote:
>>>>> On Thu, 28 May 2026 at 04:36, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
>
> [...]
>
>>>>>> Analysis:
>>>>>>
>>>>>> We found several problems that result in the power spike:
>>>>>>
>>>>>> 1. Arithmetic should not happen between util_avg and runnable_avg:
>>>>>>
>>>>>> After util = max(util, runnable) which potentially picks runnable value
>>>>>> in cpu_util(), we then add or subtract task util values from it. This
>>>>>> produces a value that is half-runnable-half-util which is ill-defined.
>>>>>> This alone should be a warning sign. This breaks EAS calculations in
>>>>>> many cases, leading to sub-optimal task placements.
>>>>>
>>>>> This can be easily fixed
>>>>
>>>> I thought about adding or subtracting runnable_avg instead, but that is
>>>> still wrong. Given three tasks each with 100 util, if they wake up at
>>>> the same time and running on the same rq, their util is 100, 100, 100,
>>>> rq total util is 300. Their runnable_avg is 100, 200, 300, rq total
>>>> runnable_avg is 600. If the 1st task leaves the rq, the remaining two
>>>> task runnable_avg will then become 100, 200, giving a total rq
>>>> runnable_avg of 300. However, subtracting the runnable_avg of the 1st
>>>> task gives 600 - 100 = 500, which is very wrong.
>>>
>>> Substracting/adding se.avg.runnable_avg is still the right solution
>>> because this is what will happen if the task migrate
>>
>> One difference is that before runnable boost, runnable_avg really
>> doesn't affect EAS and CPUFreq much. Runnable boost is the first
>> instance where we directly use raw runnable_avg values in EAS and
>> frequency selection, and this value is often too high to be reasonable.
>> I'm mostly arguing that we should use it in proper places (like the one
>> in util_est_update()) and not here.
>>
>> Actually this is the very first fix we tried internally. There is minor
>> improvement in EAS spreading out tasks, but energy regression is pretty
>> much the same, and Youtube is still at 20% regression.
>
> I thought the issue was that, in your low-power test cases, most of the
> tasks involved in these contention scenarios are raising CPU frequency,
> but they are not directly contributing to the workload (including
> Android Graphics Pipeline (AGP) progress). If that's the case, the
> additional power consumption is essentially wasted.
>
> Could you collect some information about these contention events and
> identify the tasks involved?

Our profiling shows that there is a very typical scenario in these
energy regressions, which is IPC.

Quite a lot of apps have per-CPU worker threads taking data from
producers. When a producer sends the first piece of data (mostly using
wake_up_sync()), a worker thread on the same CPU gets woken up, but that
worker is not immediately switched to, which is right because you want
the producer to finish producing data. But, because that worker thread
is already runnable, it starts to shoot up its runnable_avg and the rq
runnable_avg. When the producer is done and we finally context switch to
the worker, the frequency is already super high and it takes time for
the frequency to decay.

Two problems with this:

1. In this common scenario, the producer and per-CPU consumer threads
are not really 'contending'. They are accomplishing the same job but
just split into multiple stages. The 'contention' here should be much
smaller than two independent threads doing two separate jobs.

2. When we finally context switch to the worker thread, the frequency is
already very high. However, the performance is already dominated by how
much data the producer has produced, and the high frequency here is
useless most of the time. Of course a fast consumer helps in some cases,
but in these IPC scenarios it's mostly data-limited.

Actually even without this scenario, the frequency driven by
runnable_avg is still too high, but the above case is where things are
pretty bad.
> By the way, I still assume that all per-task ADPF boosting (including
> AGP boosting) is disabled on your system. I also assume that the number
> of janky frames is negligible, since you're only showing FPS data.

Yep, your assumption is right indeed.

> [...]