Re: [PATCH v2] sched/fair: Revert boost in cpu_util()

From: Vincent Guittot

Date: Fri Jun 05 2026 - 02:49:05 EST


On Fri, 5 Jun 2026 at 04:49, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>
> Hi Vincent,
>
> Should we fix the runnable_avg calculation first? We only take
> max(runnable_avg, util_avg) after all calculations are done. Diff:

Yes. Please submit a proper patch for the below
With a Fixes: tag

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 69361c63353a..050bce06efea 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8231,25 +8231,31 @@ static int select_idle_sibling(struct
> task_struct *p, int prev, int target)
> static unsigned long
> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> {
> + bool add_task = p && task_cpu(p) != cpu && dst_cpu == cpu;
> + bool sub_task = p && task_cpu(p) == cpu && dst_cpu != cpu;
> struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> unsigned long runnable;
>
> - if (boost) {
> - runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
> - util = max(util, runnable);
> - }
> -
> /*
> * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
> * contribution. If @p migrates from another CPU to @cpu add its
> * contribution. In all the other cases @cpu is not impacted by the
> * migration so its util_avg is already correct.
> */
> - if (p && task_cpu(p) == cpu && dst_cpu != cpu)
> - lsub_positive(&util, task_util(p));
> - else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
> + if (add_task)
> util += task_util(p);
> + else if (sub_task)
> + lsub_positive(&util, task_util(p));
> +
> + if (boost) {
> + runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
> + if (add_task)
> + runnable += task_runnable(p);
> + else if (sub_task)
> + lsub_positive(&runnable, task_runnable(p));
> + util = max(util, runnable);
> + }
>
> if (sched_feat(UTIL_EST)) {
> unsigned long util_est;
>
> Hi Dietmar,
>
> On 6/4/2026 9:21 PM, Dietmar Eggemann wrote:
> > On 04.06.26 11:21, Hongyan Xia wrote:
> >> On 6/4/2026 4:48 PM, Vincent Guittot wrote:
> >>> On Thu, 4 Jun 2026 at 10:21, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 6/4/2026 3:42 PM, Vincent Guittot wrote:
> >>>>> On Thu, 28 May 2026 at 04:36, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
> >
> > [...]
> >
> >>>>>> Analysis:
> >>>>>>
> >>>>>> We found several problems that result in the power spike:
> >>>>>>
> >>>>>> 1. Arithmetic should not happen between util_avg and runnable_avg:
> >>>>>>
> >>>>>> After util = max(util, runnable) which potentially picks runnable value
> >>>>>> in cpu_util(), we then add or subtract task util values from it. This
> >>>>>> produces a value that is half-runnable-half-util which is ill-defined.
> >>>>>> This alone should be a warning sign. This breaks EAS calculations in
> >>>>>> many cases, leading to sub-optimal task placements.
> >>>>>
> >>>>> This can be easily fixed
> >>>>
> >>>> I thought about adding or subtracting runnable_avg instead, but that is
> >>>> still wrong. Given three tasks each with 100 util, if they wake up at
> >>>> the same time and running on the same rq, their util is 100, 100, 100,
> >>>> rq total util is 300. Their runnable_avg is 100, 200, 300, rq total
> >>>> runnable_avg is 600. If the 1st task leaves the rq, the remaining two
> >>>> task runnable_avg will then become 100, 200, giving a total rq
> >>>> runnable_avg of 300. However, subtracting the runnable_avg of the 1st
> >>>> task gives 600 - 100 = 500, which is very wrong.
> >>>
> >>> Substracting/adding se.avg.runnable_avg is still the right solution
> >>> because this is what will happen if the task migrate
> >>
> >> One difference is that before runnable boost, runnable_avg really
> >> doesn't affect EAS and CPUFreq much. Runnable boost is the first
> >> instance where we directly use raw runnable_avg values in EAS and
> >> frequency selection, and this value is often too high to be reasonable.
> >> I'm mostly arguing that we should use it in proper places (like the one
> >> in util_est_update()) and not here.
> >>
> >> Actually this is the very first fix we tried internally. There is minor
> >> improvement in EAS spreading out tasks, but energy regression is pretty
> >> much the same, and Youtube is still at 20% regression.
> >
> > I thought the issue was that, in your low-power test cases, most of the
> > tasks involved in these contention scenarios are raising CPU frequency,
> > but they are not directly contributing to the workload (including
> > Android Graphics Pipeline (AGP) progress). If that's the case, the
> > additional power consumption is essentially wasted.
> >
> > Could you collect some information about these contention events and
> > identify the tasks involved?
>
> Our profiling shows that there is a very typical scenario in these
> energy regressions, which is IPC.
>
> Quite a lot of apps have per-CPU worker threads taking data from
> producers. When a producer sends the first piece of data (mostly using
> wake_up_sync()), a worker thread on the same CPU gets woken up, but that
> worker is not immediately switched to, which is right because you want
> the producer to finish producing data. But, because that worker thread
> is already runnable, it starts to shoot up its runnable_avg and the rq
> runnable_avg. When the producer is done and we finally context switch to
> the worker, the frequency is already super high and it takes time for
> the frequency to decay.
>
> Two problems with this:
>
> 1. In this common scenario, the producer and per-CPU consumer threads
> are not really 'contending'. They are accomplishing the same job but
> just split into multiple stages. The 'contention' here should be much
> smaller than two independent threads doing two separate jobs.
>
> 2. When we finally context switch to the worker thread, the frequency is
> already very high. However, the performance is already dominated by how
> much data the producer has produced, and the high frequency here is
> useless most of the time. Of course a fast consumer helps in some cases,
> but in these IPC scenarios it's mostly data-limited.
>
> Actually even without this scenario, the frequency driven by
> runnable_avg is still too high, but the above case is where things are
> pretty bad.
> > By the way, I still assume that all per-task ADPF boosting (including
> > AGP boosting) is disabled on your system. I also assume that the number
> > of janky frames is negligible, since you're only showing FPS data.
>
> Yep, your assumption is right indeed.
>
> > [...]
>