Re: [PATCH v2] sched/fair: Revert boost in cpu_util()

From: Vincent Guittot

Date: Thu Jun 04 2026 - 03:45:12 EST


On Thu, 28 May 2026 at 04:36, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>
> From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
>
> We have seen a massive power consumption regression (20% SoC power
> increase in many apps) after updating our kernel. After bisection we

It's always good to provide more details: kernel version, hardware
and the test conditions.

> pinpointed the regression to the cpu_util(boost) feature. After
> reverting the boost feature the massive energy regression is gone.
> Detailed trace analysis down below. The regression is found across quite
> many apps but Youtube is one of the worst offenders. Some energy
> benchmark numbers are here.
>
> Youtube 1080p60fps video benchmark:
>             FPS     SoC Power   diff
> w/ boost    59.94   913.6mW
> w/o boost   59.93   720.4mW     -21.15%
>
> Mobile Legends (gaming)
>             FPS      sdev   Total power   diff
> w/ boost    120.16   0.47   3294.10mW
> w/o boost   120.07   0.56   2996.09mW     -9.05%
>
> Genshin Impact (gaming, medium quality)
>             FPS     sdev   Total power   diff
> w/ boost    60.05   0.34   6215.84mW
> w/o boost   60.03   0.35   5695.46mW     -8.37%
>
> Signed-off-by: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
>
> ---
> Changed in v2:
> - Sync all comments with code changes.
> - Update commit message with more benchmark numbers.
>
> Analysis:
>
> We found several problems that result in the power spike:
>
> 1. Arithmetic should not happen between util_avg and runnable_avg:
>
> After util = max(util, runnable) which potentially picks runnable value
> in cpu_util(), we then add or subtract task util values from it. This
> produces a value that is half-runnable-half-util which is ill-defined.
> This alone should be a warning sign. This breaks EAS calculations in
> many cases, leading to sub-optimal task placements.
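
To make the unit mix-up in point 1 concrete, here is a minimal sketch
with hypothetical numbers. The struct and helper names are made up for
illustration and are not the kernel's; the clamped subtraction only
mimics what the quoted cpu_util() does after util = max(util, runnable).

```c
/* Simplified stand-in for the per-CPU CFS signals; not the kernel's struct. */
struct cfs_signals {
	unsigned long util_avg;
	unsigned long runnable_avg;
};

/* Subtract without going below zero. */
static unsigned long sub_positive(unsigned long val, unsigned long sub)
{
	return val > sub ? val - sub : 0;
}

/*
 * Hypothetical model of the CPU utilization seen when a task leaves the
 * CPU, with or without the boost.
 */
static unsigned long model_cpu_util_without(struct cfs_signals cpu,
					    unsigned long task_util, int boost)
{
	unsigned long util = cpu.util_avg;

	if (boost && cpu.runnable_avg > util)
		util = cpu.runnable_avg;	/* util = max(util, runnable) */

	/* The task's *util* is removed even when util now holds runnable. */
	return sub_positive(util, task_util);
}
```

With util_avg = 300, runnable_avg = 600 and a task util of 100, the
boosted path returns 500 and the util-only path returns 200. The 500 is
a runnable figure with a util figure subtracted from it.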

This can be easily fixed

>
> 2. The absolute value of runnable_avg is too high to reasonably drive
> frequency:
>
> Schedutil uses runnable _relative_ to util in several places to detect
> contention. However, the _absolute_ value should not be used like util.
> Runnable_avg tends to be significantly higher, making it much easier to
> saturate frequency.
>
> For example, if three tasks each with a util of 100 contend on the same
> rq, the rq util is 300 but runnable_avg shoots up to 600, which is often
> much higher than needed.
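
The example does not say how it arrives at 600. One simplified way to
reproduce the figure, assuming the three tasks wake at the same instant
and run back to back (ignoring PELT decay and time slicing), is that
each task's runnable covers its own run time plus the time spent
waiting behind the tasks ahead of it: 100 + 200 + 300 = 600. The helper
names below are hypothetical.

```c
/* Sum of task utils on the rq: each task only counts its running time. */
static unsigned long model_rq_util(unsigned int nr_tasks,
				   unsigned long task_util)
{
	return nr_tasks * task_util;
}

/*
 * Sum of task runnables on the rq under the assumption above: the task in
 * position i (0-based) waits for i tasks and then runs, so it is runnable
 * for (i + 1) * task_util.
 */
static unsigned long model_rq_runnable(unsigned int nr_tasks,
				       unsigned long task_util)
{
	unsigned long sum = 0;

	for (unsigned int i = 0; i < nr_tasks; i++)
		sum += (i + 1) * task_util;

	return sum;
}
```

For 3 tasks of util 100 this model gives a util of 300 and a runnable
of 600, which matches the numbers quoted above.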

In the email thread for the previous version, you said that using
runnable_avg is good, just not the way it is currently implemented. So
instead of blindly reverting it, please submit a better usage, as this
was added to fix some performance issues.

>
> 3. Runnable_avg may not even reflect true contention:
>
> When tasks are dependent, the bottleneck is often the data flow between
> tasks, not the contention seen by runnable_avg. Boosting frequency with
> runnable in such scenarios wastes power without performance benefits.
>
> We found that 1 causes a minor power regression, while 2 and 3 regress
> power significantly. We have seen multiple applications that use the
> producer-consumer model with many worker threads suffer. When there is
> IPC between producer and consumer, boosting frequency blindly does not
> help performance at all if the consumer is limited by how much data
> flows through. Youtube suffers from 1, 2 and 3 at the same time, leading
> to the total SoC power regression of 20% shown in the results above.

Task contention is a real problem and runnable_avg is one metric that
reflects it.

>
> ---
>  kernel/sched/cpufreq_schedutil.c |  2 +-
>  kernel/sched/fair.c              | 34 ++++++++------------------------
>  kernel/sched/sched.h             |  1 -
>  3 files changed, 9 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index ae9fd211cec1..ba867192513b 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -228,7 +228,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>  	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
>
>  	if (!scx_switched_all())
> -		util += cpu_util_cfs_boost(sg_cpu->cpu);
> +		util += cpu_util_cfs(sg_cpu->cpu);
>  	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>  	util = max(util, boost);
>  	sg_cpu->bw_min = min;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..ecf8b4860951 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8192,7 +8192,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   * @cpu: the CPU to get the utilization for
>   * @p: task for which the CPU utilization should be predicted or NULL
>   * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
> - * @boost: 1 to enable boosting, otherwise 0
>   *
>   * The unit of the return value must be the same as the one of CPU capacity
>   * so that CPU utilization can be compared with CPU capacity.
> @@ -8210,12 +8209,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   * be when a long-sleeping task wakes up. The contribution to CPU utilization
>   * of such a task would be significantly decayed at this point of time.
>   *
> - * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
> - * CPU contention for CFS tasks can be detected by CPU runnable > CPU
> - * utilization. Boosting is implemented in cpu_util() so that internal
> - * users (e.g. EAS) can use it next to external users (e.g. schedutil),
> - * latter via cpu_util_cfs_boost().
> - *
>   * CPU utilization can be higher than the current CPU capacity
>   * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
>   * of rounding errors as well as task migrations or wakeups of new tasks.
> @@ -8226,19 +8219,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   * though since this is useful for predicting the CPU capacity required
>   * after task migrations (scheduler-driven DVFS).
>   *
> - * Return: (Boosted) (estimated) utilization for the specified CPU.
> + * Return: (Estimated) utilization for the specified CPU.
>   */
>  static unsigned long
> -cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> +cpu_util(int cpu, struct task_struct *p, int dst_cpu)
>  {
>  	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
>  	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> -	unsigned long runnable;
> -
> -	if (boost) {
> -		runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
> -		util = max(util, runnable);
> -	}
>
>  	/*
>  	 * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
> @@ -8295,12 +8282,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>
>  unsigned long cpu_util_cfs(int cpu)
>  {
> -	return cpu_util(cpu, NULL, -1, 0);
> -}
> -
> -unsigned long cpu_util_cfs_boost(int cpu)
> -{
> -	return cpu_util(cpu, NULL, -1, 1);
> +	return cpu_util(cpu, NULL, -1);
>  }
>
>  /*
> @@ -8322,7 +8304,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
>  	if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
>  		p = NULL;
>
> -	return cpu_util(cpu, p, -1, 0);
> +	return cpu_util(cpu, p, -1);
>  }
>
>  /*
> @@ -8489,7 +8471,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
>  	int cpu;
>
>  	for_each_cpu(cpu, pd_cpus) {
> -		unsigned long util = cpu_util(cpu, p, -1, 0);
> +		unsigned long util = cpu_util(cpu, p, -1);
>
>  		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
>  	}
> @@ -8513,7 +8495,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
>
>  	for_each_cpu(cpu, pd_cpus) {
>  		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
> -		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> +		unsigned long util = cpu_util(cpu, p, dst_cpu);
>  		unsigned long eff_util, min, max;
>
>  		/*
> @@ -8675,7 +8657,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>  				continue;
>
> -			util = cpu_util(cpu, p, cpu, 0);
> +			util = cpu_util(cpu, p, cpu);
>  			cpu_cap = capacity_of(cpu);
>
>  			/*
> @@ -11848,7 +11830,7 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>  			break;
>
>  		case migrate_util:
> -			util = cpu_util_cfs_boost(i);
> +			util = cpu_util_cfs(i);
>
>  			/*
>  			 * Don't try to pull utilization from a CPU with one
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d..1c934dd126b2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3551,7 +3551,6 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>
>
>  extern unsigned long cpu_util_cfs(int cpu);
> -extern unsigned long cpu_util_cfs_boost(int cpu);
>
>  static inline unsigned long cpu_util_rt(struct rq *rq)
>  {
> --
> 2.47.3
>