Re: [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy

From: Aaron Lu

Date: Mon Jun 01 2026 - 02:19:13 EST

Hi Prateek,

On Thu, May 28, 2026 at 09:48:25AM +0000, K Prateek Nayak wrote:
> The current hierarchy is always throttled in __schedule() during the
> pick when update_curr() detects a cfs_rq running out of the bandwidth
> and issues a resched.
>
> This was necessary prior to per-task throttling where the entire
> throttled hierarchy was dequeued at the point of first throttle during
> the pick but with per-task throttling, tasks continue to run as usual
> until they exit to userspace and dequeue themselves one-by-one until the
> hierarchy is deemed fully throttled and the PELT is frozen.
>
> throttle_cfs_rq() is now simply a propagator of throttle indicators and
> nothing more.
>
> Unify the throttling for current hierarchy under
> account_cfs_rq_runtime() which is responsible for the time accounting.
> If the bandwidth runs out, account_cfs_rq_runtime() will request for
> sched_cfs_bandwidth_slice() and mark the hierarchy as throttled if it
> fails to grab bandwidth.
>
> throttle_cfs_rq() will do a task_throttle_setup_work() if it finds the
> current task to be on a throttled hierarchy and the task will naturally
> dequeue itself when it exits to the userspace without needing an
> explicit resched.
>
> First four patches are cleanups and preparation for the final bit that
> switches over to using account_cfs_rq_runtime() for throttling which was
> provided by Peter in [1].
>
> Following are the results of running hackbench running 3 levels deep
> with the setup from "Testing" section on [2] when compared to
> tip:sched/core:
>
> kernel         :    tip        tip + series
>
> Min            :  207.33       202.20
> Max            :  210.20       222.47
> Median         :  207.83       218.33
> AMean          :  208.29       215.36
> GMean          :  208.29       215.25
> HMean          :  208.29       215.13
> AMean Stddev   :    1.02         7.37
> AMean CoefVar  :    0.49 pct     3.42 pct
>
> All numbers are in seconds.
>
> There is a slight boot to boot variation for this benchmark but the
> utilization numbers in top is more or less similar between the two.
> Additional testing and feedback is always appreciated as usual :-)

I tested hackbench and netperf with quota set on a 2-socket Intel EMR
machine and the results are within the noise range.

Hackbench (in seconds, lower is better)
base: 176.114420±2
head: 176.214394±3

Netperf (throughput, higher is better)
base: 14071, min/max: 13376/15261
head: 14769, min/max: 14095/15588
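
For a rough sense of scale, the relative base-to-head deltas can be
worked out from the figures above. This is only a sketch: the helper
name pct_change() is made up for illustration and is not part of any
test harness used here.

```python
def pct_change(base: float, head: float) -> float:
    """Relative change of head versus base, in percent."""
    return (head - base) / base * 100.0


# Figures copied from the results above.
hackbench = pct_change(176.114420, 176.214394)  # seconds, lower is better
netperf = pct_change(14071, 14769)              # throughput, higher is better

print(f"hackbench: {hackbench:+.2f}%")  # prints "hackbench: +0.06%"
print(f"netperf:   {netperf:+.2f}%")    # prints "netperf:   +4.96%"
```

The netperf head average (14769) also sits inside the base min/max
range (13376 to 15261), which fits with reading the difference as noise.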

Feel free to add my Tested-by tag once the clock warning in patch 3 is
fixed.