[RFC][PATCH 0/8] sched: Flatten the pick
From: Peter Zijlstra
Date: Tue Mar 17 2026 - 06:53:53 EST
Hi!
So cgroup scheduling has always been a pain in the arse. The problems start
with weight distribution and end with hierarchical picks and it all sucks.
The problems with weight distribution are related to that infernal global
fraction:
              tg->w * grq_i->w
  ge_i->w = ------------------
             \Sum_j grq_j->w
which we've approximated reasonably well by now. However, the immediate
consequence of this fraction is that the total group weight (tg->w) gets
fragmented across all your CPUs. At 64 CPUs that means your per-cpu cgroup
weight ends up being about a nice 19 task's worth, and more CPUs only make it
tinier. Combined with the fact that 256 CPU systems are relatively common
these days, this becomes painful.
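To put rough numbers on the fragmentation, here is a minimal sketch of the
fraction above. It assumes a group weight of 1024 (one nice 0 task's worth)
and one nice 0 task of load on every CPU. The helper names and the plain
integer math are made up for illustration; this is not how fair.c computes it
(no load averages, no fixed-point scaling).

```c
#include <stdint.h>

/* Weights as found in the kernel's nice-to-weight table. */
#define NICE_0_WEIGHT	1024
#define NICE_19_WEIGHT	15

/*
 * ge_i->w = tg->w * grq_i->w / \Sum_j grq_j->w
 *
 * Hypothetical helper, plain integer division.
 */
static uint64_t group_entity_weight(uint64_t tg_weight, uint64_t grq_weight,
				    uint64_t sum_grq_weight)
{
	/* No load anywhere: skip the division. Returning the full weight
	 * here is an arbitrary choice for this sketch. */
	if (!sum_grq_weight)
		return tg_weight;
	return tg_weight * grq_weight / sum_grq_weight;
}

/* Assumes identical load on every CPU, so each gets 1/nr_cpus of tg->w. */
static uint64_t even_spread_weight(uint64_t tg_weight, unsigned int nr_cpus)
{
	const uint64_t per_cpu_load = NICE_0_WEIGHT;

	return group_entity_weight(tg_weight, per_cpu_load,
				   per_cpu_load * nr_cpus);
}
```

Under those assumptions even_spread_weight(1024, 64) comes out at 16, right
next to NICE_19_WEIGHT, and even_spread_weight(1024, 256) at 4.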
The common 'solution' is to inflate the group weight by 'nr_cpus'; the
immediate problem with that is that when all load of a group gets concentrated
on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
exceeding nice -20.
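The flip side in numbers, as a sketch under the same kind of assumptions: a
base group weight of 1024, inflated by nr_cpus, with all of the group's load
sitting on one CPU. The function names are hypothetical and only there to
spell out the arithmetic.

```c
#include <stdint.h>

/* Weights as found in the kernel's nice-to-weight table. */
#define NICE_0_WEIGHT		1024
#define NICE_MINUS_20_WEIGHT	88761

/* Group weight after the 'inflate by nr_cpus' workaround. */
static uint64_t inflated_tg_weight(uint64_t tg_weight, unsigned int nr_cpus)
{
	return tg_weight * nr_cpus;
}

/*
 * All load of the group is on a single CPU, so for that CPU
 * grq_i->w == \Sum_j grq_j->w and the fraction hands it the entire
 * (inflated) group weight.
 */
static uint64_t concentrated_weight(uint64_t tg_weight, unsigned int nr_cpus)
{
	return inflated_tg_weight(tg_weight, nr_cpus);
}
```

With these numbers concentrated_weight(1024, 256) is 262144, roughly three
times NICE_MINUS_20_WEIGHT; at 64 CPUs it is 65536, still below it.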
Additionally there are numerical limits on the max weight you can have before
the math starts suffering overflows. As such there is a definite limit on the
total group weight. Which has annoyed people ;-)
The first few patches add a knob /debug/sched/cgroup_mode and a few different
options on how to deal with this. My favourite is 'concur', but obviously that
is also the most expensive one :-/ It adds a tg->tasks counter which makes the
update_tg_load_avg() thing more expensive.
I have some ideas but I figured I ought to share these things before sinking
more time into it.
On to the hierarchical pick; this has been causing trouble for a very long
time. So once again an attempt at flattening it. The basic idea is to keep the
full hierarchical load tracking as-is, but keep all the runnable entities in a
single level. The immediate consequence of all this is of course that we need to
constantly re-compute the effective weight of each entity as things progress.
Reweight is done on:
- enqueue
- pick -- or rather set_next_entity(.first=true)
- tick
So while the {en,de}queue operations are still O(depth) due to the full
accounting mess, the pick is now a single level, removing the intermediate
levels that obscure runnability etc.
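For the sake of discussion, one way to picture an 'effective weight' in a
single-level runqueue: scale the task's own weight by each ancestor group's
weight over the summed weight of the runqueue that group owns. The sketch
below assumes exactly that model. struct toy_entity and its fields are
invented here; they are not struct sched_entity and not necessarily what the
patches compute.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an entity; not the kernel's struct sched_entity. */
struct toy_entity {
	uint64_t weight;		/* weight on the parent's runqueue */
	uint64_t grq_weight;		/* groups only: summed weight of the
					 * runqueue this group owns */
	struct toy_entity *parent;	/* owning group, NULL at the root */
};

/*
 * Assumed model:
 *
 *   w_eff = w_task * \Prod_k (ge_k->w / grq_k->w)
 *
 * over all ancestor groups k. Walks leaf to root, hence O(depth).
 */
static uint64_t effective_weight(const struct toy_entity *se)
{
	uint64_t w = se->weight;
	const struct toy_entity *ge;

	for (ge = se->parent; ge; ge = ge->parent) {
		if (!ge->grq_weight)
			return 0;	/* empty group runqueue */
		w = w * ge->weight / ge->grq_weight;
	}
	return w;
}
```

Example under this model: a root-level task of weight 1024 next to a group of
weight 1024 holding two tasks of weight 1024 each. The grouped tasks each end
up at 512, so the flat runqueue sees 1024:512:512, the same 50/25/25 split the
hierarchical pick would give.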
Anyway, these patches stopped crashing a few days ago and I figured it ought to
be good enough to post.
Can also be had:
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat
---
 include/linux/cpuset.h |    6 +
 include/linux/sched.h  |    1 +
 kernel/cgroup/cpuset.c |   15 +
 kernel/sched/core.c    |   27 +-
 kernel/sched/debug.c   |  188 ++++++---
 kernel/sched/fair.c    | 1032 ++++++++++++++++++++++--------------------------
 kernel/sched/sched.h   |   20 +-
 7 files changed, 641 insertions(+), 648 deletions(-)