[PATCH v3 1/7] sched/fair: Add cgroup_mode switch

From: Peter Zijlstra

Date: Fri Jun 05 2026 - 08:48:12 EST



The effective task weight (W_t') for a task in cgroup g on CPU n is given by:

                          W_t
  W_t' = W_g * F_g_n * ----------
                       \Sum W_t_n

Where W_g is the group's weight (cpu.weight), F_g_n is the fraction of the
group weight for CPU n and W_t/\Sum W_t_n is the relative weight of this task
against all other tasks in the same group on the same CPU.

Furthermore, F_g_n is given by:

           \Sum W_t_n
  F_g_n = ----------
           \Sum W_t

That is, the weight of the group's tasks on CPU n as a fraction of the weight
of all tasks in the group.

The problem is with F_g_n. The primary goal of this fraction is to make sure
that the relative weight of tasks is maintained when they are distributed over
CPUs. For example, consider 4 (equal weight) tasks and 2 CPUs with a 1:3
distribution; if F_g_n were simply 1 (no weight redistribution), the
effective relative weights (W_t') of the tasks in our group would be:

  CPU0    CPU1
  W_g     W_g/3
          W_g/3
          W_g/3

IOW, the lucky task on CPU0 would get as much weight as all 3 tasks on CPU1
combined. However, with the weight redistribution, this becomes:

  CPU0    CPU1
  W_g/4   W_g/4
          W_g/4
          W_g/4

All tasks are equal weight (as intended). However, as is already evident from
this example, the more CPUs you add, the smaller F_g_n becomes, which creates a
disparity against tasks not in our group.
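
As a rough user-space illustration of the two tables above (not kernel code;
the function names and the use of doubles are assumptions made for this
sketch only), the formulas for F_g_n and W_t' can be written as:

```c
/*
 * Illustrative sketch only, not kernel code.
 *
 * sum_cpu: \Sum W_t_n, weight of the group's tasks on CPU n
 * sum_all: \Sum W_t,   weight of all tasks in the group
 */
static double group_fraction(double sum_cpu, double sum_all)
{
	return sum_cpu / sum_all;		/* F_g_n */
}

/*
 * w_g:   group weight (cpu.weight)
 * f_g_n: fraction of the group weight assigned to CPU n
 * w_t:   weight of the task itself
 */
static double effective_weight(double w_g, double f_g_n,
			       double w_t, double sum_cpu)
{
	return w_g * f_g_n * w_t / sum_cpu;	/* W_t' */
}
```

With 4 tasks of weight 1 split 1:3, passing f_g_n = 1 gives W_g on CPU0 and
W_g/3 on CPU1, while passing group_fraction(1, 4) and group_fraction(3, 4)
gives W_g/4 for every task.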

Specifically:

avg(F_g_n) ~ 1/N

This leads to a weight mismatch in the hierarchy. IOW tasks cannot compete
fairly across hierarchy levels.

*Notably*, what is meant by avg(F_g_n) being proportional to 1/N is that when
there are at least N runnable tasks, the average of this fraction tends to 1/N.
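
A small user-space sketch of why the average tends to 1/N (the helper name is
hypothetical, and it assumes every one of the N CPUs runs at least one task of
the group): the per-CPU fractions sum to 1, so their mean over N CPUs is 1/N
regardless of how the tasks are spread.

```c
/*
 * Illustrative sketch only, not kernel code.
 *
 * sum_cpu[i] is \Sum W_t_n for CPU i. Returns avg(F_g_n) over nr_cpus CPUs,
 * or 0 if the group has no runnable weight or nr_cpus is not positive.
 */
static double avg_group_fraction(const double *sum_cpu, int nr_cpus)
{
	double sum_all = 0.0, total = 0.0;
	int i;

	if (nr_cpus <= 0)
		return 0.0;

	for (i = 0; i < nr_cpus; i++)
		sum_all += sum_cpu[i];

	if (sum_all == 0.0)
		return 0.0;

	for (i = 0; i < nr_cpus; i++)
		total += sum_cpu[i] / sum_all;	/* F_g_n for CPU i */

	return total / nr_cpus;
}
```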

For a hierarchy of depth d, this gets even worse, since it introduces terms on
the order of:

avg(F_g_n)^d ~ 1/(N^d)

Given fixed point arithmetic, this also leads to numerical trouble.
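
To make the numerical trouble concrete, here is a user-space sketch that
applies a 1/N fraction once per hierarchy level using truncating integer math.
The 10-bit fixed-point unit and the helper name are assumptions chosen for
illustration; the scheduler's actual scaling differs in detail.

```c
#include <stdint.h>

/* Hypothetical fixed-point unit, for illustration only. */
#define FP_SHIFT	10
#define FP_ONE		(UINT64_C(1) << FP_SHIFT)

/*
 * Scale @weight by ~1/nr_cpus once per level of a hierarchy of @depth.
 * nr_cpus must be non-zero. Each step truncates, as integer math does.
 */
static uint64_t nested_weight(uint64_t weight, unsigned int nr_cpus,
			      unsigned int depth)
{
	uint64_t frac = FP_ONE / nr_cpus;	/* avg(F_g_n) ~ 1/N */

	while (depth--)
		weight = (weight * frac) >> FP_SHIFT;

	return weight;
}
```

Under these assumptions, a weight of 1024 on 64 CPUs becomes 16 at depth 1 and
truncates to 0 at depth 2.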

However, the meaning of "cpu.weight" is simple and intuitive: the total weight
of the cgroup. But as explored above, there is deception in this simplicity.

Prepare to add a few alternative methods for distributing weight.

Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
kernel/sched/debug.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -633,6 +633,76 @@ static void debugfs_fair_server_init(voi
 	}
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static int cgroup_mode = 0;
+
+static const char *cgroup_mode_str[] = {
+	"smp",
+};
+
+static int sched_cgroup_mode(const char *str)
+{
+	for (int i = 0; i < ARRAY_SIZE(cgroup_mode_str); i++) {
+		if (!strcmp(str, cgroup_mode_str[i]))
+			return i;
+	}
+	return -EINVAL;
+}
+
+static ssize_t sched_cgroup_write(struct file *filp, const char __user *ubuf,
+				  size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	int mode;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+	mode = sched_cgroup_mode(strstrip(buf));
+	if (mode < 0)
+		return mode;
+
+	WRITE_ONCE(cgroup_mode, mode);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int sched_cgroup_show(struct seq_file *m, void *v)
+{
+	int mode = READ_ONCE(cgroup_mode);
+
+	for (int i = 0; i < ARRAY_SIZE(cgroup_mode_str); i++) {
+		if (mode == i)
+			seq_puts(m, "(");
+		seq_puts(m, cgroup_mode_str[i]);
+		if (mode == i)
+			seq_puts(m, ")");
+
+		seq_puts(m, " ");
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static int sched_cgroup_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_cgroup_show, NULL);
+}
+
+static const struct file_operations sched_cgroup_fops = {
+	.open		= sched_cgroup_open,
+	.write		= sched_cgroup_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa, *llc;
@@ -686,6 +756,10 @@ static __init int sched_init_debug(void)

 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	debugfs_create_file("cgroup_mode", 0644, debugfs_sched, NULL, &sched_cgroup_fops);
+#endif
+
 	debugfs_fair_server_init();
 #ifdef CONFIG_SCHED_CLASS_EXT
 	debugfs_ext_server_init();