Re: Announcing Sched QoS v0.1-alpha
From: Qais Yousef
Date: Wed Apr 15 2026 - 17:54:17 EST
On 04/15/26 09:57, Christian Loehle wrote:
> > Roles
> > =====
> >
> > This model is based on an existing one shipped in the industry [1] that its
> > users are happy with. It breaks down the tasks' roles into 4 classes, plus
> > a default for untagged tasks:
> >
> > * USER_INTERACTIVE: Requires immediate response
> > * USER_INITIATED: Tolerates short latencies, but must get work done quickly still
> > * UTILITY: Tolerates long delays, but not prolonged ones
> > * BACKGROUND: Doesn't mind prolonged delays
> > * DEFAULT: All untagged tasks will get this category which will map to utility.
> >
> > EEVDF should allow us to describe these different levels by specifying
> > a different runtime (custom slice) for each class. The shortest slice should
> > still be long enough not to sacrifice throughput. Nice values will operate as
> > bandwidth control so that long running user_interactive tasks can't be starved
> > by long running background ones if they have to run on the same CPU under
> > overloaded scenarios. uclamp_max helps constrain the power impact and access
> > to the expensive highest performance levels.
> >
> > Mapping
> > -------
> >
> > {
> >     "QOS_USER_INTERACTIVE": {
> >         "sched_policy": "SCHED_NORMAL",
> >         "sched_nice": -4,
> >         "sched_runtime": 8000000,
> >         "sched_util_max": 1024
> >     },
> >     "QOS_USER_INITIATED": {
> >         "sched_policy": "SCHED_NORMAL",
> >         "sched_nice": -2,
> >         "sched_runtime": 12000000,
> >         "sched_util_max": 768
> >     },
> >     "QOS_UTILITY": {
> >         "sched_policy": "SCHED_BATCH",
> >         "sched_nice": 2,
> >         "sched_runtime": 16000000,
> >         "sched_util_max": 512
> >     },
> >     "QOS_BACKGROUND": {
> >         "sched_policy": "SCHED_BATCH",
> >         "sched_nice": 4,
> >         "sched_runtime": 20000000,
> >         "sched_util_max": 256
> >     },
> >     "QOS_DEFAULT": {
> >         "sched_policy": "SCHED_BATCH",
> >         "sched_nice": 2,
> >         "sched_runtime": 16000000,
> >         "sched_util_max": 512
> >     }
> > }
> >

> Is my understanding correct that this is device-agnostic?

Ideally, yes.

> In particular sched_util_max seems very platform-dependent?

How come?

> Also these could very well all land on the same big cluster and be effectively
> void then.

Who said we are trying to avoid the big cluster? This setup is meaningful for
SMP systems too FWIW.

> And I don't think I generally buy the argument that uclamp_max is even a good
> generic way to save power, not with these fixed (and arbitrary) values (e.g.

Who said they are arbitrary? What is the right generic way to save power?

> 256 might be just the threshold to allow/require very inefficient OPPs that
> happen frequently, in particular with the instability of utilization values
> under sched_util_max / restricted compute capacity).

Have you actually encountered this scenario frequently? What was the power
impact?

Anyways. It seems you're jumping to a number of wrong conclusions. Let me see
if I can provide a helpful, comprehensive answer.

The mapping is a simple config file that is meant to address as many systems as
possible so that it just works. But if there's a case where an admin thinks
they know better, it is a simple matter of updating this config file locally
with their 'fine tuned' values and running `sudo schedqos restart --daemon` for
the changes to take effect. Not sure if you read USAGE.md, but restarting the
daemon is enough: running applications will see the updated values without
having to be restarted themselves. There's also an option, --configs-path, to
provide a path to your own config files.
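
A minimal sketch, in Python, of producing such a locally tuned copy of the
mapping. The helper name apply_overrides and the 384 value are hypothetical and
only for illustration; the file name and layout the daemon expects under
--configs-path are not shown here, so check USAGE.md before writing the result
anywhere.

```python
import copy
import json

# Two classes copied from the mapping quoted above (abbreviated on purpose).
DEFAULT_MAPPING = {
    "QOS_USER_INITIATED": {
        "sched_policy": "SCHED_NORMAL",
        "sched_nice": -2,
        "sched_runtime": 12000000,
        "sched_util_max": 768,
    },
    "QOS_BACKGROUND": {
        "sched_policy": "SCHED_BATCH",
        "sched_nice": 4,
        "sched_runtime": 20000000,
        "sched_util_max": 256,
    },
}


def apply_overrides(mapping, overrides):
    """Return a copy of mapping with per-class fields replaced by overrides.

    Hypothetical helper: not part of schedqos.
    """
    tuned = copy.deepcopy(mapping)
    for qos_class, fields in overrides.items():
        if qos_class not in tuned:
            raise KeyError(f"unknown QoS class: {qos_class}")
        tuned[qos_class].update(fields)
    return tuned


# Made-up local tuning: let background tasks reach slightly higher.
tuned = apply_overrides(DEFAULT_MAPPING, {"QOS_BACKGROUND": {"sched_util_max": 384}})
print(json.dumps(tuned, indent=4))
```

The default mapping is left untouched, so the stock values and the tuned ones
can be compared side by side while experimenting.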

The uclamp_max values are not arbitrary: we have 4 levels, hence the 25%
increment (100%/4). Since we are told USER_INTERACTIVE tasks are the most
impactful, they are the only ones allowed to reach max perf. Given the
exponential nature of the power/perf curve, cutting off these top frequencies
will make a tangible difference for all tasks that are deemed 'tolerant of some
delays'.
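
The 100%/4 arithmetic can be spelled out as a short Python sketch. The function
name uclamp_max_steps is hypothetical; the only inputs taken from the mapping
are the four class names and the 1024 full-scale value.

```python
SCHED_CAPACITY_SCALE = 1024  # full-scale uclamp value used in the mapping

# Classes ordered from most to least tolerant of delays.
CLASSES = [
    "QOS_BACKGROUND",
    "QOS_UTILITY",
    "QOS_USER_INITIATED",
    "QOS_USER_INTERACTIVE",
]


def uclamp_max_steps(classes, scale=SCHED_CAPACITY_SCALE):
    """Give each class one more equal step of the scale (100%/N per step)."""
    n = len(classes)
    return {name: scale * (i + 1) // n for i, name in enumerate(classes)}


steps = uclamp_max_steps(CLASSES)
for name, cap in steps.items():
    print(f"{name}: sched_util_max={cap} ({100 * cap // SCHED_CAPACITY_SCALE}%)")
```

This reproduces the 256/512/768/1024 values in the mapping. QOS_DEFAULT is not
listed because it simply reuses the utility values.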

Can someone do better for a specific system and use case? Sure. Do we block
them? No, as already answered above.

We don't care about capping to any particular CPU. This is a scheduler detail
that must be handled transparently. The power saving should still be impactful
compared to what you get today, since now only a select few tasks can drive
the system to the top, and very expensive, performance level.

As for the inefficient OPPs, it seems you're referring to uclamp_max
potentially not allowing us to go to the next frequency even if it is more
efficient. I need to reinspect the code to confirm this will happen, but
assuming it does, I doubt that 256 will frequently land on those frequencies.
And when it does, it would not be really bad in practice. Why? Because today
tasks are free to consume max frequencies and we are doing okay. Capping them
to 25% and 50% is much better than what you get today. The power difference
between efficient and inefficient OPPs is surely much (much!) smaller than the
difference between running at max vs 50% and 25%. So it is a question of minor
fine tuning rather than spilling power left and right. I.e., it is still a very
impactful generic setup.
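
To put rough numbers on the max vs 50% and 25% comparison, here is a
back-of-the-envelope Python sketch. It rests on two loud assumptions that are
not from this thread: that the cap maps linearly to frequency, and that power
scales with the cube of frequency. Real systems have discrete OPPs, static
power and different curves, so treat the output as an illustration of the
shape of the argument, not as measured data.

```python
SCHED_CAPACITY_SCALE = 1024


def relative_power(freq_fraction, exponent=3.0):
    """Power relative to running at max, ASSUMING P ~ f**exponent.

    The cubic exponent is an illustrative assumption, not a measurement.
    """
    return freq_fraction ** exponent


for cap in (1024, 768, 512, 256):
    # ASSUMPTION: the uclamp_max cap maps linearly to a frequency fraction.
    frac = cap / SCHED_CAPACITY_SCALE
    print(f"cap={cap:4d} freq={frac:.2f} relative power={relative_power(frac):.3f}")
```

Under these assumptions a task capped at 512 draws about an eighth of the power
it would at max while it runs, which is the order of magnitude any
efficient-vs-inefficient OPP penalty would have to be weighed against.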

As for the platform dependence, I hope the above answered it already, but if
not: we really don't need to be precise. The goal is to restrict lower classes
from accessing higher freqs. Cutting 25% of the performance level (i.e.
frequencies) at each class is a good enough description to shift frequency
usage towards the lower ones on any system.

Evolving this to do better is not hard if you have better ideas. And this is
what we are enabling now. You can tweak these mappings, run experiments and
iterate very quickly and easily. And if you have a better way to do the
mapping, we can just update it and let everyone else reap the benefit ;-)

For starters, I think these 25% increments should do well initially and might
not require fancier logic. But time will tell I guess :-)

For the record, I am worried about USER_INITIATED being capped to 75%. But
I think it is important to deliver a meaningful perf and power difference
between classes. Since iterating and improving is 'cheap', I'd rather push my
luck first and get data to demonstrate how we need to do better.

Go ahead and give it a go. And if you hit corner cases where it fails, let us
know so we can look at how we can do better.