Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
From: Peter Zijlstra
Date: Mon May 11 2026 - 07:03:14 EST
On Mon, May 04, 2026 at 02:59:58AM +0100, Qais Yousef wrote:
> Provide a generic and extensible interface to describe arbitrary QoS
> tags to tell the kernel about specific behavior that doesn't fall
> into the existing sched_attr.
>
> The interface is broken into three parts:
>
> * Type
> * Value
> * Cookie
>
> Type is an enum that should give us enough space to extend (and
> deprecate) comfortably.
>
> Value is a signed 64-bit number to allow for arbitrarily high values.
>
> Cookie helps group tasks selectively, since some QoS hints might want to
> operate on tasks per group. A value of 0 indicates system wide.
>
> There are two anticipated users being discussed on the list.
>
> 1. Per task rampup multiplier to allow controlling how fast util rises,
> and by implication how fast the task can migrate between cores on HMP
> systems and cause freqs to rise with schedutil.
>
> 2. Tag a group of tasks that are memory dependent for Cache Aware
> Scheduling.
>
> The interface is anticipated to be provisioned to apps via utilities and
> libraries. schedqos [1] is an example of how such an interface can be used
> to provide higher level QoS abstraction to describe workloads without
> baking it into the binaries, and by implication without worrying about
> potential abuse. The interface requires privileged access since QoS is
> considered a scarce resource and requires admin control to ensure it is
> set properly. Again that admin control is anticipated to be the schedqos
> utility service.
>
> QoS is treated as a scarce resource and the intention is for
> a syscall to be done for each individual QoS tag. QoS tags are also not
> inherited on fork by default, for the same reason.
>
> A reasonable point of debate is whether to make the sched_qos an array
> of 3 or 5 values to avoid a potential bottleneck if this grows large and
> users end up having to issue too many syscalls to set all QoS. Being
> limited as it is now helps enforce intentionality and scarcity of
> tagging.
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Scheduler QoS
> +=============
> +
> +1. Introduction
> +===============
> +
> +Different workloads have different scheduling requirements to operate
> +optimally. The same applies to tasks within the same workload.
> +
> +To enable smarter usage of system resources and to cater for the conflicting
> +demands of various tasks, Scheduler QoS offers a mechanism to provide more
> +information about those demands so that the scheduler can make a best
> +effort to honour them.
> +
> + @sched_qos_type what QoS hint to apply
> + @sched_qos_value value of the QoS hint
> + @sched_qos_cookie magic cookie to tag a group of tasks for which the QoS
> + applies. If 0, the hint will apply globally system
> + wide. If not 0, the hint will be relative to tasks that
> + have the same cookie value only.
> +
> +QoS hints are set once and not inherited by children by design. The
> +rationale is that each task has its individual characteristics and it is
> +encouraged to describe each of these separately. Also since system resources
> +are finite, there's a limit to what can be done to honour these requests
> +before reaching a tipping point where there are too many requests for
> +a particular QoS that is impossible to service for all of them at once and
> +some will start to lose out. For example if 10 tasks require better wake
> +up latencies on a 4-CPU SMP system, then if they all wake up at once, only
> +4 can perceive the hint honoured and the rest will have to wait. Inheritance
> +can lead these 10 to become 100 or 1000 more easily, and then the QoS
> +hint will lose its meaning and effectiveness rapidly. The chances of 10
> +tasks waking up at the same time are lower than for 100, and lower still
> +than for 1000.
> +
> +To set multiple QoS hints, a syscall is required for each. This is a
> +trade-off to reduce the churn of extending the interface: the hope is for
> +this to evolve as workloads and hardware get more sophisticated and the
> +need for extension arises; when this happens, it should be simpler to
> +add the kernel extension and let userspace use it readily by setting the
> +newly added flag without having to update the whole of sched_attr.
So 'type' is effectively meant to be an ephemeral space of hints. A
kernel can, or can not, support this arbitrary set of hints.

If a particular type is supported across two kernels, it is assumed to
be the same -- although its implementation might be different.

Your next patch implements type-0 to be this pelt multiplier thing.

I wonder about discoverability, suppose we create and discard a fair
number of these types, just because. Then how is someone (this
muddle-ware component for example) to discover which set of hints is
supported by the kernel of the day?

I suppose it can go and scan the space, by trying to set hints on itself
or something, but that seems sub-optimal.
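
To make the scan concrete, here is a minimal userspace sketch of what
such probing could look like. It is not the interface from this series:
mock_set_qos_hint() is a hypothetical stand-in for the real syscall,
QOS_TYPE_SCAN_MAX is an arbitrary bound chosen for illustration, and the
mock simply pretends that only types 0 and 3 exist.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical upper bound on the type space; not taken from the patch. */
#define QOS_TYPE_SCAN_MAX 64

/*
 * Hypothetical stand-in for the real syscall. This mock pretends the
 * running kernel implements only types 0 and 3: it returns 0 for those
 * and -EINVAL for everything else.
 */
static int mock_set_qos_hint(unsigned int type, int64_t value, uint64_t cookie)
{
	(void)value;
	(void)cookie;
	return (type == 0 || type == 3) ? 0 : -EINVAL;
}

/*
 * Probe every candidate type by trying to set it on the calling task.
 * Supported types are recorded as bits in *supported; the return value
 * is the number of types that were accepted.
 */
static unsigned int scan_qos_types(uint64_t *supported)
{
	unsigned int type, count = 0;

	*supported = 0;
	for (type = 0; type < QOS_TYPE_SCAN_MAX; type++) {
		if (mock_set_qos_hint(type, 0, 0) == 0) {
			*supported |= UINT64_C(1) << type;
			count++;
		}
	}
	return count;
}
```

Even in this toy form the cost shows: one call per candidate type, and
against a real kernel a successful probe would presumably modify the
caller's own hints, while a failure could not easily be told apart from
a rejected value.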