Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)

From: Marcel Ziswiler
Date: Sun May 25 2025 - 15:29:30 EST


Hi Luca

On Fri, 2025-05-23 at 21:46 +0200, luca abeni wrote:
> Hi Marcel,
>
> sorry, but I have some additional questions to fully understand your
> setup...

No problem, I am happy to answer any questions :)

> On Mon, 19 May 2025 15:32:27 +0200
> Marcel Ziswiler <marcel.ziswiler@xxxxxxxxxxxxxxx> wrote:
> [...]
> > > just a quick question to better understand your setup (and check
> > > where the issue comes from):
> > > in the email below, you say that tasks are statically assigned to
> > > cores; how did you do this? Did you use isolated cpusets, 
> >
> > Yes, we use the cpuset controller from the cgroup-v2 APIs in the
> > linux kernel in order to partition CPUs and memory nodes. In detail,
> > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice
> > configurations.
>
> How do you configure systemd? I am having troubles in reproducing your
> AllowedCPUs configuration... This is an example of what I am trying:
> sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1
> sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3
> sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3
> sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3
> and then I try to run a SCHED_DEADLINE application with
> sudo systemd-run --scope -p Slice=custom-workload.slice <application>

We just use a bunch of systemd configuration files as follows:

[root@localhost ~]# cat /lib/systemd/system/monitor.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
[Unit]
Description=Prioritized slice for the safety monitor.
Before=slices.target

[Slice]
CPUWeight=1000
AllowedCPUs=0
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit

[Install]
WantedBy=slices.target

[root@localhost ~]# cat /lib/systemd/system/safety1.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
[Unit]
Description=Slice for Safety case processes.
Before=slices.target

[Slice]
CPUWeight=1000
AllowedCPUs=1
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit

[Install]
WantedBy=slices.target

[root@localhost ~]# cat /lib/systemd/system/safety2.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
[Unit]
Description=Slice for Safety case processes.
Before=slices.target

[Slice]
CPUWeight=1000
AllowedCPUs=2
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit

[Install]
WantedBy=slices.target

[root@localhost ~]# cat /lib/systemd/system/safety3.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
[Unit]
Description=Slice for Safety case processes.
Before=slices.target

[Slice]
CPUWeight=1000
AllowedCPUs=3
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit

[Install]
WantedBy=slices.target

[root@localhost ~]# cat /lib/systemd/system/system.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only

#
# This slice will control all processes started by systemd by
# default.
#

[Unit]
Description=System Slice
Documentation=man:systemd.special(7)
Before=slices.target

[Slice]
CPUQuota=150%
AllowedCPUs=0
MemoryAccounting=true
MemoryMax=80%
ManagedOOMSwap=kill
ManagedOOMMemoryPressure=kill

[root@localhost ~]# cat /lib/systemd/system/user.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only

#
# This slice will control all processes started by systemd-logind
#

[Unit]
Description=User and Session Slice
Documentation=man:systemd.special(7)
Before=slices.target

[Slice]
CPUQuota=25%
AllowedCPUs=0
MemoryAccounting=true
MemoryMax=80%
ManagedOOMSwap=kill
ManagedOOMMemoryPressure=kill
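
To see what the kernel actually ends up with for these slices, something like the following could be used. This is only a minimal sketch, not part of our setup: it assumes cgroup v2 is mounted at /sys/fs/cgroup and that the slices sit directly below the root, and the helper name show_cpuset and the CGROUP_ROOT variable are made up for illustration. It prints cpuset.cpus, cpuset.cpus.effective and cpuset.cpus.partition for each slice; the last one reports whether a cpuset is a plain "member" or a partition root.

```shell
#!/bin/sh
# Hypothetical helper: dump the cpuset state of the slices shown above.
# Assumption: cgroup v2 is mounted at /sys/fs/cgroup (override via CGROUP_ROOT).
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

show_cpuset() {
    slice="$1"
    dir="$CGROUP_ROOT/$slice"
    # These three files are part of the cgroup v2 cpuset controller interface.
    for f in cpuset.cpus cpuset.cpus.effective cpuset.cpus.partition; do
        if [ -r "$dir/$f" ]; then
            printf '%s %s=%s\n' "$slice" "$f" "$(cat "$dir/$f")"
        else
            # The slice may not exist yet or the controller may not be enabled.
            printf '%s %s=<unavailable>\n' "$slice" "$f"
        fi
    done
}

for s in monitor.slice safety1.slice safety2.slice safety3.slice \
         system.slice user.slice; do
    show_cpuset "$s"
done
```

If cpuset.cpus.partition reads "member" for a slice, that cpuset is not a partition root of its own, which would match the situation described in the quoted text below.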

> However, this does not work because systemd is not creating an isolated
> cpuset... So, the root domain still contains CPUs 0-3, and the
> "custom-workload.slice" cpuset only has CPU 1. Hence, the check
>                         /*
>                          * Don't allow tasks with an affinity mask smaller than
>                          * the entire root_domain to become SCHED_DEADLINE. We
>                          * will also fail if there's no bandwidth available.
>                          */
>                         if (!cpumask_subset(span, p->cpus_ptr) ||
>                             rq->rd->dl_bw.bw == 0) {
>                                 retval = -EPERM;
>                                 goto unlock;
>                         }
> in sched_setsched() fails.
>
>
> How are you configuring the cpusets?

See above.

> Also, which kernel version are you using?
> (sorry if you already posted this information in previous emails and I am
> missing something obvious)

I am not even sure whether I explicitly mentioned it, other than saying that we always run the latest stable.

Two months ago, when we last ran some extensive tests on this, it was actually v6.13.6.

> Thanks,

Thank you!

> Luca

Cheers

Marcel