Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention

From: Peter Zijlstra

Date: Fri Mar 20 2026 - 06:09:20 EST

On Mon, Jul 21, 2025 at 02:10:23PM +0800, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on an HCC system, significant
> cache line contention is observed around `cpupri_vec->count` and `mask` in
> struct root_domain.
>
> The SUT is a 2-socket machine with 240 physical cores and 480 logical
> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.
>
> perf c2c tool reveals:
> root_domain cache line 3:
> - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
> and contends with other fields, since counts[0] is updated more often
> than the others: whenever an RT task is enqueued on an empty runq or
> dequeued from a non-overloaded runq.
> - cycles per load: ~10K to 59K
>
> cpupri's last cache line:
> - `cpupri_vec->count` and `mask` contend. The transcoding threads use
> RT priority 99, so the contention occurs at the end.
> - cycles per load: ~1.5K to 10.5K
>
> This change mitigates the contention on `cpupri_vec->count` and `mask`
> by placing each count and mask on different cache lines.

Right.

> Note: The side effect of this change is that struct cpupri size is
> increased from 26 cache lines to 203 cache lines.

That is pretty horrible, but probably unavoidable.

> An alternative implementation of this patch could be separating `counts`
> and `masks` into 2 vectors in cpupri_vec (counts[] and masks[]), and
> add two paddings:
> 1. Between counts[0] and counts[1], since counts[0] is more frequently
> updated than others.

That is completely workload specific; it is a direct consequence of your
(probably busted) priority assignment scheme.

> 2. Between the two vectors, since counts[] is read-write, while
> masks[] is read-only when it stores pointers.
>
> The alternative adds complexity (31+/21- LoC) and achieves almost the
> same performance, while reducing struct cpupri from 26 cache lines to
> 21 cache lines.

That is not an alternative, since it very specifically only deals with
fifo-99 contention.

> ---
> kernel/sched/cpupri.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
>
>  struct cpupri_vec {
>  	atomic_t		count;
> -	cpumask_var_t		mask;
> +	cpumask_var_t		mask ____cacheline_aligned;
>  };

At the very least this needs a comment, explaining the what and how of
it.