Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width

From: Zhang Qiao

Date: Mon May 18 2026 - 04:03:05 EST

On 2026/5/11 23:54, Dietmar Eggemann wrote:
> On 29.04.26 04:43, Zhang Qiao wrote:
>>
>> Hi,
>>
>> On 2026/4/22 21:26, Dietmar Eggemann wrote:
>>> On 16.04.26 09:41, Zhang Qiao wrote:
>>>> Hi Shrikanth,
>>>>
>>>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>>>> Hi.
>>>>>
>>>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>
> [...]
>
>>>> The workload is a producer-consumer model: one producer wakes up ~50
>>>> different consumers, with roughly 10+ consumers running concurrently.
>>>> The total number of tasks is well below the CPU count.
>>>
>>> But higher than your MC core count I believe? Otherwise you wouldn't
>>> care. I assume you have MC CPU count of 12-24. Do you have more than 2
>>> different MCs?
>>
>> My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
>> (16 threads with SMT-2).
>
> Thanks.
>
>>>> In this scenario, load balancing is largely ineffective. Each consumer
>>>> spends most of its time sleeping, gets woken by the producer, runs
>>>> briefly to process the message, then goes back to sleep. There is
>>>> almost no window where a consumer sits on a CPU runqueue in the runnable
>>>> state waiting to be pulled. Since load balancing can only migrate
>>>> runnable tasks, it simply has no target to act on here.
>>>
>>> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a
>>
>> SD_BALANCE_WAKE was not enabled in my tests.
>
> Right, looks like I mixed up balance flags & fast/slow path with the
> wake affine vs. wake wide logic.
>
>>> difference in behaviour on an SMT machine in terms of waking tasks wide,
>>> i.e. going through the slow path. Like I tried to explain in the
>>> adjacent thread, your wakees would only end up in the slow path in case
>>> your sched domains would have SD_BALANCE_WAKE set.
>>>
>>> Or do you just want to force wakeups which have wake_wide(p) return 1
>>> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
>>> be wake wide?
>>
>> The observed improvement comes from suppressing wake_affine() before it
>> pulls wakees onto the waker's physical core. In the producer-consumer
>> workload, without this patch, consumers are repeatedly affined into the
>> waker's LLC and end up co-scheduled on the same physical core's SMT
>> siblings. With the patch, wake_wide() fires earlier and wakees are left
>> on prev_cpu, resulting in better spread across physical cores.
>
> Makes sense.
>
> You mentioned having ~10+ consumers running concurrently. I’m curious
> why select_idle_sibling() isn’t doing a better job of distributing those
> tasks across idle cores, even though wakeups are affine to the waker and
> its LLC domain. Is this because you only have 8 cores per LLC, combined
> with general system noise?

Yes, exactly. Each LLC has only 8 physical cores (16 threads with SMT-2).
When more than 8 consumers are woken into the same LLC domain, the number
of running tasks exceeds the physical core count, and SMT siblings are
forced to share execution resources, causing the interference we observed.
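
To make the arithmetic concrete, here is a rough user-space sketch (not
kernel code). min_shared_cores() is a hypothetical helper named only for
this illustration, and it assumes the best case where tasks are placed
one per physical core before any SMT sibling is used:

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only (not a kernel function).
 *
 * Returns the smallest number of physical cores that must run two or
 * more tasks at once when nr_running tasks are confined to nr_cores
 * physical cores, assuming ideal placement (one task per core first).
 */
unsigned int min_shared_cores(unsigned int nr_running, unsigned int nr_cores)
{
	unsigned int excess;

	assert(nr_cores > 0);

	/* Every task can have a physical core to itself. */
	if (nr_running <= nr_cores)
		return 0;

	/* Each task beyond nr_cores must land on an already busy core. */
	excess = nr_running - nr_cores;

	/* At most every core in the domain can end up shared. */
	return excess < nr_cores ? excess : nr_cores;
}
```

With 10 consumers affined into one 8-core LLC, min_shared_cores(10, 8)
gives 2, so at least two cores have both SMT siblings busy even with
perfect placement. Spread over all 10 LLCs (80 physical cores),
min_shared_cores(10, 80) gives 0.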

Thanks,
Zhang Qiao
