Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity

From: Balbir Singh

Date: Sun Mar 29 2026 - 17:36:21 EST

On 3/29/26 09:50, Andrea Righi wrote:
> Hi Balbir,
>
> On Sun, Mar 29, 2026 at 12:03:19AM +1100, Balbir Singh wrote:
>> On 3/27/26 02:02, Andrea Righi wrote:
>>> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
>>> introducing SMT awareness.
>>>
>>> = Problem =
>>>
>>> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
>>> sibling is busy, because the physical core doesn't deliver its full nominal
>>> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
>>> that are not actually good destinations.
>>>
>>> = Proposed Solution =
>>>
>>> This patch set aligns those paths with a simple rule already used
>>> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
>>> partially idle SMT siblings as full-capacity targets where that would
>>> mislead load balance.
>>
>> In kernel/sched/topology.c
>>
>> /* Don't attempt to spread across CPUs of different capacities. */
>> if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
>> 	sd->child->flags &= ~SD_PREFER_SIBLING;
>>
>> Should handle the selection, but I guess this does not work for SMT level sd's?
>
> IIUC, SD_PREFER_SIBLING steers load balance toward sibling_imbalance()
> (spread runnables across child/sibling domains), it doesn't encode the
> fully-idle core first logic. In practice it doesn't give us SMT-aware
> destination choice when a sibling is busy and this series is trying to
> cover that gap in the placement path.
>

Thanks. So we care about idle selection, not necessarily balancing. And yes, I did
see that sd->child needs to be set for SD_PREFER_SIBLING to be cleared.

> BTW, on Vera the hierarchy is SMT -> MC -> NUMA:
>
> root@localhost:~# grep . /sys/kernel/debug/sched/domains/cpu0/domain*/flags
> /sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
> /sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_SHARE_LLC
> /sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SERIALIZE SD_NUMA
>
> And domain1/groups_flags (child / SMT flags on the sched groups used at the
> MC level) still has SD_PREFER_SIBLING together with SD_SHARE_CPUCAPACITY.
>
> root@localhost:~# cat /sys/kernel/debug/sched/domains/cpu0/domain1/groups_flags
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
>
> So, prefer-sibling is still in play for SMT (including via MC
> groups_flags). On machines where asymmetry attaches immediately above SMT,
> topology may strip that flag and reduce this branch of behavior, but
> explicit SMT-aware placement still matters.
>
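To make the "fully idle core first, with a fallback" rule concrete, here is
a rough userspace sketch. It is not the code from the series and does not
use select_idle_capacity() or any kernel data structure; struct toy_cpu and
the toy_* helpers are hypothetical names, and capacity fitting is ignored
entirely. It only illustrates the ordering of the two choices:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy per-CPU state; hypothetical, not a kernel structure. */
struct toy_cpu {
	int core;	/* physical core id; SMT siblings share it */
	bool idle;	/* true when nothing runs on this logical CPU */
};

/* A core is fully idle only if every logical CPU on it is idle. */
static bool toy_core_fully_idle(const struct toy_cpu *cpus, size_t n, int core)
{
	for (size_t i = 0; i < n; i++)
		if (cpus[i].core == core && !cpus[i].idle)
			return false;
	return true;
}

/*
 * Return the index of an idle CPU on a fully idle core if one exists.
 * Otherwise fall back to the first idle CPU (its sibling is busy).
 * Return -1 when no CPU is idle at all.
 */
static int toy_pick_idle_cpu(const struct toy_cpu *cpus, size_t n)
{
	int fallback = -1;

	for (size_t i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (toy_core_fully_idle(cpus, n, cpus[i].core))
			return (int)i;
		if (fallback < 0)
			fallback = (int)i;
	}

	return fallback;
}
```

The fallback matters because an idle sibling on a busy core is still a
better destination than no idle CPU at all; it is just not treated as a
full-capacity target.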
>>>
>>> Patch set summary:
>>>
>>> - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>>>
>>> Prefer fully-idle SMT cores in asym-capacity idle selection. In the
>>> wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
>>> idle selection can prefer CPUs on fully idle cores, with a safe fallback.
>>>
>>> - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>>>
>>> Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
>>> Provided for consistency with PATCH 1/4.
>>>
>>> - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>>>
>>> Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
>>> consistency with PATCH 1/4. I've also tested with/without
>>> /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
>>> noticed any regression.
>>>
>>> - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>>>
>>> When choosing the housekeeping CPU that runs the idle load balancer,
>>> prefer an idle CPU on a fully idle core so migrated work lands where
>>> effective capacity is available.
>>>
>>> The change is still consistent with the same "avoid CPUs with busy
>>> sibling" logic and it shows some benefits on Vera, but it could have a
>>> negative impact on other systems. I'm including it for completeness
>>> (feedback is appreciated).
>>>
>>> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
>>> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
>>> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>>>
>>
>> Are you referring to nominal_freq?
>>
>
> Correct.
>

Thanks,
Balbir