Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection

From: Andrea Righi

Date: Thu Mar 19 2026 - 10:01:31 EST


Hi Christian,

On Thu, Mar 19, 2026 at 11:58:39AM +0000, Christian Loehle wrote:
> On 3/18/26 17:09, Andrea Righi wrote:
> > Hi Christian,
> >
> > On Wed, Mar 18, 2026 at 03:43:26PM +0000, Christian Loehle wrote:
> >> On 3/18/26 10:31, Andrea Righi wrote:
> >>> Hi Vincent,
> >>>
> >>> On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
> >>>> On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> >>>>> different per-core frequencies), the wakeup path uses
> >>>>> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> >>>>> for better task placement. However, when those CPUs belong to SMT cores,
> >>>>
> >>>> Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
> >>>> ? I thought both were never set simultaneously and SD_ASYM_PACKING was
> >>>> used for system involving SMT like x86
> >>>
> >>> It's an NVIDIA platform (not publicly available yet), where the firmware
> >>> exposes different CPU capacities and has SMT enabled, so both
> >>> SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
> >>> firmware release will keep this exact configuration (there's a good chance
> >>> it will), so I'm targeting it to be prepared.
> >>
> >>
> >> Andrea,
> >> that makes me think, I've played with a nvidia grace available to me recently,
> >> which sets slightly different CPPC highest_perf values (~2%) which automatically
> >> will set SD_ASYM_CPUCAPACITY and run the entire capacity-aware scheduling
> >> machinery for really almost negligible capacity differences, where it's
> >> questionable how sensible that is.
> >
> > That looks like the same system that I've been working with. I agree that
> > treating small CPPC differences as full asymmetry can be a bit overkill.
> >
> > I've been experimenting with flattening the capacities (to force the
> > "regular" idle CPU selection policy), which performs better than the
> > current asym-capacity CPU selection. However, adding SMT awareness to
> > the asym-capacity path seems to give a consistent +2-3% (same set of
> > CPU-intensive benchmarks) compared to flattening alone, which is not bad.
> >
> >> I have an arm64 + CPPC implementation for asym-packing for this machine, maybe
> >> we can reuse that for here too?
> >
> > Sure, that sounds interesting, if it's available somewhere I'd be happy to
> > do some testing.
> >
> Hi Andrea,
>
> I will clean up the asympacking code a bit and share it with you for testing.
>
> Interestingly, when we looked at DCPerf MediaWiki, we found the exact opposite.
>
> On NVIDIA Grace, enabling CAS due to the small CPPC highest_perf differences was
> actually beneficial for the workload. More interestingly, we saw a similar uplift
> on a different arm64 server without ASYM_CPUCAPACITY when we force-enabled
> sched_asym_cpucap_active() even though the system was highest_perf-symmetric.
> That suggests the uplift on Grace may have come from CAS-specific behavior rather
> than from better selection of the highest_perf CPUs.

Which NVIDIA Grace in particular? On GB300, ASYM_CPUCAPACITY seems to be
enabled. I can try to disable / equalize the capacities and repeat the test
there as well.
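
(Side note: on arm64 the capacities the kernel derived from the topology /
CPPC data are visible via the cpu_capacity sysfs attribute, so a quick dump
before/after any flattening experiment should confirm what the scheduler
actually sees. A minimal sketch, assuming the standard sysfs layout; on
systems without the attribute it simply prints nothing:)

```shell
# Dump per-CPU scheduler capacity as exposed by the arm64 topology code.
# The attribute is read-only; actually flattening the values requires a
# kernel-side change or equalized firmware (CPPC highest_perf) data.
for f in /sys/devices/system/cpu/cpu[0-9]*/cpu_capacity; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
```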

>
> I'd be very curious whether the inverse is happening in your case as well,
> i.e. flattening the capacities but still forcing
> select_idle_sibling() / sched_asym_cpucap_active() despite equal
> capacities. Of course, that will also depend on the workloads (what are you testing?)

I can definitely try that. I'm using an internal benchmark suite; the
benchmark showing the biggest improvements is based on the NVBLAS library
(but using the CPUs, not the GPUs). I'm not sure whether it's publicly
available, I'll check.

>
> Just to illustrate, below is one example where CAS improved both score and CPU utilization:
> +---------------------------+---------------------+-----------------------+-----------------------------------------+
> | Platform                  | default (v6.8)      | force all CPUs = 1024 | force sched_asym_cpucap_active() = TRUE |
> +---------------------------+---------------------+-----------------------+-----------------------------------------+
> | arm64 symmetric (72 CPUs) | 100% (90% CPU util) | -------------         | 104.26% (99%)                           |
> | Grace (72 CPUs)           | 100% (99%)          | 99.49% (90%)          | -------------                           |
> +---------------------------+---------------------+-----------------------+-----------------------------------------+

I see, interesting. Now I'm curious to try the opposite on the GB300 that I
have access to: flatten the capacities to 1024 and see what I get.

Thanks,
-Andrea