Re: IPC drop down on AMD epyc 7702P

From: K Prateek Nayak
Date: Mon May 05 2025 - 08:29:44 EST


Hello Vincent,

On 5/5/2025 3:58 PM, Vincent Guittot wrote:
On Wed, 30 Apr 2025 at 11:13, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
(+ more scheduler folks)

tl;dr

JB has a workload that hates aggressive migration on the 2nd Generation
EPYC platform that has a small LLC domain (4C/8T) and very noticeable
C2C latency.

Based on JB's observation so far, reverting commit 16b0a7a1a0af
("sched/fair: Ensure tasks spreading in LLC during LB") and commit
c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
condition") helps the workload. Both commits allow aggressive
migrations for work conservation, but they also increase cache
misses, which slows the workload quite a bit.
commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC
during LB") eases the spread of tasks inside an LLC, so it's not obvious
to me how it would increase "a lot of CPU migrations go out of CCX,
then L3 miss,". On the other hand, it will spread tasks across SMT and
within the LLC, which can prevent running at the highest frequency on
some systems, but I don't know if that's relevant for this SoC.

I misspoke there. JB's workload seems to be sensitive even to
core-to-core migrations: "relax_domain_level=2" actually disabled
newidle balance above the CLUSTER level, which is a subset of MC on
x86 and gets degenerated into the SMT domain.
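
To make the level arithmetic concrete, here is a minimal Python model of
the behaviour described above. It is an illustration, not kernel code:
the level numbering (SMT=0, CLUSTER=1, MC=2, PKG=3) and the
"level >= requested value" cutoff are assumptions based on recent x86
kernels and may differ on other versions or configs. LEVELS and
newidle_enabled() are hypothetical names.

```python
# Illustrative model only, not kernel code.
# ASSUMPTION: x86 topology levels are numbered in this order before
# degenerate domains are collapsed.
LEVELS = {"SMT": 0, "CLUSTER": 1, "MC": 2, "PKG": 3}


def newidle_enabled(relax_domain_level, levels=LEVELS):
    """Map each domain name to whether newidle balance stays enabled.

    ASSUMPTION: a domain loses newidle balance when its level is >= the
    requested value; a negative value leaves every domain untouched.
    """
    if relax_domain_level < 0:
        return {name: True for name in levels}
    return {name: lvl < relax_domain_level for name, lvl in levels.items()}


print(newidle_enabled(2))
# {'SMT': True, 'CLUSTER': True, 'MC': False, 'PKG': False}
```

Under these assumptions, a value of 2 leaves newidle balance on only
below MC, which matches the "above CLUSTER" behaviour described here.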


commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
condition") makes newly idle migration happen more often, which can
then migrate tasks across LLCs. But then it's more a question of why
newly idle load balance is enabled beyond the LLC if it is so costly.

It seems to be a very workload-specific, and possibly platform-specific,
characteristic where re-priming the cache is actually very costly.
I'm not sure if there are any other uarch factors at play here that
require re-priming (branch prediction, prefetcher, etc.) after a task
migration to reach the same IPC.

Essentially, "relax_domain_level" gets the desired characteristic
where only the periodic balance corrects long-term imbalance,
but as Libo mentioned, short-term imbalances can build up
and using "relax_domain_level" might lead to other problems.
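
One way to check which domains still do newidle balance on a given
machine is to read the scheduler's debugfs files. A rough sketch,
assuming debugfs is mounted at /sys/kernel/debug, that the kernel
exposes sched/domains/cpuN/domainM/{name,flags}, and that the flags
file lists flag names separated by whitespace. newidle_by_domain() is
a hypothetical helper, and reading these files normally needs root.

```python
from pathlib import Path

# ASSUMPTION: debugfs mount point and layout; adjust for your system.
DEBUGFS_DOMAINS = Path("/sys/kernel/debug/sched/domains")


def newidle_by_domain(cpu=0, base=DEBUGFS_DOMAINS):
    """Return [(domain name, has SD_BALANCE_NEWIDLE)], lowest domain first."""
    cpu_dir = Path(base) / f"cpu{cpu}"
    # Sort numerically so domain10 would come after domain2.
    domains = sorted(cpu_dir.glob("domain*"),
                     key=lambda p: int(p.name[len("domain"):]))
    result = []
    for dom in domains:
        name = (dom / "name").read_text().strip()
        flags = (dom / "flags").read_text().split()
        result.append((name, "SD_BALANCE_NEWIDLE" in flags))
    return result


if __name__ == "__main__":
    try:
        for name, enabled in newidle_by_domain(0):
            print(f"{name}: newidle balance {'on' if enabled else 'off'}")
    except OSError as err:
        print(f"could not read scheduler debugfs: {err}")
```

Comparing the output with and without "relax_domain_level" on the
kernel command line should show which domains had the flag cleared.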

Short of pinning, or more analysis of which migrations make
the workload unhappy, I couldn't think of a better way to
communicate this requirement.
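
For completeness, the pinning option could look roughly like the sketch
below. It assumes cache index3 in sysfs is the L3/LLC on this platform;
parse_cpu_list(), llc_siblings() and pin_to_llc() are hypothetical
helper names, and this is not something JB is confirmed to have run.

```python
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0-3,64-67" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single CPU ("5") has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def llc_siblings(cpu):
    """CPUs sharing the L3 with `cpu`. ASSUMPTION: index3 is the LLC."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/cache/index3/shared_cpu_list"
    with open(path) as f:
        return parse_cpu_list(f.read())


def pin_to_llc(cpu, pid=0):
    """Restrict `pid` (0 = calling process) to the LLC containing `cpu`."""
    os.sched_setaffinity(pid, llc_siblings(cpu))
```

Since the workload also seems sensitive to core-to-core migrations, an
LLC-wide mask may still be too loose; a tighter mask built from
topology/thread_siblings_list would be the next thing to try.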


"relax_domain_level" helps but cannot be set at runtime and I couldn't
think of any stable / debug interfaces that JB hasn't tried out
already that can help this workload.

There is a patch towards the end to set "relax_domain_level" at
runtime, but given cpusets did away with this when transitioning to
cgroup-v2, I don't know what the sentiments are around its usage.
Any input / feedback is greatly appreciated.

--
Thanks and Regards,
Prateek