Re: [RFC patch v3 00/20] Cache aware scheduling

From: Chen, Yu C
Date: Tue Jun 24 2025 - 08:22:03 EST



On 6/24/2025 1:00 PM, K Prateek Nayak wrote:
Hello Tim,

On 6/18/2025 11:57 PM, Tim Chen wrote:
AMD Milan was also tested. There are 4 Nodes and 32 CPUs per node.
Each node has 4 CCXs (shared LLCs) and each CCX has 8 CPUs. The
hackbench 1-group test scenario benefits from cache-aware load
balancing too:

hackbench (1 group, fd pairs in [1,6]):
case                    load            baseline(std%)  compare%( std%)
threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)

Besides a single instance of hackbench, four instances of hackbench were
also tested on Milan. The test results show that the different hackbench
instances are aggregated to dedicated LLCs, and a performance improvement
is observed.

schbench mmtests (unstable)
                                   baseline              nowake_lb
Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)

Netperf/tbench have not been tested yet, as they are single-process
benchmarks that are not the target of this cache-aware scheduling.
Additionally, client and server components should be tested on
different machines or bound to different nodes. Otherwise,
cache-aware scheduling might harm their performance: placing client
and server in the same LLC could yield higher throughput due to
improved cache locality in the TCP/IP stack, whereas cache-aware
scheduling aims to place them in dedicated LLCs.

I have similar observations from my testing.


Prateek, thanks for your test.

tl;dr

o Benchmarks that prefer co-location and run in threaded mode see
  a benefit, including hackbench at high utilization and schbench
  at low utilization.


Previously, we tested hackbench with one group using different
fd pairs. The number of fds (1–6) was lower than the number
of CPUs (8) within one CCX. If I understand correctly, the
default number of fd pairs in hackbench is 20. We might need
to handle cases where the number of threads (nr_thread)
exceeds the number of CPUs per LLC—perhaps by
skipping task aggregation in such scenarios.
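
To make the idea concrete, below is a rough standalone sketch in
userspace C (simplified stand-in types; mm_stub, should_aggregate()
and the fields are made-up names for illustration, not the actual
kernel structures) of skipping aggregation when a process has more
runnable threads than the preferred LLC has CPUs:

/*
 * Standalone illustration only: skip cache-aware aggregation when a
 * process has more runnable threads than the preferred LLC has CPUs.
 * Types and field names are made up and do not match the kernel.
 */
#include <stdbool.h>
#include <stdio.h>

struct mm_stub {
	int nr_running_threads;   /* runnable threads of this process */
	int preferred_llc;        /* preferred LLC id, -1 if none */
};

static bool should_aggregate(const struct mm_stub *mm, int cpus_per_llc)
{
	if (mm->preferred_llc < 0)
		return false;
	/*
	 * More threads than the LLC can hold: aggregation would only
	 * overload the preferred LLC, so fall back to normal balancing.
	 */
	if (mm->nr_running_threads > cpus_per_llc)
		return false;
	return true;
}

int main(void)
{
	struct mm_stub hb1 = { .nr_running_threads = 6,  .preferred_llc = 2 };
	struct mm_stub hb2 = { .nr_running_threads = 40, .preferred_llc = 2 };

	printf("6 threads,  8 CPUs/LLC -> aggregate: %d\n", should_aggregate(&hb1, 8));
	printf("40 threads, 8 CPUs/LLC -> aggregate: %d\n", should_aggregate(&hb2, 8));
	return 0;
}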

o schbench (both new and old, but particularly the old) regresses
  quite a bit on the tail latency metric when #workers cross the
  LLC size.


As mentioned above, reconsidering the nr_thread vs nr_cpus_per_llc
check could mitigate the issue. Besides, introducing a rate limit
for cache-aware aggregation might help.
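
Such a rate limit could look roughly like the standalone sketch below
(again with made-up names, and a plain millisecond counter standing in
for whatever time source the real code would use); the point is just to
cap how often aggregation-driven migrations are attempted per process:

/*
 * Standalone illustration only: allow at most one aggregation-driven
 * migration attempt per process per interval.  Names and the time
 * handling are simplified for the sketch.
 */
#include <stdbool.h>
#include <stdio.h>

#define AGGR_INTERVAL_MS 100

struct mm_stub {
	unsigned long last_aggr_ms;   /* time of last aggregation attempt */
};

static bool aggregation_allowed(struct mm_stub *mm, unsigned long now_ms)
{
	if (now_ms - mm->last_aggr_ms < AGGR_INTERVAL_MS)
		return false;             /* too soon, keep tasks where they are */
	mm->last_aggr_ms = now_ms;
	return true;
}

int main(void)
{
	struct mm_stub mm = { .last_aggr_ms = 0 };

	printf("t=150ms -> %d\n", aggregation_allowed(&mm, 150)); /* 1: allowed */
	printf("t=200ms -> %d\n", aggregation_allowed(&mm, 200)); /* 0: rate limited */
	printf("t=300ms -> %d\n", aggregation_allowed(&mm, 300)); /* 1: allowed again */
	return 0;
}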

o client-server benchmarks where clients and servers are threads
  from different processes (netserver-netperf, tbench_srv-tbench,
  services of DeathStarBench) seem to noticeably regress due to
  lack of co-location between the communicating client and server.

  Not sure if WF_SYNC can be an indicator to temporarily ignore
  the preferred LLC hint.

WF_SYNC is used in the wakeup path, while the current v3 version does
the task aggregation in the load-balance path. We'll look into this
client/server scenario.
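
For reference, the suggestion would roughly amount to something like
the standalone sketch below, which prefers the waker's LLC over the
process-wide preferred LLC on a synchronous wakeup. Only the WF_SYNC
name comes from the kernel; its value here, the types and
pick_target_llc() are made up for illustration:

/*
 * Standalone illustration only: on a synchronous (WF_SYNC) wakeup the
 * waker is about to sleep, so its LLC likely has a warm cache and a
 * free CPU for the wakee; prefer it over the preferred-LLC hint.
 */
#include <stdio.h>

#define WF_SYNC 0x10   /* illustrative value for this sketch */

struct task_stub {
	int preferred_llc;   /* LLC chosen by occupancy tracking, -1 if none */
};

static int pick_target_llc(const struct task_stub *wakee, int waker_llc, int wake_flags)
{
	if (wake_flags & WF_SYNC)
		return waker_llc;            /* co-locate with the waker */
	if (wakee->preferred_llc >= 0)
		return wakee->preferred_llc; /* otherwise honor the hint */
	return waker_llc;
}

int main(void)
{
	struct task_stub srv = { .preferred_llc = 3 };

	printf("sync wakeup   -> LLC %d\n", pick_target_llc(&srv, 1, WF_SYNC));
	printf("normal wakeup -> LLC %d\n", pick_target_llc(&srv, 1, 0));
	return 0;
}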


o stream regresses in some runs where the occupancy metrics trip
  and assign a preferred LLC for all the stream threads, bringing
  down performance in !50% of the runs.


May I know if you tested stream with mmtests in OMP mode,
and what stream-10 and stream-100 mean? Stream is an example
where all threads have their private memory buffers—no
interaction with each other. For this benchmark, spreading
them across different Nodes gets higher memory bandwidth because
stream allocates the buffer to be at least 4X the L3 cache size.
We lack a metric that can indicate when threads share a lot of
data (e.g., both Thread 1 and Thread 2 read from the same
buffer). In such cases, we should aggregate the threads;
otherwise, do not aggregate them (as in the stream case).
On the other hand, stream-omp seems like an unrealistic
scenario: if threads do not share a buffer, why create them
in the same process?


Full data from my testing is as follows:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel details

tip:      tip:sched/core at commit 914873bc7df9 ("Merge tag
           'x86-build-2025-05-25' of
           git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

llc-aware-lb-v3: tip + this series as is

o Benchmark results

    ==================================================================
    Test          : hackbench
    Units         : Normalized time in seconds
    Interpretation: Lower is better
    Statistic     : AMean
    ==================================================================
    Case:           tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
     1-groups     1.00 [ -0.00](13.74)     1.03 [ -2.77](12.01)
     2-groups     1.00 [ -0.00]( 9.58)     1.02 [ -1.78]( 6.12)
     4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -0.87]( 0.91)
     8-groups     1.00 [ -0.00]( 1.51)     1.03 [ -3.31]( 2.06)
    16-groups     1.00 [ -0.00]( 1.10)     0.95 [  5.36]( 1.67)


    ==================================================================
    Test          : tbench
    Units         : Normalized throughput
    Interpretation: Higher is better
    Statistic     : AMean
    ==================================================================
    Clients:    tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
        1     1.00 [  0.00]( 0.82)     0.96 [ -3.68]( 1.23)
        2     1.00 [  0.00]( 1.13)     0.98 [ -2.30]( 0.51)
        4     1.00 [  0.00]( 1.12)     0.96 [ -4.14]( 0.22)
        8     1.00 [  0.00]( 0.93)     0.96 [ -3.61]( 0.46)
       16     1.00 [  0.00]( 0.38)     0.95 [ -4.98]( 1.26)
       32     1.00 [  0.00]( 0.66)     0.93 [ -7.12]( 2.22)
       64     1.00 [  0.00]( 1.18)     0.95 [ -5.44]( 0.37)
      128     1.00 [  0.00]( 1.12)     0.93 [ -6.78]( 0.64)
      256     1.00 [  0.00]( 0.42)     0.94 [ -6.45]( 0.47)
      512     1.00 [  0.00]( 0.14)     0.93 [ -7.26]( 0.27)
     1024     1.00 [  0.00]( 0.26)     0.92 [ -7.57]( 0.31)


    ==================================================================
    Test          : stream-10
    Units         : Normalized Bandwidth, MB/s
    Interpretation: Higher is better
    Statistic     : HMean
    ==================================================================
    Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
     Copy     1.00 [  0.00]( 8.37)     0.39 [-61.05](44.88)
    Scale     1.00 [  0.00]( 2.85)     0.43 [-57.26](40.60)
      Add     1.00 [  0.00]( 3.39)     0.40 [-59.88](42.02)
    Triad     1.00 [  0.00]( 6.39)     0.41 [-58.93](42.98)


    ==================================================================
    Test          : stream-100
    Units         : Normalized Bandwidth, MB/s
    Interpretation: Higher is better
    Statistic     : HMean
    ==================================================================
    Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
     Copy     1.00 [  0.00]( 3.91)     0.36 [-63.95](51.04)
    Scale     1.00 [  0.00]( 4.34)     0.40 [-60.31](43.12)
      Add     1.00 [  0.00]( 4.14)     0.38 [-62.46](43.40)
    Triad     1.00 [  0.00]( 1.00)     0.36 [-64.38](43.12)


    ==================================================================
    Test          : netperf
    Units         : Normalized Throughput
    Interpretation: Higher is better
    Statistic     : AMean
    ==================================================================
    Clients:         tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
     1-clients     1.00 [  0.00]( 0.41)     0.97 [ -3.26]( 1.30)
     2-clients     1.00 [  0.00]( 0.58)     0.96 [ -4.24]( 0.71)
     4-clients     1.00 [  0.00]( 0.35)     0.96 [ -4.19]( 0.67)
     8-clients     1.00 [  0.00]( 0.48)     0.95 [ -5.41]( 1.36)
    16-clients     1.00 [  0.00]( 0.66)     0.95 [ -5.31]( 0.93)
    32-clients     1.00 [  0.00]( 1.15)     0.94 [ -6.43]( 1.44)
    64-clients     1.00 [  0.00]( 1.38)     0.93 [ -7.14]( 1.63)
    128-clients    1.00 [  0.00]( 0.87)     0.89 [-10.62]( 0.78)
    256-clients    1.00 [  0.00]( 5.36)     0.92 [ -8.04]( 2.64)
    512-clients    1.00 [  0.00](54.39)     0.88 [-12.12](48.87)


    ==================================================================
    Test          : schbench
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      1     1.00 [ -0.00]( 8.54)     0.54 [ 45.65](28.79)
      2     1.00 [ -0.00]( 1.15)     0.56 [ 44.00]( 2.09)
      4     1.00 [ -0.00](13.46)     0.67 [ 33.33](35.68)
      8     1.00 [ -0.00]( 7.14)     0.63 [ 36.84]( 4.28)
     16     1.00 [ -0.00]( 3.49)     1.05 [ -5.08]( 9.13)
     32     1.00 [ -0.00]( 1.06)    32.04 [-3104.26](81.31)
     64     1.00 [ -0.00]( 5.48)    24.51 [-2351.16](81.18)
    128     1.00 [ -0.00](10.45)    14.56 [-1356.07]( 5.35)
    256     1.00 [ -0.00](31.14)     0.95 [  4.80](20.88)
    512     1.00 [ -0.00]( 1.52)     1.00 [ -0.25]( 1.26)


    ==================================================================
    Test          : new-schbench-requests-per-second
    Units         : Normalized Requests per second
    Interpretation: Higher is better
    Statistic     : Median
    ==================================================================
    #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      1     1.00 [  0.00]( 1.07)     0.97 [ -3.24]( 0.98)
      2     1.00 [  0.00]( 0.00)     0.99 [ -1.17]( 0.15)
      4     1.00 [  0.00]( 0.00)     0.96 [ -3.50]( 0.56)
      8     1.00 [  0.00]( 0.15)     0.98 [ -1.76]( 0.31)
     16     1.00 [  0.00]( 0.00)     0.94 [ -6.13]( 1.93)
     32     1.00 [  0.00]( 3.41)     0.97 [ -3.18]( 2.10)
     64     1.00 [  0.00]( 1.05)     0.82 [-18.14](18.41)
    128     1.00 [  0.00]( 0.00)     0.98 [ -2.27]( 0.20)
    256     1.00 [  0.00]( 0.72)     1.01 [  1.23]( 0.31)
    512     1.00 [  0.00]( 0.57)     1.00 [  0.00]( 0.12)


    ==================================================================
    Test          : new-schbench-wakeup-latency
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      1     1.00 [ -0.00]( 9.11)     0.88 [ 12.50](11.92)
      2     1.00 [ -0.00]( 0.00)     0.86 [ 14.29](11.92)
      4     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 4.08)
      8     1.00 [ -0.00]( 0.00)     0.83 [ 16.67]( 5.34)
     16     1.00 [ -0.00]( 7.56)     0.85 [ 15.38]( 0.00)
     32     1.00 [ -0.00](15.11)     0.80 [ 20.00]( 4.19)
     64     1.00 [ -0.00]( 9.63)     1.05 [ -5.00](24.47)
    128     1.00 [ -0.00]( 4.86)     1.57 [-56.78](68.52)
    256     1.00 [ -0.00]( 2.34)     1.00 [ -0.00]( 0.57)
    512     1.00 [ -0.00]( 0.40)     1.00 [ -0.00]( 0.34)


    ==================================================================
    Test          : new-schbench-request-latency
    Units         : Normalized 99th percentile latency in us
    Interpretation: Lower is better
    Statistic     : Median
    ==================================================================
    #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      1     1.00 [ -0.00]( 2.73)     1.06 [ -5.71]( 0.25)
      2     1.00 [ -0.00]( 0.87)     1.08 [ -8.37]( 0.78)
      4     1.00 [ -0.00]( 1.21)     1.09 [ -9.15]( 0.79)
      8     1.00 [ -0.00]( 0.27)     1.06 [ -6.31]( 0.51)
     16     1.00 [ -0.00]( 4.04)     1.85 [-84.55]( 5.11)
     32     1.00 [ -0.00]( 7.35)     1.52 [-52.16]( 0.83)
     64     1.00 [ -0.00]( 3.54)     1.06 [ -5.77]( 2.62)
    128     1.00 [ -0.00]( 0.37)     1.09 [ -9.18](28.47)
    256     1.00 [ -0.00]( 9.57)     0.99 [  0.60]( 0.48)
    512     1.00 [ -0.00]( 1.82)     1.03 [ -2.80]( 1.16)


    ==================================================================
    Test          : Various longer running benchmarks
    Units         : %diff in throughput reported
    Interpretation: Higher is better
    Statistic     : Median
    ==================================================================
    Benchmarks:                  %diff
    ycsb-cassandra              -0.99%
    ycsb-mongodb                -0.96%
    deathstarbench-1x           -2.09%
    deathstarbench-2x           -0.26%
    deathstarbench-3x           -3.34%
    deathstarbench-6x           -3.03%
    hammerdb+mysql 16VU         -2.15%
    hammerdb+mysql 64VU         -3.77%


This patch set is applied on the v6.15 kernel.
There is some further work needed for future versions of this
patch set.  We will need to align NUMA balancing with LLC aggregation
such that LLC aggregation will align with the preferred NUMA node.

Comments and tests are much appreciated.

I'll rerun the test once with the SCHED_FEAT() disabled just to make
sure I'm not regressing because of some other factors. For the major
regressions, I'll get the "perf sched stats" data to see if anything
stands out.

It seems that tasks migrating and bouncing between their preferred
LLC and non-preferred LLCs is one symptom that caused the regression.

thanks,
Chenyu


I'm also planning on getting the data from a Zen5c system with a larger
LLC to see if there is any difference in the trend (I'll start with the
microbenchmarks since setting up the larger ones will take some time).

Sorry for the lack of engagement on previous versions, but I plan on
taking a better look at the series this time around. If you need any
specific data from my setup, please do let me know.