Re: [PATCH v18 0/8] Single RunQueue Proxy Execution (v18)
From: K Prateek Nayak
Date: Thu Jun 26 2025 - 23:05:11 EST
Hello John,
On 6/26/2025 2:00 AM, John Stultz wrote:
> Hey All,
>
> After not getting much response from the v17 series (and
> resending it), I was going to continue to just iterate resending
> the v17 single-runqueue-focused series. However, Suleiman had a
> very good suggestion for improving the larger patch series, and a
> few of the tweaks for those changes trickled back into the set
> I'm submitting here.
>
> Unfortunately those later changes also uncovered some stability
> problems with the full proxy-exec patch series, which took a
> painfully long time to resolve (stress testing took 30-60 hours
> to trip the problem). However, after finally sorting those
> issues out, it has been running well, so I can now send out the
> next revision (v18) of the set.
>
> So here is v18 of the Proxy Execution series, a generalized form
> of priority inheritance.
Sorry for the lack of response on the previous version, but here
are the test results for v18.

tl;dr: I don't see anything major. The few regressions I see are
for data points with a lot of deviation, so I think they can
safely be ignored.
Full results are below:
o Machine details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remain enabled)
o Kernel details
tip: tip:sched/urgent at commit 914873bc7df9 ("Merge tag
'x86-build-2025-05-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
proxy_exec: tip + this series as is with CONFIG_SCHED_PROXY_EXEC=y
o Benchmark results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) proxy_exec[pct imp](CV)
1-groups 1.00 [ -0.00](13.74) 1.03 [ -3.20]( 8.80)
2-groups 1.00 [ -0.00]( 9.58) 1.04 [ -4.45]( 6.58)
4-groups 1.00 [ -0.00]( 2.10) 1.02 [ -2.17]( 1.85)
8-groups 1.00 [ -0.00]( 1.51) 0.99 [ 1.42]( 1.47)
16-groups 1.00 [ -0.00]( 1.10) 1.00 [ 0.42]( 1.23)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy_exec[pct imp](CV)
1 1.00 [ 0.00]( 0.82) 1.02 [ 1.78]( 1.06)
2 1.00 [ 0.00]( 1.13) 1.03 [ 3.30]( 1.05)
4 1.00 [ 0.00]( 1.12) 1.02 [ 1.86]( 1.05)
8 1.00 [ 0.00]( 0.93) 1.02 [ 1.74]( 0.72)
16 1.00 [ 0.00]( 0.38) 1.02 [ 2.28]( 1.35)
32 1.00 [ 0.00]( 0.66) 1.01 [ 1.44]( 0.85)
64 1.00 [ 0.00]( 1.18) 1.02 [ 1.98]( 1.28)
128 1.00 [ 0.00]( 1.12) 1.00 [ 0.31]( 0.89)
256 1.00 [ 0.00]( 0.42) 1.00 [ -0.49]( 0.91)
512 1.00 [ 0.00]( 0.14) 1.01 [ 0.94]( 0.33)
1024 1.00 [ 0.00]( 0.26) 1.01 [ 0.95]( 0.24)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy_exec[pct imp](CV)
Copy 1.00 [ 0.00]( 8.37) 0.98 [ -2.35]( 8.36)
Scale 1.00 [ 0.00]( 2.85) 0.93 [ -7.21]( 7.24)
Add 1.00 [ 0.00]( 3.39) 0.93 [ -7.50]( 6.56)
Triad 1.00 [ 0.00]( 6.39) 1.04 [ 4.18]( 7.77)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) proxy_exec[pct imp](CV)
Copy 1.00 [ 0.00]( 3.91) 1.02 [ 2.00]( 2.92)
Scale 1.00 [ 0.00]( 4.34) 0.99 [ -0.58]( 3.88)
Add 1.00 [ 0.00]( 4.14) 1.02 [ 1.96]( 1.71)
Triad 1.00 [ 0.00]( 1.00) 0.99 [ -0.50]( 2.43)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) proxy_exec[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.41) 1.02 [ 2.40]( 0.32)
2-clients 1.00 [ 0.00]( 0.58) 1.02 [ 2.21]( 0.30)
4-clients 1.00 [ 0.00]( 0.35) 1.02 [ 2.20]( 0.63)
8-clients 1.00 [ 0.00]( 0.48) 1.02 [ 1.98]( 0.50)
16-clients 1.00 [ 0.00]( 0.66) 1.02 [ 2.19]( 0.49)
32-clients 1.00 [ 0.00]( 1.15) 1.02 [ 2.17]( 0.75)
64-clients 1.00 [ 0.00]( 1.38) 1.01 [ 1.43]( 1.39)
128-clients 1.00 [ 0.00]( 0.87) 1.01 [ 0.60]( 1.09)
256-clients 1.00 [ 0.00]( 5.36) 1.01 [ 0.54]( 4.29)
512-clients 1.00 [ 0.00](54.39) 0.99 [ -0.61](52.23)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) proxy_exec[pct imp](CV)
1 1.00 [ -0.00]( 8.54) 0.76 [ 23.91](23.47)
2 1.00 [ -0.00]( 1.15) 0.90 [ 10.00]( 8.11)
4 1.00 [ -0.00](13.46) 1.10 [-10.42](10.94)
8 1.00 [ -0.00]( 7.14) 0.89 [ 10.53]( 3.92)
16 1.00 [ -0.00]( 3.49) 1.00 [ -0.00]( 8.93)
32 1.00 [ -0.00]( 1.06) 0.96 [ 4.26](10.99)
64 1.00 [ -0.00]( 5.48) 1.08 [ -8.14]( 4.03)
128 1.00 [ -0.00](10.45) 1.09 [ -8.64](13.37)
256 1.00 [ -0.00](31.14) 1.12 [-11.66](16.77)
512 1.00 [ -0.00]( 1.52) 0.98 [ 2.02]( 1.50)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) proxy_exec[pct imp](CV)
1 1.00 [ 0.00]( 1.07) 1.00 [ -0.29]( 0.53)
2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
4 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.30)
8 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.00)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 3.41) 1.03 [ 3.50]( 0.27)
64 1.00 [ 0.00]( 1.05) 1.00 [ -0.38]( 4.45)
128 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.19)
256 1.00 [ 0.00]( 0.72) 0.99 [ -0.61]( 0.63)
512 1.00 [ 0.00]( 0.57) 1.00 [ -0.24]( 0.33)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) proxy_exec[pct imp](CV)
1 1.00 [ -0.00]( 9.11) 0.81 [ 18.75](10.25)
2 1.00 [ -0.00]( 0.00) 0.86 [ 14.29](11.08)
4 1.00 [ -0.00]( 3.78) 1.29 [-28.57](17.25)
8 1.00 [ -0.00]( 0.00) 1.17 [-16.67]( 3.60)
16 1.00 [ -0.00]( 7.56) 1.00 [ -0.00]( 6.88)
32 1.00 [ -0.00](15.11) 0.80 [ 20.00]( 0.00)
64 1.00 [ -0.00]( 9.63) 0.95 [ 5.00]( 7.32)
128 1.00 [ -0.00]( 4.86) 0.96 [ 3.52]( 8.69)
256 1.00 [ -0.00]( 2.34) 0.95 [ 4.70]( 2.78)
512 1.00 [ -0.00]( 0.40) 0.99 [ 0.77]( 0.20)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) proxy_exec[pct imp](CV)
1 1.00 [ -0.00]( 2.73) 1.02 [ -1.82]( 3.15)
2 1.00 [ -0.00]( 0.87) 1.02 [ -2.16]( 1.90)
4 1.00 [ -0.00]( 1.21) 1.04 [ -3.77]( 2.76)
8 1.00 [ -0.00]( 0.27) 1.01 [ -1.31]( 2.01)
16 1.00 [ -0.00]( 4.04) 1.00 [ 0.27]( 0.77)
32 1.00 [ -0.00]( 7.35) 0.89 [ 11.07]( 1.68)
64 1.00 [ -0.00]( 3.54) 1.02 [ -1.55]( 1.47)
128 1.00 [ -0.00]( 0.37) 1.00 [ 0.41]( 0.11)
256 1.00 [ -0.00]( 9.57) 0.91 [ 8.84]( 3.64)
512 1.00 [ -0.00]( 1.82) 1.02 [ -1.93]( 1.21)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra 0.82%
ycsb-mongodb -0.45%
deathstarbench-1x 2.44%
deathstarbench-2x 1.88%
deathstarbench-3x 0.09%
deathstarbench-6x 1.94%
hammerdb+mysql 16VU 3.65%
hammerdb+mysql 64VU -0.59%
> As I'm trying to submit this work in smallish, digestible pieces,
> in this series I'm only submitting for review the logic that
> allows us to do the proxying if the lock owner is on the same
> runqueue as the blocked waiter: introducing the
> CONFIG_SCHED_PROXY_EXEC option and boot argument, reworking the
> task_struct::blocked_on pointer and its wrapper functions, the
> initial sketch of the find_proxy_task() logic, some fixes for
> using split contexts, and finally same-runqueue proxying.
> As I mentioned above, the series I'm submitting here has only
> barely changed from v17. The main differences are a slightly
> different order of checks for cases where we don't actually do
> anything yet (more on why below), and the use of READ_ONCE() for
> the on_rq reads to avoid the compiler fusing loads, which I was
> bitten by with the full series.
For this series (Single RunQueue Proxy), feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
I'll go and test the full series next and reply with the
results on this same thread sometime next week. Meanwhile I'll
try to queue a longer locktorture run over the weekend. I'll
let you know if I see anything out of the ordinary on my setup.
--
Thanks and Regards,
Prateek