Re: [REGRESSION] slab: replace cpu (partial) slabs with sheaves
From: Aishwarya Rambhadran
Date: Fri Mar 27 2026 - 12:45:46 EST
Hi all,
Thanks for the discussion and the insights.
For completeness, the systems under test (SUTs) are single NUMA node
machines:
$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 257218 MB
node 0 free: 255376 MB
node distances:
node 0
0: 10
As suggested by Ryan, I re-ran the perf benchmarks and compared the
results across 6.17, 6.18, and later kernels. The behaviour across
versions is consistent with what has been discussed in this thread
and matches our observations.
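For anyone wanting to reproduce locally, the microbenchmarks referred
to in the quoted discussion can be run directly with perf (assuming a
perf build with the bench subcommands; repeat counts and options are
left at perf's defaults here):

```shell
# Fork/execve syscall microbenchmarks discussed in this thread
perf bench syscall fork
perf bench syscall execve
```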
Thanks again for the clarifications and apologies for the table
rendering issues in the initial email.
Regards,
Aishwarya Rambhadran
On 27/03/26 4:51 PM, Vlastimil Babka (SUSE) wrote:
> On 3/27/26 11:00, Harry Yoo (Oracle) wrote:
>> On Fri, Mar 27, 2026 at 08:58:36AM +0000, Ryan Roberts wrote:
>>>> In retrospect it was an oversight not to disable the
>>>> pre-existing cpu caching layer immediately for sheaf-enabled
>>>> caches in 6.18. Can't undo that mistake now, unfortunately.
>>>
>>> Ahh sorry I missed your first email. We only added that benchmark
>>> from 6.19 so don't have results for earlier kernels, but I'll ask
>>> Aishu to run it for 6.17 and 6.18 to see if the results correlate
>>> with your expectation. But from a high level perspective, a 7%
>>> regression on fork is not ideal even if there was a 7%
>>> improvement in 6.18.
>
> Sure, I tried to explain those in my first reply. Harry then linked
> to how that explanation can be verified. Hopefully it's really the
> same reason.
>
>>> The perf/syscall cases might be a bit more concerning though?
>>> (those tests are from "perf bench syscall fork|execve"). Yes they
>>> are microbenchmarks, but a 7% increased cost for fork seems like
>>> something we'd want to avoid if we can.
>
> Right so there should be just the overhead of the extra
> is_vmalloc_addr() test. Possibly also the call of kfree_rcu_sheaf()
> if it's not inlined. I'd say it's something we can just accept? It
> seems this is a unit test being used as a microbenchmark, so it can
> be very sensitive even to such details, but it should be negligible
> in practice.
>
>> On 3/26/26 13:43, Aishwarya Rambhadran wrote:
>>> [...]
>>
>> If that improvement comes from the number of objects cached per
>> CPU, I'm not sure if determining the default value (# of cached
>> objs) based on "a point when microbenchmarks stop improving" is a
>> reasonable measure because the default value affects all slab
>> caches and will inevitably increase overall memory usage.
>
> Yeah that's the thing, some workloads might just keep improving as
> you throw more caching at them, but there's a memory usage cost to
> that. A case of a stress test doing nothing but forks might also
> not be representative of the performance of forks under a normal
> workload where other operations also happen, returning the related
> slab objects, so in the end it doesn't expose the batch size that
> much.
>
>> Hopefully we could discuss what a reasonable heuristic that "works
>> for most situations" looks like, and allow users to tune it
>> further based on their needs.
>
> As a side note, changing sheaf capacity at runtime is not supported
> yet (I'm working on it), targeting at least before the next LTS.