Re: [REGRESSION] slab: replace cpu (partial) slabs with sheaves

From: Uladzislau Rezki

Date: Thu Mar 26 2026 - 14:17:14 EST


On Thu, Mar 26, 2026 at 03:42:02PM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/26/26 13:43, Aishwarya Rambhadran wrote:
> > Hi Vlastimil, Harry,
>
> Hi!
>
> > We have observed a few kernel performance benchmark regressions,
> > mainly in perf & vmalloc workloads, when comparing v6.19 mainline
> > kernel results against later releases in the v7.0 cycle.
> > Independent bisections on different machines consistently point
> > to commits within the slab percpu sheaves series. However, towards
> > the end of the bisection, the signal becomes less clear, so it's
> > not yet certain which specific commit within the series is the
> > root cause.
> >
> > The workloads were triggered on AWS Graviton3 (arm64) & AWS Intel
> > Sapphire Rapids (x86_64) systems in which the regressions are
> > reproducible across different kernel release candidates.
> > (R)/(I) mean statistically significant regression/improvement,
> > where "statistically significant" means the 95% confidence
> > intervals do not overlap.
> >
> > Given below are the performance benchmark results generated by
> > Fastpath Tool, for different kernel -rc versions relative to the
> > base version v6.19, executed on the mentioned SUTs. The perf/
> > syscall benchmarks (execve/fork) regress consistently by ~6–11% on
> > both arm64 and x86_64 across v7.0-rc1 to rc5, while vmalloc
> > workloads show smaller but stable regressions (~2–10%), particularly
> > in kvfree_rcu paths.
> >
> > Regressions on AWS Intel Sapphire Rapids (x86_64) :
>
> The table formatting is broken for me, can you resend it please? Maybe a
> .txt attachment would work better.
>
> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> > | Benchmark       | Result Class                                             |   6-19-0 (base) |   7-0-0-rc1 |   7-0-0-rc2 |   7-0-0-rc2-gaf4e9ef3d784 |   7-0-0-rc3 |   7-0-0-rc4 |   7-0-0-rc5 |
> > +=================+==========================================================+=================+=============+=============+===========================+=============+=============+=============+
> > | micromm/vmalloc | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       262605.17 |      -4.94% |      -7.48% |                (R) -8.11% |      -4.51% |      -6.23% |      -3.47% |
> > |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       253198.67 |      -7.56% | (R) -10.57% |               (R) -10.13% |  (R) -7.07% |      -6.37% |      -6.55% |
> > |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |       197904.67 |      -2.07% |      -3.38% |                    -2.07% |      -2.97% |  (R) -4.30% |      -3.39% |
> > |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |      1707089.83 |      -2.63% |  (R) -3.69% |                (R) -3.25% |  (R) -2.87% |      -2.22% |  (R) -3.63% |
> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> > | perf/syscall    | execve (ops/sec)                                         |         1202.92 |  (R) -7.15% |  (R) -7.05% |                (R) -7.03% |  (R) -7.93% |  (R) -6.51% |  (R) -7.36% |
> > |                 | fork (ops/sec)                                           |          996.00 |  (R) -9.00% | (R) -10.27% |                (R) -9.92% | (R) -11.19% | (R) -10.69% | (R) -10.28% |
> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> >
> > Regressions on AWS Graviton3 (arm64) :
> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> > | Benchmark       | Result Class                                             |   6-19-0 (base) |   7-0-0-rc1 |   7-0-0-rc2 |   7-0-0-rc2-gaf4e9ef3d784 |   7-0-0-rc3 |   7-0-0-rc4 |   7-0-0-rc5 |
> > +=================+==========================================================+=================+=============+=============+===========================+=============+=============+=============+
> > | micromm/vmalloc | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |       320101.50 |  (R) -4.72% |  (R) -3.81% |                (R) -5.05% |      -3.06% |      -3.16% |  (R) -3.91% |
> > |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |       522072.83 |  (R) -2.15% |      -1.25% |                (R) -2.16% |  (R) -2.13% |      -2.10% |      -1.82% |
> > |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |      1041640.33 |      -0.50% |  (R) -2.04% |                    -1.43% |      -0.69% |      -1.78% |  (R) -2.03% |
> > |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |      2255794.00 |      -1.51% |  (R) -2.24% |                (R) -2.33% |      -1.14% |      -0.94% |      -1.60% |
> > |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       343543.83 |  (R) -4.50% |  (R) -3.54% |                (R) -5.00% |  (R) -4.88% |  (R) -4.01% |  (R) -5.54% |
> > |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       342290.33 |  (R) -5.15% |  (R) -3.24% |                (R) -3.76% |  (R) -5.37% |  (R) -3.74% |  (R) -5.51% |
> > |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |      1209666.83 |      -2.43% |      -2.09% |                    -1.19% |  (R) -4.39% |      -1.81% |      -3.15% |
> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> > | perf/syscall    | execve (ops/sec)                                         |         1219.58 |             |  (R) -8.12% |                (R) -7.37% |  (R) -7.60% |  (R) -7.86% |  (R) -7.71% |
> > |                 | fork (ops/sec)                                           |          863.67 |             |  (R) -7.24% |                (R) -7.07% |  (R) -6.42% |  (R) -6.93% |  (R) -6.55% |
> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> >
> >
> > The details of the latest bisections carried out for the regressions
> > listed above are given below:
> > -Graviton3 (arm64)
> >  good: v6.19 (05f7e89ab973)
> >  bad:  v7.0-rc2 (11439c4635ed)
> >  workload: perf/syscall (execve)
> >  bisected to: f1427a1d6415 (“slab: make percpu sheaves compatible with
> >  kmalloc_nolock()/kfree_nolock()”)
> >
> > -Sapphire Rapids (x86_64)
> >  good: v6.19 (05f7e89ab973)
> >  bad:  v7.0-rc3 (1f318b96cc84)
> >  workload: perf/syscall (fork)
> >  bisected to: f1427a1d6415 (“slab: make percpu sheaves compatible with
> >  kmalloc_nolock()/kfree_nolock()”)
> >
> > -Graviton3 (arm64)
> >  good: v6.19 (05f7e89ab973)
> >  bad:  v7.0-rc3 (1f318b96cc84)
> >  workload: perf/syscall (execve)
> >  bisected to: f3421f8d154c (“slab: introduce percpu sheaves bootstrap”)
>
> Yeah none of these are likely to introduce the regression.
> We've seen other reports from e.g. lkp pointing to later commits that remove
> the cpu (partial) slabs. The theory is that on benchmarks that stress vma
> and maple node caches (fork and execve are likely those), the introduction
> of sheaves in 6.18 (for those caches only) resulted in ~doubled percpu
> caching capacity (and likely associated performance increase) - by sheaves
> backed by cpu (partial) slabs. Removing the latter then looks like a
> regression in isolation in the 7.0 series.
>
> A regression of vmalloc related to kvfree_rcu might be new. Although if it's
> kvfree_rcu() of vmalloc'd objects, it would be weird. More likely they are
> kvmalloc'd but small enough to be actually kmalloc'd? What are the details
> of that test?
>
static int
kvfree_rcu_2_arg_vmalloc_test(void)
{
	struct test_kvfree_rcu *p;
	int i;

	for (i = 0; i < test_loop_count; i++) {
		p = vmalloc(1 * PAGE_SIZE);
		if (!p)
			return -1;

		p->array[0] = 'a';
		kvfree_rcu(p, rcu);
	}

	return 0;
}

static bool kfree_rcu_sheaf(void *obj)
{
	struct kmem_cache *s;
	struct slab *slab;

	if (is_vmalloc_addr(obj))
		return false;

	slab = virt_to_slab(obj);
	if (unlikely(!slab))
		return false;

	s = slab->slab_cache;
	if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id()))
		return __kfree_rcu_sheaf(s, obj);

	return false;
}

The test frees a vmalloc()'ed pointer, so kfree_rcu_sheaf() returns false
at the is_vmalloc_addr() check. The object does not go via a sheaf.
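
For illustration only, the decision order of kfree_rcu_sheaf() can be
modelled in userspace. This is a sketch under assumptions, not kernel code:
the model_* names, the address window standing in for
VMALLOC_START..VMALLOC_END and the slab_is_local flag are all hypothetical,
the !slab case is left out, and it assumes __kfree_rcu_sheaf() accepts the
object whenever it is reached.

```c
/* Userspace model only: none of these helpers exist in the kernel. */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical window standing in for VMALLOC_START..VMALLOC_END. */
#define MODEL_VMALLOC_START ((uintptr_t)0x1000000u)
#define MODEL_VMALLOC_END   ((uintptr_t)0x2000000u)

enum model_free_path {
	MODEL_PATH_SHEAF,	/* object would be handed to the percpu sheaf */
	MODEL_PATH_FALLBACK,	/* kfree_rcu_sheaf() returned false */
};

/* Roughly the kind of range test is_vmalloc_addr() performs. */
static bool model_is_vmalloc_addr(uintptr_t addr)
{
	return addr >= MODEL_VMALLOC_START && addr < MODEL_VMALLOC_END;
}

/*
 * Same order of checks as the quoted kfree_rcu_sheaf(): the vmalloc
 * test comes first, then the NUMA locality test (slab_is_local stands
 * in for "!CONFIG_NUMA || slab_nid(slab) == numa_mem_id()").
 */
static enum model_free_path model_kvfree_rcu_path(uintptr_t addr,
						  bool slab_is_local)
{
	if (model_is_vmalloc_addr(addr))
		return MODEL_PATH_FALLBACK;
	if (!slab_is_local)
		return MODEL_PATH_FALLBACK;
	return MODEL_PATH_SHEAF;
}
```

In this model any address inside the window takes the fallback path
regardless of slab_is_local, which is the situation the vmalloc test above
is in on every loop iteration.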

--
Uladzislau Rezki