Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
From: Mateusz Guzik
Date: Thu Apr 24 2025 - 12:03:38 EST
On Thu, Apr 24, 2025 at 5:50 PM Christoph Lameter (Ampere)
<cl@xxxxxxxxxx> wrote:
>
> On Thu, 24 Apr 2025, Harry Yoo wrote:
>
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release
> > operations on that mutex.
>
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
>
> Could you keep a cache of recently used per cpu regions so that you can
> avoid frequent percpu allocation operations?
>
> You could allocate larger percpu areas for a batch of them and
> then assign as needed.
I was considering a mechanism like that earlier, but the changes
needed to make it happen would leave the alloc/free path in a worse
state overall.
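For concreteness, roughly the shape I had in mind (a purely
hypothetical sketch, all names invented):

/* hypothetical per-size cache of ready-to-use percpu regions */
struct pcpu_region_cache {
        spinlock_t lock;
        int nr;
        void __percpu *regions[16];
};

static void __percpu *pcpu_cached_alloc(struct pcpu_region_cache *c,
                                        size_t size)
{
        void __percpu *p = NULL;

        spin_lock(&c->lock);
        if (c->nr > 0)
                p = c->regions[--c->nr];
        spin_unlock(&c->lock);

        /* miss: fall back to the real allocator and pcpu_alloc_mutex */
        return p ? p : __alloc_percpu(size, __alignof__(unsigned long));
}

Even with something along these lines in place, the percpu allocation
itself is not the only global choke point.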
RSS counters are embedded into the mm, with only the per-cpu areas
being reached through a pointer. The machinery maintains a global list
of all counter instances, i.e. pointers into each mm_struct. That is
to say, even if you deserialized allocation of the percpu memory
itself, you would still globally serialize on adding/removing the
counters to/from that list.
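For reference, a simplified sketch of that machinery (the real code
lives in lib/percpu_counter.c and only maintains the list under
CONFIG_HOTPLUG_CPU; locking and lockdep details elided):

static LIST_HEAD(percpu_counters);
static DEFINE_SPINLOCK(percpu_counters_lock);

int percpu_counter_init_sketch(struct percpu_counter *fbc, s64 amount)
{
        unsigned long flags;

        fbc->count = amount;
        fbc->counters = alloc_percpu(s32);      /* takes pcpu_alloc_mutex */
        if (!fbc->counters)
                return -ENOMEM;

        /* a second global serialization point, independent of the first */
        spin_lock_irqsave(&percpu_counters_lock, flags);
        list_add(&fbc->list, &percpu_counters);
        spin_unlock_irqrestore(&percpu_counters_lock, flags);
        return 0;
}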
But suppose this got reworked somehow and this bit ceased to be a problem.
Another spot where mm alloc/free globally serializes (at least on
x86_64) is pgd_alloc/pgd_free taking the global pgd_lock.
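Roughly (simplified from arch/x86/mm/pgtable.c):

/* every mm creation/teardown funnels through this one lock */
static DEFINE_SPINLOCK(pgd_lock);
static LIST_HEAD(pgd_list);

static void pgd_ctor_sketch(struct mm_struct *mm, pgd_t *pgd)
{
        spin_lock(&pgd_lock);
        /* copy the kernel half of the page tables... */
        clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
                        swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                        KERNEL_PGD_PTRS);
        /* ...and register the pgd so kernel mapping updates can be
         * synced into it later */
        list_add(&virt_to_page(pgd)->lru, &pgd_list);
        spin_unlock(&pgd_lock);
}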
Suppose you managed to decompose that lock into something
finer-grained, to the point where it no longer poses a problem from a
contention standpoint. Even then, that is work which does not have to
happen there in the first place.
The general theme is that there is a lot of expensive work happening
when dealing with the mm lifecycle (from *both* the single- and
multi-threaded standpoint), and preferably it would be paid only once
per object's existence.
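Which is what the ctor/dtor approach buys. A hypothetical sketch of
the direction (not the actual patches; how the dtor gets registered is
whatever API the series adds, upstream kmem_cache_create() takes a
ctor only):

static void mm_ctor(void *obj)
{
        struct mm_struct *mm = obj;

        /* paid once, when the slab object is first created; note a
         * slab ctor cannot return failure, error handling elided */
        percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL,
                                 NR_MM_COUNTERS);
}

static void mm_dtor(void *obj)
{
        struct mm_struct *mm = obj;

        /* paid only when the backing slab pages get freed */
        percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
}

With that, the common allocate/free cycle reuses objects whose percpu
state is already set up and never touches pcpu_alloc_mutex or the
counter list.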
--
Mateusz Guzik <mjguzik gmail.com>