Re: [PATCH 0/8] per-memcg-per-node kmem accounting
From: Alexandre Ghiti
Date: Wed May 20 2026 - 04:58:46 EST
Hi Joshua,
On 5/18/26 16:57, Joshua Hahn wrote:
On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@xxxxxxxx> wrote:
Hello Alex,
This series pursues the work initiated by Joshua [1]. We need kernel
memory to be accounted on a per-node basis in order to be able to
know the memcg and physical memory association.
This series takes advantage of the recent introduction of per-node
obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
The bulk of the series is percpu per-node accounting: percpu
"precharges" the memcg before we know the actual location of the pages
it uses, so charging and accounting had to be split. All other kmem
users (slab, zswap, __memcg_kmem_charge_page) are straightforward
conversions (zswap support is limited in this series because Joshua
is working on it in parallel [3]).
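To make the charge/accounting split concrete, here is a minimal userspace sketch. It assumes a two-node system and uses hypothetical names (memcg_model, model_precharge, model_account) that are not taken from the series: the charge is taken before any backing page is known, and the per-node attribution happens later, once each page's node is known.

```c
#include <assert.h>

#define MODEL_NR_NODES 2	/* assumed node count, for illustration only */

/* Hypothetical model of a memcg's kmem state; not the kernel's struct mem_cgroup. */
struct memcg_model {
	unsigned long charged;				/* bytes charged against the limit */
	unsigned long kmem_node[MODEL_NR_NODES];	/* bytes accounted per NUMA node */
};

/* Step 1: charge up front, before any backing page (and so any node) is known. */
static void model_precharge(struct memcg_model *m, unsigned long bytes)
{
	m->charged += bytes;
}

/* Step 2: once a backing page's node is known, attribute the bytes to that node. */
static void model_account(struct memcg_model *m, int nid, unsigned long bytes)
{
	assert(nid >= 0 && nid < MODEL_NR_NODES);
	m->kmem_node[nid] += bytes;
}

/* Sum of the per-node counters; catches up with `charged` as pages are placed. */
static unsigned long model_accounted(const struct memcg_model *m)
{
	unsigned long sum = 0;

	for (int n = 0; n < MODEL_NR_NODES; n++)
		sum += m->kmem_node[n];
	return sum;
}
```

Between the two steps the memcg is charged for memory that no node counter reflects yet, which is why the two operations cannot stay fused in this model.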
Thanks Joshua for your early feedback!
Thank you for your work!
Overall, the direction makes sense to me. Pre-overcharging also seems like a
reasonable approach: we would much rather overaccount than underaccount and
later have to breach limits.
I do have some concerns about performance, though. Namely, there are some
operations that look expensive to me and that would benefit from some
performance benchmarking with this patch added (maybe some simple
microbenchmarks that demonstrate kernel allocation overhead could be useful).
From what I can tell, there is some additional performance overhead that has
to do with iterating over num_possible_cpus() x pages_per_alloc, which
doesn't seem trivial to me.
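As a rough illustration of how that product grows, the sketch below counts one node lookup per backing page per possible CPU. It assumes a 4 KiB page size and ignores allocations that straddle a page boundary; lookups_model() is a hypothetical helper, not a function from the series.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size */

/* Pages spanned by one CPU's copy of the allocation (offset within the page ignored). */
static unsigned long pages_per_alloc_model(size_t size)
{
	return (size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* One lookup per page per possible CPU: the product raised above. */
static unsigned long lookups_model(unsigned int nr_possible_cpus, size_t size)
{
	assert(nr_possible_cpus > 0);
	return (unsigned long)nr_possible_cpus * pages_per_alloc_model(size);
}
```

Under these assumptions, even a small percpu allocation costs at least one lookup per possible CPU, so the overhead scales with machine size rather than with the size of the object.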
Indeed, let me microbenchmark the overhead on a large system.
Another concern that I see is the stock credit system. Maybe we could be
bypassing the stock check leading to more time spent doing the atomic
operations.
I'm not following on this one, which atomic operations do you see that could be bypassed?
obj_stock caches a single obj_cgroup, which means that if we split the objcg
to be per-node (in patch 6), then the obj_stock basically gets invalidated
every operation since we iterate over more objcgs (even though we are in
the same logical objcg). Maybe I'm missing something?
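The invalidation concern can be sketched with a toy single-slot cache. This is a userspace model with hypothetical names (stock_model, objcg_model), not the kernel's obj_stock code: a charge hits only when it targets the objcg that is already cached, so rotating over per-node objcgs of the same memcg misses every time.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NODES 4	/* assumed upper bound, for illustration only */

/* Hypothetical stand-in for a per-node obj_cgroup; not the kernel struct. */
struct objcg_model {
	int memcg_id;
	int nid;
};

/* Hypothetical single-slot per-CPU stock: it remembers only one objcg. */
struct stock_model {
	const struct objcg_model *cached;
	unsigned long hits;
	unsigned long misses;	/* each miss models a drain plus a slow-path charge */
};

static void stock_charge_model(struct stock_model *s, const struct objcg_model *objcg)
{
	assert(objcg != NULL);
	if (s->cached == objcg) {
		s->hits++;
		return;
	}
	s->misses++;
	s->cached = objcg;	/* the previous objcg's stock is dropped */
}

/* Charge `rounds` times, rotating over `nr_nodes` objcgs of one memcg; return the misses. */
static unsigned long stock_misses_model(int nr_nodes, int rounds)
{
	struct objcg_model objcgs[MODEL_MAX_NODES];
	struct stock_model stock = { NULL, 0, 0 };

	assert(nr_nodes >= 1 && nr_nodes <= MODEL_MAX_NODES);
	for (int n = 0; n < nr_nodes; n++) {
		objcgs[n].memcg_id = 1;	/* same logical memcg on every node */
		objcgs[n].nid = n;
	}
	for (int i = 0; i < rounds; i++)
		stock_charge_model(&stock, &objcgs[i % nr_nodes]);
	return stock.misses;
}
```

With one objcg only the first charge misses. With two or more objcgs visited in rotation, every charge misses in this model, which is the worst case for the pattern described above.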
The objcg split comes from commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") and the problem you describe is exactly what Shakeel is trying to fix [1].
But I remember trying a microbenchmark and noticing a +5% regression (on top of the 67% then...). I'll rebase this series on top of Shakeel's and re-run.
[1] https://lore.kernel.org/linux-mm/20260520053123.2709959-1-shakeel.butt@xxxxxxxxx/T/#m127d4969b105c046a2a21e3c79c963771007583d
I haven't taken a deep look at the implementation details but just wanted to
raise some high level items that I noticed. Of course, all of these concerns
are just theoretical; if you can show that the performance delta is not
noticeable then all of my concerns don't matter.
I also want to talk more about the local credit system, but let's first see
what the numbers are.
Thanks again, Alex. And I really like patch 2 because it is a solution to
a problem that I ran into in my percpu tracking series that I couldn't think
of before! Thank you for solving my problem too :-)
Great then, thanks :)
Alex
Have a great day!
Joshua
[1] https://lore.kernel.org/linux-mm/20260404033844.1892595-1-joshua.hahnjy@xxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@xxxxxxxxxxxxx/
[3] https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@xxxxxxxxx/
Alexandre Ghiti (8):
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm: percpu: charge obj_exts allocation with __GFP_ACCOUNT
  mm: percpu: Split memcg charging and kmem accounting
  mm: memcontrol: track MEMCG_KMEM per NUMA node
  mm: memcontrol: per-node kmem accounting for page charges
  mm: slab: per-node kmem accounting for slab
  mm: percpu: per-node kmem accounting using local credit
  mm: zswap: per-node kmem accounting for zswap/zsmalloc
include/linux/memcontrol.h | 27 +++++--
include/linux/mmzone.h | 1 +
include/linux/zsmalloc.h | 2 +
mm/memcontrol.c | 150 ++++++++++++++++++++++++++++---------
mm/percpu-internal.h | 16 +---
mm/percpu.c | 90 ++++++++++++++++++++--
mm/vmstat.c | 1 +
mm/zsmalloc.c | 11 +++
mm/zswap.c | 9 ++-
9 files changed, 242 insertions(+), 65 deletions(-)
--
2.54.0