Re: [PATCH 0/6 v3] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

From: Joshua Hahn

Date: Fri Jun 05 2026 - 13:17:59 EST

On Fri, 5 Jun 2026 08:35:56 -0700 Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:

> Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> allocations, allowing small allocations to avoid walking the expensive
> mem_cgroup hierarchy traversal and atomic operations on each charge.
> This design introduces a fastpath, but there is room for improvement:
>
> 1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
> more than 7 mem_cgroups are actively charging on a single CPU, a
> random victim is evicted and its associated stock is drained.
>
> 2. Stock management is tightly coupled to struct mem_cgroup, which makes
> it difficult to add a new page_counter to mem_cgroup and have
> multiple sources of stock management, which is required when trying
> to introduce fastpaths to multiple hard limit checks.
>
> This series moves the per-cpu stock down into the page_counter which
> consolidates stock limit checking and page_counter limit checking into
> page_counter_try_charge. This eliminates the 7-memcg-per-cpu slot limit,
> the random evictions (drain & refill), and slot traversal.

Hello reviewers,

I was hoping to receive some input on a point that Sashiko raises.
The draining work we do per-cpu uses work_on_cpu(), which does
schedule_work_on() and flush_work() on the system_percpu_wq, which is
not WQ_MEM_RECLAIM. drain_all_stock() runs from try_charge_memcg() on
the reclaim path, so it trips the warning in check_flush_dependency(),
since a WQ_MEM_RECLAIM context is flushing a !WQ_MEM_RECLAIM workqueue.
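
To make the dependency concrete, here is a rough userspace model of the
condition as I understand it. It is only a sketch, not the kernel code:
struct model_wq, MODEL_WQ_MEM_RECLAIM and model_flush_would_warn() are
made-up names, and it leaves out the other checks that
check_flush_dependency() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's WQ_MEM_RECLAIM flag. */
#define MODEL_WQ_MEM_RECLAIM (1u << 0)

/* Hypothetical, minimal model of a workqueue: a name and its flags. */
struct model_wq {
	const char *name;
	unsigned int flags;
};

/*
 * Returns true when the modelled flush would be flagged: the flusher
 * runs as a worker of a reclaim-capable workqueue, and the target
 * workqueue has no forward-progress guarantee of its own.
 * A NULL flusher models a caller that is not a workqueue worker.
 */
static bool model_flush_would_warn(const struct model_wq *flusher,
				   const struct model_wq *target)
{
	assert(target != NULL);

	/* Flushing a reclaim-capable workqueue is always fine. */
	if (target->flags & MODEL_WQ_MEM_RECLAIM)
		return false;

	/* Not running from a workqueue worker: nothing to check here. */
	if (flusher == NULL)
		return false;

	/* Reclaim-capable flusher waiting on a non-reclaim target. */
	return (flusher->flags & MODEL_WQ_MEM_RECLAIM) != 0;
}
```

In this model, a flusher with the flag set waiting on a target without
it is flagged, which is the case the drain path runs into. Option 2
below avoids it by giving the target the flag as well.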

In my testing, I haven't seen this become an issue. The flushing and
draining work only takes a local_lock() and does atomic operations,
and it never allocates, so there is no question of whether we can make
forward progress.

But this does trip the WARN_ON, since the workqueue code cannot know that.
I see three options here:

1. Trust that this is OK, and document that we can always make forward
progress.
2. Keep the draining work synchronous, but queue and flush on memcg_wq
marked WQ_MEM_RECLAIM instead of just using work_on_cpu(). This would
add 2 words per-cpu-memcg for the work struct backpointer.
3. Go back to asynchronous, which would get rid of all the synchronous
concerns, but add an additional 2 words per-cpu-memcg for the work
struct backpointer here as well.
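
For a sense of scale on the extra 2 words in options 2 and 3, the
arithmetic can be sketched as below. extra_stock_bytes() is a made-up
helper, and the CPU and memcg counts are illustrative inputs, not
measurements from this series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extra bytes needed to store `words` machine words of `word_size`
 * bytes for every (CPU, memcg) pair. Hypothetical helper, used only
 * to illustrate the footprint arithmetic.
 */
static uint64_t extra_stock_bytes(uint64_t nr_cpus, uint64_t nr_memcgs,
				  uint64_t words, uint64_t word_size)
{
	assert(word_size > 0);
	return nr_cpus * nr_memcgs * words * word_size;
}
```

As an example with made-up numbers, extra_stock_bytes(64, 1000, 2, 8)
gives 1,024,000 bytes, so roughly 1 MB for 64 CPUs and 1000 memcgs on
a 64-bit machine.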

What do you think is the right decision here? I had been thinking about
this quite a bit before deciding to just send the series out, but I think
I should have asked for upstream opinion sooner...

I would prefer to keep the memory footprint of this series minimal, and
opting to do things synchronously helped achieve this goal since we can
get rid of the backpointers. But this is beginning to show up as a
tradeoff, so I would really appreciate any input on what seems to be
the best decision here.

Thank you very much for your time!
Joshua