Re: [PATCH v2 7/7] mm: switch deferred split shrinker to list_lru

From: Johannes Weiner

Date: Wed Mar 18 2026 - 18:49:11 EST


On Wed, Mar 18, 2026 at 09:25:17PM +0100, David Hildenbrand (Arm) wrote:
> On 3/12/26 21:51, Johannes Weiner wrote:
> > The deferred split queue handles cgroups in a suboptimal fashion. The
> > queue is per-NUMA node or per-cgroup, not the intersection. That means
> > on a cgrouped system, a node-restricted allocation entering reclaim
> > can end up splitting large pages on other nodes:
> >
> > alloc/unmap
> >   deferred_split_folio()
> >     list_add_tail(memcg->split_queue)
> >     set_shrinker_bit(memcg, node, deferred_shrinker_id)
> >
> > for_each_zone_zonelist_nodemask(restricted_nodes)
> >   mem_cgroup_iter()
> >     shrink_slab(node, memcg)
> >       shrink_slab_memcg(node, memcg)
> >         if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> >           deferred_split_scan()
> >             walks memcg->split_queue
> >
> > The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> > has a single large page on the node of interest, all large pages owned
> > by that memcg, including those on other nodes, will be split.
> >
> > list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> > streamlines a lot of the list operations and reclaim walks. It's used
> > widely by other major shrinkers already. Convert the deferred split
> > queue as well.
> >
> > The list_lru per-memcg heads are instantiated on demand when the first
> > object of interest is allocated for a cgroup, by calling
> > memcg_list_lru_alloc_folio(). Add calls to where splittable pages are
> > created: anon faults, swapin faults, khugepaged collapse.
> >
> > These calls create all possible node heads for the cgroup at once, so
> > the migration code (between nodes) doesn't need any special care.
>
>
> [...]
>
> > -
> >  static inline bool is_transparent_hugepage(const struct folio *folio)
> >  {
> >          if (!folio_test_large(folio))
> > @@ -1293,6 +1189,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
> >                  count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> >                  return NULL;
> >          }
> > +
> > +        if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
> > +                folio_put(folio);
> > +                count_vm_event(THP_FAULT_FALLBACK);
> > +                count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> > +                return NULL;
> > +        }
>
> So, in all anon alloc paths, we essentially have
>
> 1) vma_alloc_folio / __folio_alloc (khugepaged being odd)
> 2) mem_cgroup_charge / mem_cgroup_swapin_charge_folio
> 3) memcg_list_lru_alloc_folio
>
> I wonder if we could do better in most cases and have something like a
>
> vma_alloc_anon_folio()
>
> that wraps vma_alloc_folio() + memcg_list_lru_alloc_folio(), but
> still leaves the charging to the caller?

Hm, but it's the charging that figures out the memcg and sets
folio_memcg() :(
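
So a helper that combines 1) and 3) would have to absorb 2) as well.
Rough sketch of what I think that would look like (illustrative only;
the per-site fallback stats and the non-VMA callers are left out):

static struct folio *vma_alloc_anon_folio(gfp_t gfp, int order,
                                          struct vm_area_struct *vma,
                                          unsigned long addr)
{
        struct folio *folio;

        folio = vma_alloc_folio(gfp, order, vma, addr);         /* 1) */
        if (!folio)
                return NULL;

        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {        /* 2) */
                folio_put(folio);
                return NULL;
        }

        /* 3) folio_memcg() is only valid after the charge above */
        if (folio_order(folio) > 1 &&
            memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
                folio_put(folio);
                return NULL;
        }

        return folio;
}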

> That would at least combine 1) and 3) in a single API (except for the
> odd cases without a VMA).
>
> I guess we would want to skip the memcg_list_lru_alloc_folio() for
> order-0 folios, correct?

Yeah, we don't use the queue for order <= 1. In deferred_split_folio():

        /*
         * Order 1 folios have no space for a deferred list, but we also
         * won't waste much memory by not adding them to the deferred list.
         */
        if (folio_order(folio) <= 1)
                return;

> > @@ -3802,33 +3706,28 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> >          struct folio *new_folio, *next;
> >          int old_order = folio_order(folio);
> >          int ret = 0;
> > -        struct deferred_split *ds_queue;
> > +        struct list_lru_one *l;
> >
> >          VM_WARN_ON_ONCE(!mapping && end);
> >          /* Prevent deferred_split_scan() touching ->_refcount */
> > -        ds_queue = folio_split_queue_lock(folio);
> > +        rcu_read_lock();
>
> The RCU lock is for the folio_memcg(), right?
>
> I recall raising in the past that some get/put-like logic (wrapping
> the rcu_read_lock() + folio_memcg()) might make this a lot easier
> to get right.
>
>
> memcg = folio_memcg_lookup(folio)
>
> ... do stuff
>
> folio_memcg_putback(folio, memcg);
>
> Or sth like that.
>
>
> Alternatively, you could have some helpers that do the
> list_lru_lock+unlock etc.
>
> folio_memcg_list_lru_lock()
> ...
> folio_memcg_list_lru_unlock(l);
>
> Just some thoughts as inspiration :)

I remember you raising this in the objcg + reparenting patches. There
are a few more instances of

        rcu_read_lock()
        foo = folio_memcg()
        ...
        rcu_read_unlock()

in other parts of the code not touched by this series, so the first
pattern is the more universal encapsulation.
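
Something like this, maybe (rough sketch, names borrowed from your
suggestion; the RCU section just keeps the returned memcg alive):

static inline struct mem_cgroup *folio_memcg_lookup(struct folio *folio)
{
        rcu_read_lock();
        return folio_memcg(folio);
}

static inline void folio_memcg_putback(struct mem_cgroup *memcg)
{
        rcu_read_unlock();
}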

Let me look into this. Would you be okay with a follow-up that covers
the others as well?