Re: [PATCH v2 7/7] mm: switch deferred split shrinker to list_lru

From: David Hildenbrand (Arm)

Date: Wed Mar 18 2026 - 16:25:55 EST


On 3/12/26 21:51, Johannes Weiner wrote:
> The deferred split queue handles cgroups in a suboptimal fashion. The
> queue is per-NUMA node or per-cgroup, not the intersection. That means
> on a cgrouped system, a node-restricted allocation entering reclaim
> can end up splitting large pages on other nodes:
>
> alloc/unmap
> deferred_split_folio()
> list_add_tail(memcg->split_queue)
> set_shrinker_bit(memcg, node, deferred_shrinker_id)
>
> for_each_zone_zonelist_nodemask(restricted_nodes)
> mem_cgroup_iter()
> shrink_slab(node, memcg)
> shrink_slab_memcg(node, memcg)
> if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> deferred_split_scan()
> walks memcg->split_queue
>
> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> has a single large page on the node of interest, all large pages owned
> by that memcg, including those on other nodes, will be split.
>
> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> streamlines a lot of the list operations and reclaim walks. It's used
> widely by other major shrinkers already. Convert the deferred split
> queue as well.
>
> The list_lru per-memcg heads are instantiated on demand when the first
> object of interest is allocated for a cgroup, by calling
> memcg_list_lru_alloc_folio(). Add calls to where splittable pages are
> created: anon faults, swapin faults, khugepaged collapse.
>
> These calls create all possible node heads for the cgroup at once, so
> the migration code (between nodes) doesn't need any special care.


[...]

> -
> static inline bool is_transparent_hugepage(const struct folio *folio)
> {
> if (!folio_test_large(folio))
> @@ -1293,6 +1189,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
> count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> return NULL;
> }
> +
> + if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
> + folio_put(folio);
> + count_vm_event(THP_FAULT_FALLBACK);
> + count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> + return NULL;
> + }

So, in all anon alloc paths, we essentially have

1) vma_alloc_folio / __folio_alloc (khugepaged being odd)
2) mem_cgroup_charge / mem_cgroup_swapin_charge_folio
3) memcg_list_lru_alloc_folio

I wonder if we could do better in most cases and have something like a

vma_alloc_anon_folio()

that wraps the vma_alloc_folio() + memcg_list_lru_alloc_folio(), but
still leaves the charging to the caller?

That would at least combine 1) and 3) into a single API (except for the
odd cases without a VMA).

I guess we would want to skip the memcg_list_lru_alloc_folio() for
order-0 folios, correct?
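
Rough sketch of what I have in mind -- purely illustrative, names and
placement made up, and assuming the list_lru heads are only needed for
order > 0:

/*
 * Illustrative only: allocate the anon folio and pre-allocate the
 * per-memcg list_lru heads of the deferred split queue in one go,
 * leaving the memcg charging to the caller.
 */
static struct folio *vma_alloc_anon_folio(gfp_t gfp, int order,
		struct vm_area_struct *vma, unsigned long addr)
{
	struct folio *folio;

	folio = vma_alloc_folio(gfp, order, vma, addr);
	if (!folio)
		return NULL;

	/* Order-0 folios never end up on the deferred split queue. */
	if (order &&
	    memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
		folio_put(folio);
		return NULL;
	}

	return folio;
}

The THP_FAULT_FALLBACK / mthp stat accounting would stay with the
callers, since that differs per path anyway.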

> +
> folio_throttle_swaprate(folio, gfp);
>
> /*
> @@ -3802,33 +3706,28 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> struct folio *new_folio, *next;
> int old_order = folio_order(folio);
> int ret = 0;
> - struct deferred_split *ds_queue;
> + struct list_lru_one *l;
>
> VM_WARN_ON_ONCE(!mapping && end);
> /* Prevent deferred_split_scan() touching ->_refcount */
> - ds_queue = folio_split_queue_lock(folio);
> + rcu_read_lock();

The RCU lock is for the folio_memcg(), right?

I recall raising in the past that some get/put-like logic (that wraps
the rcu_read_lock() + folio_memcg()) might make this a lot easier to
get right.


memcg = folio_memcg_lookup(folio)

... do stuff

folio_memcg_putback(folio, memcg);

Or sth like that.
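
A minimal sketch of what such helpers could look like (names as above,
purely illustrative):

static inline struct mem_cgroup *folio_memcg_lookup(struct folio *folio)
{
	/* Pin the folio's memcg association until the putback. */
	rcu_read_lock();
	return folio_memcg(folio);
}

static inline void folio_memcg_putback(struct folio *folio,
				       struct mem_cgroup *memcg)
{
	/* Arguments only exist to pair up with the lookup. */
	rcu_read_unlock();
}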


Alternatively, you could have some helpers that do the
list_lru_lock()+unlock() etc.

folio_memcg_list_lru_lock()
...
folio_memcg_list_lru_unlock(l);

Just some thoughts as inspiration :)
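
For that alternative, something along these lines (again just an
illustration on top of the list_lru_lock()/list_lru_unlock() API from
this patch):

static inline struct list_lru_one *
folio_memcg_list_lru_lock(struct list_lru *lru, struct folio *folio)
{
	/* RCU stabilizes folio_memcg() until the matching unlock. */
	rcu_read_lock();
	return list_lru_lock(lru, folio_nid(folio), folio_memcg(folio));
}

static inline void folio_memcg_list_lru_unlock(struct list_lru_one *l)
{
	list_lru_unlock(l);
	rcu_read_unlock();
}

__folio_freeze_and_split_unmapped() below would then just pair
folio_memcg_list_lru_lock(&deferred_split_lru, folio) with
folio_memcg_list_lru_unlock(l) instead of open-coding the RCU section.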

> + l = list_lru_lock(&deferred_split_lru, folio_nid(folio), folio_memcg(folio));
> if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
> struct swap_cluster_info *ci = NULL;
> struct lruvec *lruvec;
>
> if (old_order > 1) {
> - if (!list_empty(&folio->_deferred_list)) {
> - ds_queue->split_queue_len--;
> - /*
> - * Reinitialize page_deferred_list after removing the
> - * page from the split_queue, otherwise a subsequent
> - * split will see list corruption when checking the
> - * page_deferred_list.
> - */
> - list_del_init(&folio->_deferred_list);
> - }
> + __list_lru_del(&deferred_split_lru, l,
> + &folio->_deferred_list, folio_nid(folio));
> if (folio_test_partially_mapped(folio)) {
> folio_clear_partially_mapped(folio);
> mod_mthp_stat(old_order,
> MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> }
> }
> - split_queue_unlock(ds_queue);
> + list_lru_unlock(l);
> + rcu_read_unlock();
> +
> if (mapping) {

[...]

Most changes here look mostly mechanical, quite nice. I'll probably
have to go over some bits once again with a fresh mind :)

--
Cheers,

David