[PATCH v3 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching

From: Leno Hou via B4 Relay

Date: Mon Mar 16 2026 - 01:56:54 EST


When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim path.
This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.

Problem Description
===================

The issue arises from a "reclaim vacuum" during the transition.

1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
   false before the pages are drained from the MGLRU lists back to the
   traditional LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
   and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
   yet, or the change may not yet be visible to all CPUs due to a lack
   of synchronization.
4. get_scan_count() subsequently finds the traditional LRU lists empty,
   concludes there is no reclaimable memory, and triggers an OOM kill.

A similar race can occur during enablement, where the reclaimer sees the
new state but the MGLRU lists haven't been populated via fill_evictable()
yet.

Solution
========

Introduce a 'draining' state (`lru_drain_core`) to bridge the transition.
When transitioning, the system enters this intermediate state where
the reclaimer is forced to attempt both MGLRU and traditional reclaim
paths sequentially. This ensures that folios remain visible to at least
one reclaim mechanism until the transition is fully materialized across
all CPUs.

Changes
=======

v3:
- Rebase onto the mm-new branch for queue testing.
- Skip look-around (lru_gen_look_around()) while draining.
- Address Barry Song's review comments.

v2:
- Replace the draining flag with a static branch, `lru_drain_core`, to
  track the transition state.
- Ensure all LRU helpers correctly identify page state by checking
  folio_lru_gen(folio) != -1 instead of relying solely on global flags.
- Maintain workingset refault context across MGLRU state transitions.
- Fix a build error when CONFIG_LRU_GEN is disabled.

v1:
- Use smp_store_release() and smp_load_acquire() to ensure the visibility
  of the 'enabled' and 'draining' flags across CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
  is in the 'draining' state, the reclaimer attempts to scan the MGLRU
  lists first and then falls through to the traditional LRU lists instead
  of returning early. This ensures that folios are visible to at least
  one reclaim path at any given time.

Race & Mitigation
=================

A race window exists between checking the 'draining' state and performing
the actual list operations. For instance, a reclaimer might observe the
draining state as false just before it changes, leading to a suboptimal
reclaim path decision.

However, this impact is effectively mitigated by the kernel's reclaim
retry mechanism (e.g., in do_try_to_free_pages()). If a reclaim pass fails
to find eligible folios due to a state transition race, subsequent retries
in the loop will observe the updated state and correctly direct the scan
to the appropriate LRU lists. This ensures the transient inconsistency
does not escalate into a terminal OOM kill.

This effectively narrows the race window that previously triggered OOMs
under high memory pressure.

Reproduction
============

The issue was consistently reproduced on v6.1.157 and v6.18.3 using a
high-pressure memory cgroup (v1) environment.

Reproduction steps:
1. Create a 16GB memcg and populate it with 10GB of file cache (5GB
   active) and 8GB of active anonymous memory.
2. Toggle the MGLRU state while performing new memory allocations to
   force direct reclaim.

Reproduction script
===================

```bash
#!/bin/bash

MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"

switch_mglru() {
    local orig_val
    orig_val=$(cat "$MGLRU_FILE")
    if [[ "$orig_val" != "0x0000" ]]; then
        echo n > "$MGLRU_FILE" &
    else
        echo y > "$MGLRU_FILE" &
    fi
}

mkdir -p "$CGROUP_PATH"
echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
echo $$ > "$CGROUP_PATH/cgroup.procs"

dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
dd if=/tmp/test_file of=/dev/null bs=1M # Warm up the page cache

stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
sleep 5

switch_mglru
stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || \
    echo "OOM Triggered"

grep oom_kill "$CGROUP_PATH/memory.oom_control"
```

Signed-off-by: Leno Hou <lenohou@xxxxxxxxx>
---
Leno Hou (2):
mm/mglru: fix cgroup OOM during MGLRU state switching
mm/mglru: maintain workingset refault context across state transitions

 include/linux/mm_inline.h | 16 ++++++++++++++
 include/linux/swap.h      |  2 +-
 mm/rmap.c                 |  2 +-
 mm/swap.c                 | 15 +++++++------
 mm/vmscan.c               | 55 +++++++++++++++++++++++++++++++++++------------
 mm/workingset.c           | 22 +++++++++++++------
 6 files changed, 83 insertions(+), 29 deletions(-)
---
base-commit: c5a81ff6071bcf42531426e6336b5cc424df6e3d
change-id: 20260311-b4-switch-mglru-v2-8b926a03843f

Best regards,
--
Leno Hou <lenohou@xxxxxxxxx>