Re: [PATCH v5] mm/mglru: fix cgroup OOM during MGLRU state switching

From: Axel Rasmussen

Date: Thu Mar 19 2026 - 17:05:25 EST


Looks reasonable to me, no objections. Implementation is pretty simple.

Reviewed-by: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>

On Thu, Mar 19, 2026 at 1:49 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Thu, Mar 19, 2026 at 11:40 AM Leno Hou via B4 Relay
> <devnull+lenohou.gmail.com@xxxxxxxxxx> wrote:
> >
> > From: Leno Hou <lenohou@xxxxxxxxx>
> >
> > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > condition exists between the state switching and the memory reclaim path.
> > This can lead to unexpected cgroup OOM kills, even when plenty of
> > reclaimable memory is available.
> >
> > Problem Description
> > ==================
> > The issue arises from a "reclaim vacuum" during the transition.
> >
> > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > false before the pages are drained from MGLRU lists back to traditional
> > LRU lists.
> > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > and skip the MGLRU path.
> > 3. However, these pages might not have reached the traditional LRU lists
> > yet, or the changes are not yet visible to all CPUs due to a lack
> > of synchronization.
> > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > concludes there is no reclaimable memory, and triggers an OOM kill.
> >
> > A similar race can occur during enablement, where the reclaimer sees the
> > new state but the MGLRU lists haven't been populated via fill_evictable()
> > yet.
> >
> > Solution
> > ========
> > Introduce a 'switching' state (`lru_switch`) to bridge the transition.
> > When transitioning, the system enters this intermediate state where
> > the reclaimer is forced to attempt both MGLRU and traditional reclaim
> > paths sequentially. This ensures that folios remain visible to at least
> > one reclaim mechanism until the transition is fully materialized across
> > all CPUs.
> >
> > Changes
> > =======
> > v5:
> > - Rename lru_gen_draining to lru_gen_switching; lru_drain_core to
> > lru_switch
> > - Add more documentation for folio_referenced_one
> > - Keep folio_check_references unchanged
> >
> > v4:
> > - Address Sashiko.dev's AI code-review comments
> > - Remove the patch that maintains workingset refault context across
> > MGLRU state transitions
> > - Remove the folio_lru_gen(folio) != -1 check introduced in the v2 patch
> >
> > v3:
> > - Rebase onto mm-new branch for queue testing
> > - Don't look around while draining
> > - Address Barry Song's review comments
> >
> > v2:
> > - Use a static branch `lru_drain_core` to track the transition
> > state.
> > - Ensure all LRU helpers correctly identify page state by checking
> > folio_lru_gen(folio) != -1 instead of relying solely on global flags.
> > - Maintain workingset refault context across MGLRU state transitions
> > - Fix build error when CONFIG_LRU_GEN is disabled.
> >
> > v1:
> > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > of 'enabled' and 'draining' flags across CPUs.
> > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > lists first, and then fall through to traditional LRU lists instead
> > of returning early. This ensures that folios are visible to at least
> > one reclaim path at any given time.
> >
> > Race & Mitigation
> > ================
> > A race window exists between checking the 'draining' state and performing
> > the actual list operations. For instance, a reclaimer might observe the
> > draining state as false just before it changes, leading to a suboptimal
> > reclaim path decision.
> >
> > However, this impact is effectively mitigated by the kernel's reclaim
> > retry mechanism (e.g., in do_try_to_free_pages). If a reclaimer pass fails
> > to find eligible folios due to a state transition race, subsequent retries
> > in the loop will observe the updated state and correctly direct the scan
> > to the appropriate LRU lists. This ensures the transient inconsistency
> > does not escalate into a terminal OOM kill.
> >
> > This effectively reduces the race window that previously triggered OOMs
> > under high memory pressure.
> >
> > This fix has been verified on v7.0.0-rc1; dynamic toggling of MGLRU
> > functions correctly without triggering unexpected OOM kills.
> >
> > To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > To: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> > To: Yuanchu Xie <yuanchu@xxxxxxxxxx>
> > To: Wei Xu <weixugc@xxxxxxxxxx>
> > To: Barry Song <21cnbao@xxxxxxxxx>
> > To: Jialing Wang <wjl.linux@xxxxxxxxx>
> > To: Yafang Shao <laoar.shao@xxxxxxxxx>
> > To: Yu Zhao <yuzhao@xxxxxxxxxx>
> > To: Kairui Song <ryncsn@xxxxxxxxx>
> > To: Bingfang Guo <bfguo@xxxxxxxxxx>
> > Cc: linux-mm@xxxxxxxxx
> > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > Signed-off-by: Leno Hou <lenohou@xxxxxxxxx>
> > ---
> > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > condition exists between the state switching and the memory reclaim path.
> > This can lead to unexpected cgroup OOM kills, even when plenty of
> > reclaimable memory is available.
> >
> > Problem Description
> > ==================
> > The issue arises from a "reclaim vacuum" during the transition.
> >
> > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > false before the pages are drained from MGLRU lists back to traditional
> > LRU lists.
> > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > and skip the MGLRU path.
> > 3. However, these pages might not have reached the traditional LRU lists
> > yet, or the changes are not yet visible to all CPUs due to a lack
> > of synchronization.
> > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > concludes there is no reclaimable memory, and triggers an OOM kill.
> >
> > A similar race can occur during enablement, where the reclaimer sees the
> > new state but the MGLRU lists haven't been populated via fill_evictable()
> > yet.
> >
> > Solution
> > ========
> > Introduce a 'switching' state (`lru_switch`) to bridge the transition.
> > When transitioning, the system enters this intermediate state where
> > the reclaimer is forced to attempt both MGLRU and traditional reclaim
> > paths sequentially. This ensures that folios remain visible to at least
> > one reclaim mechanism until the transition is fully materialized across
> > all CPUs.
> >
> > Changes
> > =======
> > v5:
> > - Rename lru_gen_draining to lru_gen_switching; lru_drain_core to
> > lru_switch
> > - Add more documentation for folio_referenced_one
> > - Keep folio_check_references unchanged
> > v4:
> > - Address Sashiko.dev's AI code-review comments
> > - Remove the patch that maintains workingset refault context across
> > MGLRU state transitions
> > - Remove the folio_lru_gen(folio) != -1 check introduced in the v2 patch
> >
> > v3:
> > - Rebase onto mm-new branch for queue testing
> > - Don't look around while draining
> > - Address Barry Song's review comments
> >
> > v2:
> > - Use a static branch `lru_drain_core` to track the transition
> > state.
> > - Ensure all LRU helpers correctly identify page state by checking
> > folio_lru_gen(folio) != -1 instead of relying solely on global flags.
> > - Maintain workingset refault context across MGLRU state transitions
> > - Fix build error when CONFIG_LRU_GEN is disabled.
> >
> > v1:
> > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > of 'enabled' and 'draining' flags across CPUs.
> > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > lists first, and then fall through to traditional LRU lists instead
> > of returning early. This ensures that folios are visible to at least
> > one reclaim path at any given time.
> >
> > Race & Mitigation
> > ================
> > A race window exists between checking the 'draining' state and performing
> > the actual list operations. For instance, a reclaimer might observe the
> > draining state as false just before it changes, leading to a suboptimal
> > reclaim path decision.
> >
> > However, this impact is effectively mitigated by the kernel's reclaim
> > retry mechanism (e.g., in do_try_to_free_pages). If a reclaimer pass fails
> > to find eligible folios due to a state transition race, subsequent retries
> > in the loop will observe the updated state and correctly direct the scan
> > to the appropriate LRU lists. This ensures the transient inconsistency
> > does not escalate into a terminal OOM kill.
> >
> > This effectively reduces the race window that previously triggered OOMs
> > under high memory pressure.
> >
> > This fix has been verified on v7.0.0-rc1; dynamic toggling of MGLRU
> > functions correctly without triggering unexpected OOM kills.
> >
> > Reproduction
> > ===========
> >
> > The issue was consistently reproduced on v6.1.157 and v6.18.3 using a
> > high-pressure memory cgroup (v1) environment.
> >
> > Reproduction steps:
> > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > and 8GB active anonymous memory.
> > 2. Toggle MGLRU state while performing new memory allocations to force
> > direct reclaim.
> >
> > Reproduction script
> > ===================
> >
> > ```bash
> > #!/bin/bash
> >
> > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> >
> > # Toggle MGLRU asynchronously so the switch races with reclaim below.
> > switch_mglru() {
> >     local orig_val
> >     orig_val=$(cat "$MGLRU_FILE")
> >     if [[ "$orig_val" != "0x0000" ]]; then
> >         echo n > "$MGLRU_FILE" &
> >     else
> >         echo y > "$MGLRU_FILE" &
> >     fi
> > }
> >
> > mkdir -p "$CGROUP_PATH"
> > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > echo $$ > "$CGROUP_PATH/cgroup.procs"
> >
> > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> >
> > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > sleep 5
> >
> > switch_mglru
> > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || \
> > echo "OOM Triggered"
> >
> > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > ```
> > ---
> > Changes in v5:
> > - Rename lru_gen_draining to lru_gen_switching; lru_drain_core to
> > lru_switch
> > - Add more documentation for folio_referenced_one
> > - Keep folio_check_references unchanged
> > - Link to v4: https://lore.kernel.org/r/20260318-b4-switch-mglru-v2-v4-1-1b927c93659d@xxxxxxxxx
> >
> > Changes in v4:
> > - Address Sashiko.dev's AI code-review comments
> > Link: https://sashiko.dev/#/patchset/20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321%40gmail.com
> > - Remove the patch that maintains workingset refault context across
> > MGLRU state transitions
> > - Remove the folio_lru_gen(folio) != -1 check introduced in the v2 patch
> > - Link to v3: https://lore.kernel.org/r/20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321@xxxxxxxxx
> > ---
>
> A bit odd: I've seen the v5, v4, and so on changelog many times,
> at least three times?
>
> I’m starting to suspect my eyes are broken.
>
> I guess we might have a changelog issue here?
> Otherwise,
> Reviewed-by: Barry Song <baohua@xxxxxxxxxx>
>
> Thanks
> Barry