Re: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)

From: T.J. Mercier

Date: Wed Mar 25 2026 - 20:10:53 EST

On Wed, Mar 25, 2026 at 2:07 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> The Problem
> -----------
>
> Memory reclaim in the kernel is a mess. We ship two completely separate
> eviction algorithms -- traditional LRU and MGLRU -- in the same file.
> mm/vmscan.c is over 8,000 lines. 40% of it is MGLRU-specific code that
> duplicates functionality already present in the traditional path. Every
> bug fix, every optimization, every feature has to be done twice or it
> only works for half the users. This is not sustainable. It has to stop.
>
> We should unify both algorithms into a single code path. In this path,
> both algorithms are a set of hooks called from that path. Everyone
> maintains, understands, and evolves a single codebase. Optimizations are
> now evaluated against -- and available to -- both algorithms. And the
> next time someone develops a new LRU algorithm, they can do so in a way
> that does not add churn to existing code.
>
> How We Got Here
> ---------------
>
> MGLRU brought interesting ideas -- multi-generation aging, page table
> scanning, Bloom filters, spatial lookaround. But we never tried to
> refactor the existing reclaim code or integrate these mechanisms into the
> traditional path. 3,300 lines of code were dumped as a completely
> parallel implementation with a runtime toggle to switch between the two.
> No attempt to evolve the existing code or share mechanisms between the
> two paths -- just a second reclaim system bolted on next to the first.
>
> To be fair, traditional reclaim is not easy to refactor. It has
> accumulated decades of heuristics trying to work for every workload, and
> touching any of it risks regressions. But difficulty is not an excuse.
> There was no justification for not even trying -- not attempting to
> generalize the existing scanning path, not proposing shared
> abstractions, not offering the new mechanisms as improvements to the code
> that was already there. Hard does not mean impossible, and the cost of
> not trying is what we are living with now.
>
> The Differences That Matter
> ---------------------------
>
> The two algorithms differ in how they classify pages, detect access, and
> decide what to evict. But most of these differences are not fundamental
> -- they are mechanisms that got trapped inside one implementation when
> they could benefit both. Not making those mechanisms shareable leaves
> potential free performance gains on the table.
>
> Access detection: Traditional LRU walks reverse mappings (RMAP) from the
> page back to its page table entries. MGLRU walks page tables forward,
> scanning process address spaces directly. Neither approach is inherently
> tied to its eviction policy. Page table scanning would benefit
> traditional LRU just as much -- it is cache-friendly, batches updates
> without the LRU lock, and naturally exploits spatial locality. There is
> no reason this should be MGLRU-only.
>
> Bloom filters and lookaround: MGLRU uses Bloom filters to skip cold
> page table regions and a lookaround optimization to scan adjacent PTEs
> during eviction. These are general-purpose optimizations for any
> scanning path. They are locked inside MGLRU today for no good reason.
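
For anyone following along, here is a rough userspace sketch of the
Bloom filter idea described above. It is not the MGLRU code. The names
(region_filter, filter_add, filter_test), the filter size, and the hash
mixing are all made up for illustration. The assumption is that a walk
records regions where it found young PTEs, and a later walk skips any
region the filter has definitely not seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes, not taken from mm/vmscan.c. */
#define FILTER_BITS	1024u		/* must be a power of two */
#define FILTER_WORDS	(FILTER_BITS / 64u)

struct region_filter {
	uint64_t bits[FILTER_WORDS];
};

/* Two cheap hashes of a region key (for example a PMD-aligned address). */
static inline uint32_t hash_a(uint64_t key)
{
	key *= 0x9E3779B97F4A7C15ull;
	return (uint32_t)(key >> 32) & (FILTER_BITS - 1);
}

static inline uint32_t hash_b(uint64_t key)
{
	key ^= key >> 33;
	key *= 0xFF51AFD7ED558CCDull;
	key ^= key >> 33;
	return (uint32_t)key & (FILTER_BITS - 1);
}

static inline void filter_clear(struct region_filter *f)
{
	memset(f->bits, 0, sizeof(f->bits));
}

/* Remember that this region was worth scanning. */
static inline void filter_add(struct region_filter *f, uint64_t key)
{
	uint32_t a = hash_a(key), b = hash_b(key);

	f->bits[a / 64] |= 1ull << (a % 64);
	f->bits[b / 64] |= 1ull << (b % 64);
}

/*
 * false: the region was never added, so the walk can skip it.
 * true:  the region may have been added (false positives are possible).
 */
static inline bool filter_test(const struct region_filter *f, uint64_t key)
{
	uint32_t a = hash_a(key), b = hash_b(key);

	return ((f->bits[a / 64] >> (a % 64)) & 1) &&
	       ((f->bits[b / 64] >> (b % 64)) & 1);
}
```

Nothing in this depends on how folios are classified, which I take to
be the point being made: the filter only answers "is this region worth
walking", so any scanning path could consult it.
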
>
> Lock-free age updates: MGLRU updates folio age using atomic flag
> operations, avoiding the LRU lock during scanning. Traditional reclaim
> can use the same technique to reduce lock contention.
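
A minimal sketch of what "atomic flag operations" can look like, using
C11 atomics in userspace rather than the kernel helpers. The fake_folio
type, the two-bit generation field, and folio_set_gen() are
hypothetical and only show the generic compare-and-swap pattern: the
age bits in a flags word are rewritten without taking any lock, and the
other bits are preserved.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical layout: the low bits of the flags word hold the age. */
#define GEN_BITS	2u
#define GEN_MASK	((1ul << GEN_BITS) - 1)

struct fake_folio {
	_Atomic unsigned long flags;
};

static inline unsigned long folio_gen(struct fake_folio *folio)
{
	return atomic_load_explicit(&folio->flags,
				    memory_order_relaxed) & GEN_MASK;
}

/*
 * Set the age bits to new_gen without a lock and return the old age.
 * On a failed compare-and-swap, "old" is reloaded with the current
 * value, so the loop retries against whatever a concurrent writer left.
 */
static inline unsigned long folio_set_gen(struct fake_folio *folio,
					  unsigned long new_gen)
{
	unsigned long old = atomic_load_explicit(&folio->flags,
						 memory_order_relaxed);
	unsigned long updated;

	do {
		updated = (old & ~GEN_MASK) | (new_gen & GEN_MASK);
	} while (!atomic_compare_exchange_weak_explicit(&folio->flags,
							&old, updated,
							memory_order_relaxed,
							memory_order_relaxed));

	return old & GEN_MASK;
}
```

The scanner never touches list linkage here. Moving the folio between
lists can be batched and done later under the lock, which is where the
contention savings would come from.
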
>
> Page classification: Traditional LRU uses two buckets
> (active/inactive). MGLRU uses four generations with timestamps and
> reference frequency tiers. This is the policy difference --
> how many age buckets and how pages move between them. Every other
> mechanism is shareable.
>
> Both systems already share the core reclaim mechanics -- writeback,
> unmapping, swap, NUMA demotion, and working set tracking. The shareable
> mechanisms listed above should join that common core. What remains after
> that is a thin policy layer -- and that is all that should differ between
> algorithms.
>
> The Fix: One Reclaim, Pluggable and Extensible
> -----------------------------------------------
>
> We need one reclaim system, not two. One code path that everyone
> maintains, everyone tests, and everyone benefits from. But it needs to
> be pluggable as there will always be cases where someone wants some
> customization for their specialized workload or wants to explore some
> new techniques/ideas, and we do not want to get into the current mess
> again.
>
> The unified reclaim must separate mechanism from policy. The mechanisms
> -- writeback, unmapping, swap, NUMA demotion, workingset tracking -- are
> shared today and should stay shared. The policy decisions -- how to
> detect access, how to classify pages, which pages to evict, when to
> protect a page -- are where the two algorithms differ, and where future
> algorithms will differ too. Make those pluggable.
>
> This gives us one maintained code path with the flexibility to evolve.
> New ideas get implemented as new policies, not as 3,000-line forks. Good
> mechanisms from MGLRU (page table scanning, Bloom filters, lookaround)
> become shared infrastructure available to any policy. And if someone
> comes up with a better eviction algorithm tomorrow, they plug it in
> without touching the core.
>
> Making reclaim pluggable implies we define it as a set of methods
> (let's call them reclaim_ops) hooking into a stable codebase we rarely
> modify. We then have two big questions to answer: what do these
> reclaim ops look like, and how do we move the existing code to the new
> model?
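
To make the first question concrete, here is a toy userspace sketch of
one possible shape. None of it exists upstream: struct reclaim_ops, its
fields, toy_folio, and both policies are invented for illustration, and
a real interface would need far more (access detection, batching,
per-memcg state). It only shows a shared scan loop calling into a thin
policy that differs in bucket count and aging rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
	unsigned int bucket;	/* age bucket, 0 is the coldest */
	bool referenced;	/* stand-in for an accessed bit */
};

/* Hypothetical hook table: the policy half of the split. */
struct reclaim_ops {
	const char *name;
	unsigned int nr_buckets;
	void (*age)(const struct reclaim_ops *ops, struct toy_folio *folio);
	bool (*should_evict)(const struct toy_folio *folio);
};

static bool coldest_unreferenced(const struct toy_folio *folio)
{
	return folio->bucket == 0 && !folio->referenced;
}

/* Two buckets: referenced folios go active, the rest go inactive. */
static void two_list_age(const struct reclaim_ops *ops,
			 struct toy_folio *folio)
{
	(void)ops;
	folio->bucket = folio->referenced ? 1 : 0;
	folio->referenced = false;
}

/* N buckets: referenced folios jump to the youngest, others age by one. */
static void multi_gen_age(const struct reclaim_ops *ops,
			  struct toy_folio *folio)
{
	if (folio->referenced)
		folio->bucket = ops->nr_buckets - 1;
	else if (folio->bucket > 0)
		folio->bucket--;
	folio->referenced = false;
}

static const struct reclaim_ops two_list_ops = {
	.name = "two-list",
	.nr_buckets = 2,
	.age = two_list_age,
	.should_evict = coldest_unreferenced,
};

static const struct reclaim_ops multi_gen_ops = {
	.name = "multi-gen",
	.nr_buckets = 4,
	.age = multi_gen_age,
	.should_evict = coldest_unreferenced,
};

/* Shared core: one loop, whichever policy is plugged in. */
static inline size_t reclaim_scan(const struct reclaim_ops *ops,
				  struct toy_folio *folios, size_t n)
{
	size_t evictable = 0;

	for (size_t i = 0; i < n; i++) {
		ops->age(ops, &folios[i]);
		if (ops->should_evict(&folios[i]))
			evictable++;
	}
	return evictable;
}
```

Even in a toy this small the hard part shows up: the ops table is only
as good as the split between reclaim_scan() and the hooks, and that
split is exactly what the two options below would discover in
different orders.
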
>
> How Do We Get There
> -------------------
>
> Do we merge the two mechanisms feature by feature, or do we prioritize
> moving MGLRU to the pluggable model then follow with LRU once we are
> happy with the result?
>
> Whichever option we choose, we do the work in small, self-contained
> phases. Each phase ships independently, each phase makes the code
> better, each phase is bisectable. No big bang. No disruption. No
> excuses.
>
> Option A: Factor and Merge
>
> MGLRU is already pretty modular. However, we do not know which
> optimizations are actually generic and which ones are only useful for
> MGLRU itself.
>
> Phase 1 -- Factor out just MGLRU into reclaim_ops. We make no functional
> changes to MGLRU. Traditional LRU code is left completely untouched at
> this stage.
>
> Phase 2 -- Merge the two paths one method at a time. Right now the code
> diverts control to MGLRU from the very top of the high-level hooks. We
> instead unify the algorithms starting from the very beginning of LRU and
> deciding what to keep in common code and what to move into a traditional
> LRU path.
>
> Advantages:
> - We do not touch LRU until Phase 2, avoiding churn.
> - Makes it easy to experiment with combining MGLRU features into
>   traditional LRU. We do not actually know which optimizations are
>   useful and which should stay in MGLRU hooks.
>
> Disadvantages:
> - We will not find out whether reclaim_ops exposes the right methods
>   until we merge the paths at the end. We will have to change the ops
>   if it turns out we need a different split. The reclaim_ops API will
>   be private and have a single user so it is not that bad, but it may
>   require additional changes.
>
> Option B: Merge and Factor
>
> Phase 1 -- Extract MGLRU mechanisms into shared infrastructure. Page
> table scanning, Bloom filter PMD skipping, lookaround, lock-free folio
> age updates. These are independently useful. Make them available to both
> algorithms. Stop hoarding good ideas inside one code path.
>
> Phase 2 -- Collapse the remaining differences. Generalize list
> infrastructure to N classifications (trad=2, MGLRU=4). Unify eviction
> entry points. Common classification/promotion interface. At this point
> the two "algorithms" are thin wrappers over shared code.
>
> Phase 3 -- Define the hook interface. Define reclaim_ops around the
> remaining policy differences. Layer BPF on top (reclaim_ext).
> Traditional LRU and MGLRU become two instances of the same interface.
> Adding a third algorithm means writing a new set of hooks, not forking
> 3,000 lines.
>
> Advantages:
> - We get signals on what should be shared earlier. We know every shared
>   method to be useful because we use it for both algorithms.
> - Can test LRU optimizations on MGLRU early.
>
> Disadvantages:
> - Slower, as we factor out both algorithms and expand reclaim_ops all
>   at once.
>
> Open Questions
> --------------
>
> - Policy granularity: system-wide, per-node, or per-cgroup?
> - Mechanism/policy boundary: needs iteration; get it wrong and we
>   either constrain policies or duplicate code.
> - Validation: reclaim quality is hard to measure; we need agreed-upon
>   benchmarks.
> - Simplicity: the end result must be simpler than what we have today,
>   not more complex. If it is not simpler, we failed.
> --
> 2.52.0
>

Hi Shakeel,

Nice outline; I'd be quite interested in this discussion. It's a
little difficult for me to imagine reclaim_ops getting us to
complete convergence, but it seems like a good way to start making
progress.

Unfortunately I got an LSFMM Invitation Decline, so I won't be there.
Take good notes. :)

-T.J.