Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space

From: Suren Baghdasaryan

Date: Mon Mar 23 2026 - 13:38:09 EST


On Mon, Mar 23, 2026 at 10:06 AM Lorenzo Stoakes (Oracle)
<ljs@xxxxxxxxxx> wrote:
>
> On Mon, Mar 23, 2026 at 09:19:04AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Mar 23, 2026 at 6:43 AM Lorenzo Stoakes (Oracle) <ljs@xxxxxxxxxx> wrote:
> > >
> > > On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> > > > When shrinking a lruvec, MGLRU pins the address space before
> > > > walking it. This is excessive since all it needs for the walk is
> > > > a stable mm_struct, so it can take and release mmap_read_lock, and
> > > > a stable mm->mm_mt tree to walk. This address space pinning results
> > >
> > > Hmm, I guess exit_mmap() calls __mt_destroy(), but that'll just destroy
> > > the allocated state and leave the tree empty, right? So traversal of that
> > > tree at that point would just do nothing?
> >
> > Correct. And __mt_destroy() happens under mmap_write_lock while
> > traversal happens under mmap_read_lock, so they should not race.
>
> Yeah that's fair.
>
> >
> > >
> > > > in delays when releasing the memory of a dying process. It also
> > > > prevents mm reapers (both the in-kernel oom-reaper and userspace
> > > > process_mrelease()) from doing their job during an MGLRU scan,
> > > > because they check task_will_free_mem(), which yields a negative
> > > > result due to the elevated mm->mm_users.
> > > >
> > > > Replace the unnecessary address space pinning with mm_struct
> > > > pinning by replacing the mmget/mmput calls with mmgrab/mmdrop.
> > > > mm_mt is embedded in mm_struct itself, so it won't be freed as
> > > > long as the mm_struct is stable, and it won't change during the
> > > > walk because mmap_read_lock is held.
> > > >
> > > > Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
>
> Given you have cleared up my concerns, this LGTM, so:
>
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@xxxxxxxxxx>

Thanks!

>
> > > > ---
> > > > mm/vmscan.c | 5 +++--
> > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 33287ba4a500..68e8e90e38f5 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2863,8 +2863,9 @@ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk)
> > > > return NULL;
> > >
> > > Not related to this series, but I really don't like how coupled MGLRU is to
> > > the rest of the 'classic' reclaim code.
> > >
> > > Right in the middle of vmscan you walk into generic mm walker logic, and the
> > > only hint that it's MGLRU is that you see lru_gen_xxx stuff (I'm also annoyed
> > > that we call it MGLRU but it's named lru_gen_xxx in the kernel :)
> >
> > I don't have a strong opinion on this. Perhaps the naming can be
> > changed outside of this series.
>
> I was thinking more of a new file for mglru :>)
>
> I believe we also need some more active maintainership... but that's
> another issue ;)

Yes, I volunteered a developer on my team to actively review and
support MGLRU. Over time I hope he will be in a position to help with
maintenance.

>
> >
> > >
> > > >
> > > > clear_bit(key, &mm->lru_gen.bitmap);
> > > > + mmgrab(mm);
> > >
> > > Is the mm somehow pinned here, or, on destruction, would it be removed
> > > from the mm list, meaning that we can safely assume we have something
> > > sane in mm-> to grab? I guess this must already have been the case for
> > > mmget_not_zero() to have been used before, though.
> >
> > Yes, mm is stable because it's fetched from mm_list. When mm is added
> > to this list via lru_gen_add_mm(mm) it is referenced and that
> > reference is dropped only after lru_gen_del_mm(mm) removes the mm from
> > this list (see https://elixir.bootlin.com/linux/v7.0-rc4/source/kernel/fork.c#L1185
> > and https://elixir.bootlin.com/linux/v7.0-rc4/source/kernel/fork.c#L1187).
> > Addition, removal, and retrieval from that list all happen under
> > mm_list->lock, which prevents races.
>
> Ack, thanks!
>
> >
> > >
> > > >
> > > > - return mmget_not_zero(mm) ? mm : NULL;
> > > > + return mm;
> > > > }
> > > >
> > > > void lru_gen_add_mm(struct mm_struct *mm)
> > > > @@ -3064,7 +3065,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
> > > > reset_bloom_filter(mm_state, walk->seq + 1);
> > > >
> > > > if (*iter)
> > > > - mmput_async(*iter);
> > > > + mmdrop(*iter);
> > >
> > > This will now be a blocking call that could free the mm (via __mmdrop()),
> > > so could take a while, is that ok?
> >
> > mmdrop() should not be a heavy-weight operation. It simply destroys
> > the metadata associated with mm_struct. mmput() OTOH will call
> > exit_mmap() if it drops the last reference and that can take a while
> > because that's when we free the memory of the process. I believe
> > that's why mmput_async() was used here.
>
> Yeah that's fair enough! Thanks.
>
> >
> > >
> > > If the code was previously deferring work here intentionally, doesn't
> > > that imply that being slow here might be an issue, somehow? Or was it
> > > just because they could? :)
> >
> > I think the reason was the possibility of calling mmput() -> __mmput()
> > -> exit_mmap(mm) which could indeed block us for a while.
>
> Yeah fair :)
>
> >
> > >
> > > >
> > > > *iter = mm;
> > > >
> > > >
> > > > base-commit: 8c65073d94c8b7cc3170de31af38edc9f5d96f0e
> > > > --
> > > > 2.53.0.1018.g2bb0e51243-goog
> > > >
> > >
> > > Thanks, Lorenzo
>
> Cheers, Lorenzo