Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag

From: Christian Brauner

Date: Thu May 21 2026 - 07:55:22 EST

On Tue, May 19, 2026 at 01:53:29PM -0700, Minchan Kim wrote:
> On Sat, May 16, 2026 at 09:31:04AM -0700, Linus Torvalds wrote:
> > On Fri, 15 May 2026 at 22:47, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > >
> > > Regarding proc_mem_open(), it actually operates very close to what you suggested.
> > > It acquires a reference to the mm_struct itself via mmgrab() but immediately
> > > unpins the address space memory via mmput(). Thus, no long-term mm_users
> > > reference is held across the open file descriptor.
> >
> > Ahh, and we've actually done that since 2012. How time flies..
> >
> > > The latency issue occurs during seqfile iteration (m_start/m_stop) in
> > > smaps/maps, or during get_cmdline() and ptrace_access_vm(), where the reader
> > > temporarily acquires mm_users via mmget_not_zero() or get_task_mm().
> >
> > Ok, so it's that much smaller region.
> >
> > How about a completely different approach then - make exit_mmap() just
> > take the mmap_write_lock() properly, and allow walking the vma's
> > without ever grabbing mm_users at all?
> >
> > IOW, just a mm_count ref would be sufficient to hold the mm_struct
> > around, and then the read-lock protects against exit_mm() actually
> > tearing the list down when the last "real" user goes away.
> >
> > [ exit_mmap() is currently a bit odd - it does take the mmap_write_lock(),
> > but it *starts* with the read-lock.
> >
> > I'm not sure why it does that - it used to do the write lock over
> > the whole sequence, but that was changed in commit bf3980c85212 ("mm:
> > drop oom code from exit_mmap").
> >
> > Sure, read-lock allows more concurrency, but that would seem to be a
> > complete non-issue for exit_mmap(), and switching locking seems to
> > just complicate things.
> >
> > But that's a separate issue that I just happened to notice while
> > looking at this ]
> >
> > I may be missing something else again.
>
> Hi Linus,
>
> Sorry for the slow response.
> Thank you for the incredibly detailed feedback and the suggestions.
>
> Your proposal to avoid mm_users and synchronize via mmap_lock is an elegant
> conceptual cleanup. However, from the perspective of userspace OOM recovery,
> we hit two critical roadblocks that this alone cannot resolve:
>
> First, the -ESRCH race remains unsolved.
> Even if we don't grab mm_users, the victim process still clears its task->mm
> to NULL early in exit_mm(). Here is the timing mismatch:
>
> CPU A (Userspace OOM Killer)          CPU B (Victim Task)
> ----------------------------          -------------------
> 1. Sends SIGKILL
>                                       2. Victim receives SIGKILL
>                                          do_exit()
>                                            exit_mm()
>                                              task->mm = NULL <==== (Stops pinning mm)
>                                              mmput()
> 3. Calls process_mrelease()
>    (Looks up task->mm)
>    (Sees NULL)
>    Returns -ESRCH! <======================================== (Reaping fails!)
>
> Without Jann's patch to preserve the mm pointer via task->exit_mm, the
> userspace killer won't even have a chance to attempt reaping.
>
> Second, the latency bottleneck transfers from mmput() to mmap_lock.
> If a low-priority procfs reader is preempted or stalled while holding the
> mmap_read_lock, the exiting process calling exit_mmap() will block indefinitely
> when trying to acquire the mmap_write_lock.
>
> Crucially, if this lock contention occurs, process_mrelease() itself would
> also block on the same mmap_lock while trying to reap the memory, defeating the
> synchronous and expedited nature of the API.
>
> [An Alternative Proposal: Combining Kill and Reap via pidfd_send_signal()]
>
> Taking a step back, I believe the fundamental issue stems from separating
> the asynchronous "Kill" and synchronous "Reap" operations into two distinct
> system calls. Because userspace cannot predict when the victim will execute
> exit_mm(), the timing mismatch is practically unavoidable, so the reaping
> ends up failing.
>
> Since Christian understandably dislikes combining signaling semantics into
> process_mrelease(), perhaps we could solve this from the signal side.
>
> What if we introduce a new flag for pidfd_send_signal(), such as
> PIDFD_SIGNAL_PROCESS_GROUP_EXPEDITE?
>
> When invoked with this flag and SIGKILL, pidfd_send_signal() would deliver the
> fatal signal and immediately trigger the oom_reaper's VM zapping on the target
> mm within the same synchronous syscall context (where task->mm is guaranteed to
> be valid and easily locked).

Maybe. We would need to see what that actually looks like.

>
> This would completely eliminate the -ESRCH race by making the kill-and-reap
> operation atomic from userspace's perspective, while keeping each syscall
> focused strictly on its primary responsibility (signaling vs. reclaiming).
>
> Honestly, if we adopt this atomic interface, it might actually make the
> separate process_mrelease() syscall obsolete. I am not entirely sure about
> the historical reasons why they were split into two distinct APIs
> in the first place, but merging them into a single pidfd-based atomic
> operation seems much cleaner.
>
> I would highly appreciate everyone's thoughts on this perspective and
> alternative direction.
>
> >
> > Also, I do really hate the smap code. People have optimized it because
> > it's so piggy, but that code is still just silly. The "rollup" case in
> > particular knows how bad it is, and does that whole "unlock and relock
> > under contention" because it knows it's a horrible latency pig.
>
> And yes, I completely agree with your frustration on the smaps code; it is
> indeed a massive latency pig. In fact, userspace tools have increasingly moved
> away from smaps and even PSS (Proportional Set Size) altogether because they
> are simply too slow to be usable in production.
>
> >
> > Oh well. But it really feels like we *could* do this all without mm_users. No?
> >
> > Linus
> >