Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry

From: Lorenzo Stoakes

Date: Tue May 26 2026 - 12:15:21 EST

On Wed, May 20, 2026 at 04:03:03PM +0200, David Hildenbrand (Arm) wrote:
> On 5/20/26 15:48, David Hildenbrand (Arm) wrote:
> > On 5/20/26 14:53, Mike Rapoport wrote:
> >> On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
> >>>
> >>> As this is all uffd specific, I wonder whether that should be "struct
> >>> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.
> >>
> >> It's local to this file, I wouldn't worry about namespacing until there's
> >> an actual need if it will ever arise.
> >
> > Indeed, fair enough.
> >
> >>
> >>> From a high level, this LGTM.
> >>>
> >>> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
> >>> sequence counter that we simply increment on any VMA changes.
> >>
> >> Do you mean per-VMA?
> >> Per-MM counter would capture unrelated changes, e.g. an masvise() for
> >> unrelated range.
> >
> > I was thinking about a per-MM thing for simplicity. If there were any changes,
> > we'd retry (-EAGAIN).
> >
> > IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.
> >
> > Not sure if unrelated changes would really be a problem in practice (another
> > thread gabbing the mmap_lock in write mode until we serviced the fault).
>
> Assume something like the following:
>
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 180bad42fc79..608bb3c3a613 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -448,9 +448,12 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
> {
> const struct vm_uffd_ops *orig_ops = vma_uffd_ops(state->vma);
> unsigned long src_addr = state->src_addr;
> + unsigned int seq;
> void *kaddr;
> int err;
>
> + seq = raw_read_seqcount(&mm->mm_lock_seq);
> +
> /* retry copying with mm_lock dropped */
> mfill_put_vma(state);
>
> @@ -472,7 +475,7 @@ static int mfill_copy_folio_retry(struct mfill_state *state,
> * (e.g. replaced with a hugetlb mapping), making the caller's
> * ops pointer stale.
> */
> - if (vma_uffd_ops(state->vma) != orig_ops)
> + if (read_seqcount_retry(&mm->mm_lock_seq, seq))
> return -EAGAIN;
>
> err = mfill_establish_pmd(state);
>
>
> In case there is concurrent mmap_write lock, we'd retry. But we triggered and
> resolved the fault. So the retry would just succeed in copying from the page,
> not triggering another fault here.
>
> So I'd not expect this to matter in practice?
>
>
> We'd have to implement &mm->mm_lock_seq also without per-VMA locks. Which
> wouldn't be too bad given that Dave proposed a patch set to just enable per-VMA
> locks unconditionally.

That set unfortunately is broken for its ostensible purpose AFAICT, but the
idea of making per-VMA locks permanently available is good and I hope that
we land that at some point.

>
> As a hotfix, what you have is probably better, but a simplification like this
> might later make sense.

Hmmm, but the proposed logic absolutely tolerates changes to the VMA that
does not alter the fundamental properties being checked for, so this is an
entirely different check no?

I don't think we can both say that we only care about properties X, Y, and
Z changing and simultaneously tracking whether a VMA write lock was applied
since we dropped the lock?

Also a VMA write lock having been obtained doesn't mean the VMA was necessarily
changed?

And then couldn't -EAGAIN cause spurious write locks to abort the operation
altogether?

Given reacquisition of the lock will fall back to waiting on the
mmap_read_lock() if the VMA was write locked, that makes a race very
possible.

I mean maybe we want to tolerate all of the above and just move to a safer
world where we conservatively consider any VMA write lock acquisition to be
one in which we abort the operation, which I think would be nicer...

>
> --
> Cheers,
>
> David

Cheers, Lorenzo