Re: [PATCH] mm/shmem: use invalidate_lock to fix hole-punch race
From: Pedro Falcato
Date: Thu Mar 26 2026 - 15:20:51 EST
On Thu, Mar 26, 2026 at 01:37:17PM -0500, Gregory Price wrote:
> On Thu, Mar 26, 2026 at 05:07:42PM +0000, Pedro Falcato wrote:
> > > Two races allow PTEs to be re-installed for a folio that fallocate
> > > is about to remove from page cache:
> >
> > Hmm, I don't see how your patch fixes anything.
> >
>
> after looking at your comments below i realized race 2 actually requires
> the fork as well, which means they're both essentially variations of the
> same race, so hopefully i can simplify the change log.
Well, then I don't see how changing shmem_fault() & map_mages() fixes fork.
>
> > > fallocate fault-around fork
> > > -------- ------------ ----
> > > set i_private
> > > unmap_mapping_range()
> > > # zaps PTEs
> > > filemap_map_pages()
> > > # re-maps folio!
> > > dup_mmap()
> > > # child VMA
> > > # in tree
> > > shmem_undo_range()
> > > lock folio
> > > unmap_mapping_folio()
> ^^^ i_mmap_lock_read held, iterates VMAs
> > spin_lock(ptl);
> ^^^ child VMA's PTL
> > > # child VMA:
> > > # no PTE, skip
> > spin_unlock(ptl);
> ^^^ child VMA done, iterator moves on
> it will not re-visit the child.
>
> > > copy_page_range()
> > spin_lock(dst_ptl);
> ^ Child PTL
> > spin_lock(src_ptl);
> ^ Parent PTL
> > /* does not copy PTE. either
> > * we find a zapped PTE, or unmap_mapping_folio()
> > * finds two mappings instead of one. */
>
> At this point, unmap_mapping_folio only processed the child VMA
> (no PTE, skip). The parent PTE *has not* been zapped.
>
> copy_page_range() acquires src_ptl (parent) and reads a present PTE,
> and boom copies it to child.
Sure, but can child - parent happen when traversing the i_mmap tree? I don't
think so? (in mm/mmap.c)
/* insert tmp into the share list, just after mpnt */
vma_interval_tree_insert_after(tmp, mpnt,
&mapping->i_mmap);
The function itself is somewhat straightforward - find the leftmost node at the
right of 'prev' (our parent) and link ourselves. So an in-order traversal should
always go parent - child. Unless there's some awful tree rotation that can
happen and screw us in the meanwhile.
>
> When it reaches the parent VMA next, it zaps the parent PTE,
> but the child PTE (just installed) survives.
>
> > >
> > > Fix both races with invalidate_lock.
> > >
> >
> > I don't see what you're seeing? Note that both map_pages and fault()
> > take the folio lock (map_pages does a trylock) to exclude against truncate
> > as well.
> >
>
> The folio lock serializes map_pages/fault against truncate - but the
> race isn't between those two. It's between truncate's unmap walk and
> fork's copy_page_range - and copy_page_range doesn't take folio lock.
If we observe everything parent - child, there is no way this is broken - if
fork observes the parent pte set, zap will have to observe parent *and* child,
since they hold the corresponding pte locks, and traversal is done in order.
If fork observes the parent pte as none, zap will have already traversed the
parent, and as such there will be no additional mapping of the folio.
If this is broken, then every filesystem out there using filemap_fault() and
filemap_fault_around() has to be broken, and I hope that's not true :p
_If_ there is indeed breakage here regarding tree rotations, I would suggest:
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36462..7b4e39063d67 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1833,12 +1833,12 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
vma_interval_tree_insert_after(tmp, mpnt,
&mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
- i_mmap_unlock_write(mapping);
}
if (!(tmp->vm_flags & VM_WIPEONFORK))
retval = copy_page_range(tmp, mpnt);
-
+ if (file)
+ i_mmap_unlock_write(mapping);
if (retval) {
mpnt = vma_next(&vmi);
goto loop_out;
which should protect against concurrent rmap.
--
Pedro