Re: [PATCH v2 3/5] mm/shmem: introduce copy_zero_to_iter() for large zeroing

From: Matthew Wilcox

Date: Mon Jun 01 2026 - 11:21:31 EST

On Mon, Jun 01, 2026 at 05:02:01PM +0200, Mateusz Guzik wrote:
> On Mon, Jun 01, 2026 at 02:22:04PM +0100, Matthew Wilcox wrote:
> > On Mon, Jun 01, 2026 at 01:57:02PM +0800, Chi Zhiling wrote:
> > > Currently, holes larger than PAGE_SIZE cannot be handled because
> > > ZERO_PAGE is limited to a single page. Add copy_zero_to_iter() as a
> > > wrapper to support copying larger zero ranges to the iterator.
> >
> > I think Hugh put this optimisation in the wrong place, and you're
> > perpetuating that ;-)
> >
> > So perhaps we can start by moving this optimisation to lib/iov_iter.c?
> > And then you can redo your optimisation on top of that.
>
> This is a rather suspicious claim. If clear_user is indeed so terrible
> that it is faster to copy, the routine needs to get unfucked instead of
> the problem being worked around.

Oh, I agree. Putting it in lib/iov_iter.c means more people will see it
than if it's hidden in shmem.c.

> I can't speak for arm64 or other non-amd64 archs, maybe these are
> horrendously broken.
>
> On amd64 some archeology shows the following:
> 1. 0db7058e8e23e6bb ("x86/clear_user: Make it faster")
>
> 2022 vintage, replaces thoroughly terrible 8-byte per-iteration write
> with rep stos usage

That's a good candidate for fixing this problem. 56a8c8eb1eaf is from
March 2022 and mentions the slowness of clear_user() on x86. So a
commit from May 2022 might have fixed the problem without anyone going
back to measure and remove the workaround.

> 2. 8c9b6a88b7e2f33c ("x86: improve on the non-rep 'clear_user' function")
>
> inlines rep stosb at the callsite if the CPU has FSRS, otherwise
> falls back to a new routine which does 64-byte writes per loop iteration.
>
> FSRS is reasonably popular by now and chances are decent the test jig
> used by Chi has it.
>
> For a size like 4096 bytes, the 64-byte loop will be slower than rep
> movsb and even rep stosq. This needs to be patched and maybe I'll get
> around to doing the needful(tm) in a few days (it's not hard to write, but
> some care with testing is needed).
>
> I could not be bothered to check how the workaround showed up, but it
> definitely needs to be removed as opposed to being perpetuated.

I'd be delighted if somebody tested just this patch. I'm not really set
up for performance testing here.

diff --git a/mm/shmem.c b/mm/shmem.c
index 3b5dc21b323c..112cae9f9e4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3427,19 +3427,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			else
 				ret = copy_page_to_iter(page, offset, nr, to);
 			folio_put(folio);
-		} else if (user_backed_iter(to)) {
-			/*
-			 * Copy to user tends to be so well optimized, but
-			 * clear_user() not so much, that it is noticeably
-			 * faster to copy the zero page instead of clearing.
-			 */
-			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
 		} else {
-			/*
-			 * But submitting the same page twice in a row to
-			 * splice() - or others? - can result in confusion:
-			 * so don't attempt that optimization on pipes etc.
-			 */
 			ret = iov_iter_zero(nr, to);
 		}
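
For anyone with a suitable test box, the hole-read path touched above could be exercised from userspace with something like the sketch below. This is only an assumption about what a reasonable test looks like, not part of the series: read_shmem_hole() and its parameters are made-up names, and it relies on memfd_create() returning a shmem-backed file that is one large hole after ftruncate().

```c
/* Hypothetical userspace micro-benchmark; not part of the patch. */
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Read a never-written (all-hole) shmem file `passes` times through a
 * `buf_size`-byte user buffer.  Returns the total number of bytes read,
 * or -1 on error.  Elapsed wall-clock time is stored in *secs.
 */
static long long read_shmem_hole(size_t file_size, size_t buf_size,
				 int passes, double *secs)
{
	struct timespec t0, t1;
	long long total = 0;
	char *buf;
	int fd;

	assert(buf_size > 0 && passes > 0 && secs);

	fd = memfd_create("hole-bench", 0);	/* assumed shmem-backed */
	if (fd < 0)
		return -1;
	/* Extending with ftruncate() allocates no pages: all hole. */
	if (ftruncate(fd, (off_t)file_size) < 0) {
		close(fd);
		return -1;
	}
	buf = malloc(buf_size);
	if (!buf) {
		close(fd);
		return -1;
	}
	memset(buf, 0xff, buf_size);	/* fault the buffer in and poison it */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < passes && total >= 0; i++) {
		off_t off = 0;
		ssize_t n;

		/* Each pread() of a hole has the kernel zero a user buffer. */
		while ((n = pread(fd, buf, buf_size, off)) > 0) {
			off += n;
			total += n;
		}
		if (n < 0)
			total = -1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Sanity check: the poison must have been overwritten with zero. */
	if (total > 0 && buf[0] != 0)
		total = -1;

	*secs = (double)(t1.tv_sec - t0.tv_sec) +
		(double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	free(buf);
	close(fd);
	return total;
}
```

Calling it from a small main() with, say, a 1GiB file and a 4096-byte or larger buffer, and dividing the returned byte count by *secs on kernels with and without the patch, should show whether clear_user() still loses to copying the zero page on a given machine. The sizes are arbitrary; varying buf_size is probably the interesting axis.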