Re: [PATCH v2 3/5] mm/shmem: introduce copy_zero_to_iter() for large zeroing

From: Mateusz Guzik

Date: Mon Jun 01 2026 - 11:47:37 EST

On Mon, Jun 1, 2026 at 5:14 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jun 01, 2026 at 05:02:01PM +0200, Mateusz Guzik wrote:
> > On Mon, Jun 01, 2026 at 02:22:04PM +0100, Matthew Wilcox wrote:
> > On amd64 some archeology shows the following:
> > 1. 0db7058e8e23e6bb ("x86/clear_user: Make it faster")
> >
> > 2022 vintage, replaces thoroughly terrible 8-byte per-iteration write
> > with rep stos usage
>
> That's a good candidate for fixing this problem. 56a8c8eb1eaf is from
> March 2022 and mentions the slowness of clear_user() on x86. So a
> commit from May 2022 might have fixed the problem without anyone going
> back to measure and remove the workaround.
>

OK, in that case I think that's basically settled.

> I'd be delighted if somebody tested just this patch. I'm not really set
> up for performance testing here.
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 3b5dc21b323c..112cae9f9e4f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3427,19 +3427,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> else
> ret = copy_page_to_iter(page, offset, nr, to);
> folio_put(folio);
> - } else if (user_backed_iter(to)) {
> - /*
> - * Copy to user tends to be so well optimized, but
> - * clear_user() not so much, that it is noticeably
> - * faster to copy the zero page instead of clearing.
> - */
> - ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
> } else {
> - /*
> - * But submitting the same page twice in a row to
> - * splice() - or others? - can result in confusion:
> - * so don't attempt that optimization on pipes etc.
> - */
> ret = iov_iter_zero(nr, to);
> }
>

I'm not in a position to test this myself at the moment. But if someone
is interested and tries this on amd64, either make sure the CPU has FSRS
or, in the worst case, apply this hack, which forces rep stosb:

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 20de34cc9aa6..552d1b883579 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -189,8 +189,7 @@ static __always_inline __must_check unsigned long __clear_user(void __user *addr
*/
asm volatile(
"1:\n\t"
- ALTERNATIVE("rep stosb",
- "call rep_stos_alternative", ALT_NOT(X86_FEATURE_FSRS))
+ "rep stosb\n"
"2:\n"
_ASM_EXTABLE_UA(1b, 2b)
: "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT

Using it is always fine from a correctness point of view. It would cause
a slowdown for any size on old yellers predating Sandy Bridge (that's 14
years now, I think?) and for small sizes on CPUs without FSRS. With
something like 4K I/O that's not a concern.