Re: [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy()

From: Garg, Shivank

Date: Tue May 19 2026 - 03:50:42 EST

On 5/18/2026 9:31 PM, Borislav Petkov wrote:
> On Mon, May 18, 2026 at 10:43:22AM +0200, David Hildenbrand (Arm) wrote:
>> I was wondering whether optimizing memcpy() further would be of value elsewhere.
>>
>> Of course, we wouldn't want to degrade it :)
>>
>> Some direction from x86 folks would be nice.
>
> We can talk about it... :)
>
> With the proper perf numbers to back the changes up and if we don't regress
> others, we could look at improving the !FSRM situation.
>
> Make sure to CC Linus on those patches as he likes to look at optimizations
> there, as one can see from who changed things around that area.
>
> :-)
>

Hi Boris,

Sure, I'll keep Linus in CC.

So, the story so far:
- Current folio_copy()/folio_mc_copy() do not fully utilize the potential of
  rep movsb for large pages (up to 2X speedup with FSRM).
- On Zen 3 (no FSRM), memcpy() falls back to unrolled movq (memcpy_orig).
  Using rep movsq instead would give 1.4-2X speedups over memcpy_orig for
  blocks >= 1K.
- folio_copy() and the mc variant are asymmetric in their primitive selection
  (memcpy, rep movsq):

folio_copy -> copy_highpage() -> copy_page()
        = rep movsq                                         [REP_GOOD]
        = unrolled movq                                     [otherwise]

folio_mc_copy -> copy_mc_highpage() -> copy_mc_to_kernel()
        = copy_mc_fragile                                   [if enabled]
        = copy_mc_enhanced_fast_string (rep movsb)          [ERMS]
        = memcpy()                                          [otherwise]
                = rep movsb                                 [FSRM]
                = unrolled movq                             [otherwise]
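
To make the asymmetry concrete, here is a small userspace model of the
selection above. It is only a sketch: struct cpu_caps and both functions
are made-up names for illustration, not kernel code, and the real
selection is done via alternatives patching and runtime checks rather
than plain branches.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the conditions named in the diagram. */
struct cpu_caps {
	bool rep_good;		/* models X86_FEATURE_REP_GOOD */
	bool erms;		/* models X86_FEATURE_ERMS */
	bool fsrm;		/* models X86_FEATURE_FSRM */
	bool mc_fragile;	/* models "copy_mc_fragile enabled" */
};

/* Primitive reached from folio_copy() through copy_page(). */
static const char *folio_copy_primitive(const struct cpu_caps *c)
{
	return c->rep_good ? "rep movsq" : "unrolled movq";
}

/* Primitive reached from folio_mc_copy() through copy_mc_to_kernel(). */
static const char *folio_mc_copy_primitive(const struct cpu_caps *c)
{
	if (c->mc_fragile)
		return "copy_mc_fragile";
	if (c->erms)
		return "rep movsb";	/* copy_mc_enhanced_fast_string */
	/* memcpy() fallback */
	return c->fsrm ? "rep movsb" : "unrolled movq";
}
```

With only rep_good set, the model returns "rep movsq" for folio_copy but
"unrolled movq" for folio_mc_copy, which is the mismatch described above.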

Introducing a new "copy_pages()" helper can make this symmetric and
optimize the bulk path.
memcpy() and copy_page*() are fundamentally different in that memcpy()
callers don't guarantee alignment.
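
A rough userspace sketch of the difference between the per-page loop and
a batched helper, for illustration only. MODEL_PAGE_SIZE, copy_per_page()
and the body of copy_pages() are assumptions made up here (4K pages,
contiguously mapped range); this is not the proposed kernel
implementation. The alignment check stands in for the guarantee that
copy_page*() callers give and memcpy() callers do not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed 4K base pages */

/* Per-page loop: one copy call per page. */
static void copy_per_page(void *dst, const void *src, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++)
		memcpy((char *)dst + i * MODEL_PAGE_SIZE,
		       (const char *)src + i * MODEL_PAGE_SIZE,
		       MODEL_PAGE_SIZE);
}

/*
 * Hypothetical batched helper: a single call covers the whole range.
 * Returns -1 if either pointer is not page aligned, 0 on success.
 */
static int copy_pages(void *dst, const void *src, size_t nr_pages)
{
	if (((uintptr_t)dst | (uintptr_t)src) & (MODEL_PAGE_SIZE - 1))
		return -1;

	memcpy(dst, src, nr_pages * MODEL_PAGE_SIZE);
	return 0;
}
```

Both variants produce the same bytes; the batched one just hands the
copy primitive one large length instead of many page-sized ones.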

However, David suggested that it would be nice to avoid the new helper.
The direction from the thread is to improve memcpy() for the !FSRM case
(without regressing others). This would address the Zen 3 regression
we see when folio_copy() switches to bulk memcpy(), while keeping the
gains on FSRM CPUs.

What if we switch memcpy() to ALTERNATIVE_2, adding a rep movsq variant
for REP_GOOD between the FSRM path and the unrolled movq fallback?
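
As a rough model of what that would mean (portable userspace C, not the
actual assembly): the enum, pick_memcpy() and copy_movsq_model() below
are hypothetical names used only for illustration. Since memcpy() takes
arbitrary lengths, a rep movsq variant would presumably copy len / 8
quadwords and then handle the remaining len % 8 bytes separately; the
loops below only stand in for those instructions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum memcpy_variant {
	MEMCPY_REP_MOVSB,	/* FSRM */
	MEMCPY_REP_MOVSQ,	/* REP_GOOD, no FSRM (proposed) */
	MEMCPY_UNROLLED_MOVQ,	/* fallback (memcpy_orig) */
};

/* Proposed priority: FSRM first, then REP_GOOD, then the fallback. */
static enum memcpy_variant pick_memcpy(bool fsrm, bool rep_good)
{
	if (fsrm)
		return MEMCPY_REP_MOVSB;
	if (rep_good)
		return MEMCPY_REP_MOVSQ;
	return MEMCPY_UNROLLED_MOVQ;
}

/* Model of a rep movsq copy: quadwords first, then the byte tail. */
static void *copy_movsq_model(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t qwords = len >> 3;	/* would be the rep movsq count */
	size_t tail = len & 7;		/* leftover bytes */

	while (qwords--) {
		uint64_t q;

		memcpy(&q, s, sizeof(q));	/* alignment-safe 8-byte load */
		memcpy(d, &q, sizeof(q));	/* alignment-safe 8-byte store */
		d += sizeof(q);
		s += sizeof(q);
	}
	while (tail--)
		*d++ = *s++;

	return dst;
}
```

Whether the tail handling costs enough to matter for small sizes is one
of the things the numbers would have to show.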

I'll gather more data for the memcpy() experiment.

Thanks,
Shivank