Re: [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy()
From: David Hildenbrand (Arm)
Date: Mon May 18 2026 - 04:44:54 EST
On 5/14/26 07:17, Garg, Shivank wrote:
>
>
> On 5/12/2026 3:01 PM, David Hildenbrand (Arm) wrote:
>> On 4/27/26 16:20, Shivank Garg wrote:
>>> Rewrite folio_copy() and folio_mc_copy() as thin wrappers around new
>>> batched helpers copy_highpages() and copy_mc_highpages().
>>>
>>> The current implementations iterate copy_highpage() (or its #MC-aware
>>> variant) per 4 KB page. For a single 2 MB folio that loop runs 512
>>> times and pays, per page:
>>>
>>> - kmap_local_page() / kunmap_local()
>>> - cond_resched()
>>> - one invocation of the architecture copy_page()/memcpy() primitive
>>>
>>> The new helpers issue a single copy_mc_to_kernel()/memcpy() over
>>> the whole contiguous range when CONFIG_HIGHMEM is off and no
>>> architecture overrides (__HAVE_ARCH_COPY_HIGHPAGE) copy_highpage().
>>> HIGHMEM and arch overrides keep the existing per-page path.
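To make the difference between the two paths concrete, here is a minimal userspace sketch (not the patch itself). It models only the non-HIGHMEM case where the whole folio is contiguous in the direct map. It uses plain memcpy() in place of copy_page()/copy_mc_to_kernel() and leaves out kmap_local_page() and cond_resched(). SKETCH_PAGE_SIZE, copy_per_page() and copy_batched() are made-up names for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KB base pages */

/*
 * Per-page path: one copy call per 4 KB page, mirroring the shape of the
 * existing loop. Returns the number of copy calls issued.
 */
static size_t copy_per_page(void *dst, const void *src, size_t nr_pages)
{
	size_t calls = 0;

	assert(dst && src);
	for (size_t i = 0; i < nr_pages; i++) {
		memcpy((char *)dst + i * SKETCH_PAGE_SIZE,
		       (const char *)src + i * SKETCH_PAGE_SIZE,
		       SKETCH_PAGE_SIZE);
		calls++;
	}
	return calls;
}

/*
 * Batched path: a single copy over the whole contiguous range.
 * Only valid when source and destination are each virtually contiguous,
 * which is the !CONFIG_HIGHMEM assumption made above.
 */
static size_t copy_batched(void *dst, const void *src, size_t nr_pages)
{
	assert(dst && src);
	memcpy(dst, src, nr_pages * SKETCH_PAGE_SIZE);
	return 1;
}
```

For a 2 MB folio, copy_per_page(dst, src, 512) issues 512 copy calls while copy_batched(dst, src, 512) issues one; both leave identical bytes in dst.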
>>>
>>> Tested on dual-socket AMD EPYC 9655 (Zen 5) with a CXL.mem node.
>>> In-kernel folio_mc_copy() microbenchmark on 2 MB folios, source
>>> evicted from cache before each iteration and measured throughput:
>>>
>>> direction         baseline GB/s    optimized GB/s    speedup
>>> DRAM0 -> DRAM1    18.65 ± 1.37     38.03 ± 3.21      2.04x
>>> DRAM0 -> CXL      25.46 ± 2.89     39.29 ± 1.17      1.54x
>>> CXL -> DRAM0      20.61 ± 3.95     35.07 ± 0.62      1.70x
>>>
>>> End-to-end move_pages(2) throughput on anonymous 2 MB mTHP folios,
>>> 1 GB migrated per run:
>>>
>>> direction         baseline GB/s    optimized GB/s    speedup
>>> DRAM0 -> DRAM1     7.20 ± 0.03      8.01 ± 0.02      1.11x
>>> DRAM0 -> CXL      11.12 ± 0.15     13.07 ± 0.03      1.18x
>>> DRAM1 -> DRAM0     7.21 ± 0.02      7.95 ± 0.02      1.10x
>>> CXL -> DRAM0       9.10 ± 0.05      9.49 ± 0.01      1.04x
>>>
>>> On AMD EPYC 7713 (Zen 3 / Milan, REP_GOOD without FSRM/ERMS) the
>>> folio_copy() bulk path regresses because memcpy() falls through to
>>> memcpy_orig (an unrolled movq loop), which is slower than the
>>> per-page copy_page() (microcoded rep movsq) it replaces.
>>
>> Do you know what the reason for that fallback is? Could it be fixed (e.g., when
>> we detect page alignment or something like that)?
>>
>
> The fallback is gated on X86_FEATURE_FSRM in arch/x86/lib/memcpy_64.S:
>
> SYM_TYPED_FUNC_START(__memcpy)
>         ALTERNATIVE "jmp memcpy_orig", "", X86_FEATURE_FSRM
>         movq %rdi, %rax
>         movq %rdx, %rcx
>         rep movsb
>         RET
>
> AMD Zen 3 does not have FSRM, so it jumps to memcpy_orig (an unrolled movq loop).
>
>
> On v7.1.0-rc3, I measured these primitives and the kernel's actual memcpy()
> across three CPUs, using a kernel module that vmallocs 16 MB src/dst buffers
> and times each primitive for comparison.
> Numbers are mean throughput in GB/s ± SD% (SD as a percentage of the mean).
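For reference, mean ± SD% figures of this kind can be derived from per-iteration throughput samples roughly as sketched below. This assumes the sample standard deviation (n - 1 divisor), which the mail does not specify. struct sample_stats, summarize() and sketch_sqrt() are hypothetical names, and the hand-rolled square root is only there to keep the example free of libm.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of a set of throughput samples (all in GB/s). */
struct sample_stats {
	double mean;	/* arithmetic mean, GB/s */
	double sd_pct;	/* standard deviation as a percentage of the mean */
};

/* Newton iteration for sqrt(x); avoids linking against libm in this sketch. */
static double sketch_sqrt(double x)
{
	double r = x > 1.0 ? x : 1.0;

	if (x <= 0.0)
		return 0.0;
	for (int i = 0; i < 64; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/* Assumption: sample SD (n - 1 divisor); needs at least two samples. */
static struct sample_stats summarize(const double *gbps, size_t n)
{
	struct sample_stats st = { 0.0, 0.0 };
	double sum = 0.0, sq = 0.0;

	assert(gbps && n > 1);
	for (size_t i = 0; i < n; i++)
		sum += gbps[i];
	st.mean = sum / (double)n;

	for (size_t i = 0; i < n; i++) {
		double d = gbps[i] - st.mean;

		sq += d * d;
	}
	st.sd_pct = 100.0 * sketch_sqrt(sq / (double)(n - 1)) / st.mean;
	return st;
}
```

With this definition, an entry such as "18.36± 0.22%" would mean a standard deviation of about 0.04 GB/s around a mean of 18.36 GB/s.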
>
>
> 1.) AMD EPYC 7713 (Zen 3), Flags: rep_good only, no ERMS/FSRM:
>
> size  unrolled_movq GB/s±SD%  rep_movsq GB/s±SD%  kernel_memcpy GB/s±SD%
> ------------------------------------------------------------------------
> 16B    0.38± 8.73%             0.41± 0.43%         0.43± 0.31%
> 32B    0.85± 0.19%             0.80± 8.37%         0.84± 0.07%
> 64B    1.68± 0.35%             1.60± 0.03%         1.59± 9.37%
> 128B   3.23± 0.22%             3.04± 0.62%         3.19± 0.03%
> 256B   5.99± 5.78%             5.62± 4.15%         5.93± 0.42%
> 512B  10.07± 1.36%            10.49± 2.60%        10.02± 0.21%
> 1K    14.49± 0.09%            18.19± 0.37%        14.31± 3.48%
> 2K    17.11± 1.01%            28.04± 2.37%        18.14± 0.56%
> 4K    18.36± 0.22%            39.15± 0.50%        19.57± 1.14%
>
> - kernel_memcpy is tracking unrolled_movq.
> - rep_movsq is ~1.25x-2.1x faster than the unrolled_movq fallback for >= 1 KiB.
>
> 2.) On Intel(R) Xeon(R) Platinum 8362
> Flags: rep_good, erms, fsrm
>
> size  unrolled_movq GB/s±SD%  rep_movsq GB/s±SD%  rep_movsb GB/s±SD%  kernel_memcpy GB/s±SD%
> --------------------------------------------------------------------------------------------
> 16B    0.89± 0.93%             0.64± 0.10%         0.69± 0.57%         0.66± 3.52%
> 32B    2.08± 2.46%             1.28± 0.15%         1.38± 6.21%         1.33± 4.28%
> 64B    3.97± 2.26%             2.55± 0.24%         2.83± 0.22%         2.65± 4.48%
> 128B   7.45± 0.09%             5.00± 2.53%         5.48± 5.04%         5.30± 1.60%
> 256B  13.24± 0.01%             9.79± 0.57%        10.12± 0.37%         9.81± 0.34%
> 512B  21.67± 0.03%            17.87± 0.02%        18.43± 0.79%        17.81± 0.25%
> 1K    27.84± 1.96%            34.54± 1.24%        35.67± 1.88%        34.56± 2.49%
> 2K    32.67± 2.35%            59.58± 0.01%        65.67± 0.18%        59.35± 1.12%
> 4K    34.85± 0.64%            95.35± 0.00%        96.64± 0.69%        95.35± 0.00%
>
> - kernel_memcpy is using rep_movsb (FSRM in use).
> - Up to 512 B the unrolled movq loop is ~20-50% faster; from 1 KiB up, FSRM wins.
>
> 3.) On AMD EPYC 9655 96-Core Processor (Zen 5)
> Flags: rep_good, erms, fsrm
>
> size  unrolled_movq GB/s±SD%  rep_movsq GB/s±SD%  rep_movsb GB/s±SD%  kernel_memcpy GB/s±SD%
> --------------------------------------------------------------------------------------------
> 16B    0.53± 0.39%             0.53± 0.21%         0.55± 0.13%         0.53± 0.14%
> 32B    1.13± 1.49%             1.06± 0.07%         1.09± 0.16%         1.06± 0.09%
> 64B    2.21± 0.12%             2.13± 0.07%         2.18± 0.14%         2.13± 0.09%
> 128B   4.25± 0.12%             4.26± 0.10%         4.37± 0.12%         4.31± 0.14%
> 256B   8.01± 0.19%             8.61± 0.27%         8.61± 0.18%         8.51± 0.10%
> 512B  14.14± 0.18%            16.80± 0.24%        16.80± 0.23%        16.81± 0.24%
> 1K    22.93± 0.73%            31.70± 0.48%        32.37± 0.28%        32.02± 0.22%
> 2K    30.36± 0.27%            53.24± 1.01%        56.58± 0.22%        56.04± 0.22%
> 4K    35.05± 0.65%            80.25± 0.41%        83.90± 0.20%        76.23± 0.37%
>
> - kernel_memcpy is using rep_movsb (FSRM in use).
> - For smaller sizes, the unrolled movq loop is close enough to be within noise.
>
>
> Regarding the fix:
> One option is to make memcpy() fall back to rep movsq instead of the unrolled
> movq loop when FSRM is absent. The data shows the benefit on Zen 3. On
> Intel, however, the unrolled movq loop is faster for smaller sizes.
> I'm not sure whether adding this complexity to memcpy() would be welcome,
> but I'm happy to work on it if it is helpful.
I was wondering whether optimizing memcpy() further would be of value elsewhere.
Of course, we wouldn't want to degrade it :)
Some direction from x86 folks would be nice.
>
> Another option is to leave memcpy() untouched for this series and add
> a new copy_pages() helper that the folio copy path can use. It would
> use ALTERNATIVE_2 that picks rep movsb on ERMS/FSRM and rep movsq on
> REP_GOOD and per-page copy_page() loop as the final fallback.
That would fit the clear_pages() design we have. But if that's avoidable, that
would be nice.
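Just to spell out the selection order being proposed, a rough userspace sketch, assuming the three tiers described above. In the kernel this would be resolved once by alternatives patching at boot rather than by a runtime branch. struct cpu_caps, enum copy_strategy and pick_copy_strategy() are hypothetical names, not existing kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the CPU feature bits discussed in this thread. */
struct cpu_caps {
	bool erms_or_fsrm;	/* stands in for X86_FEATURE_ERMS / X86_FEATURE_FSRM */
	bool rep_good;		/* stands in for X86_FEATURE_REP_GOOD */
};

enum copy_strategy {
	COPY_REP_MOVSB,		/* single rep movsb over the whole range */
	COPY_REP_MOVSQ,		/* single rep movsq over the whole range */
	COPY_PER_PAGE,		/* per-page copy_page() loop, final fallback */
};

/*
 * Tiered choice: prefer rep movsb when ERMS/FSRM is present, then
 * rep movsq on REP_GOOD, otherwise keep the per-page loop.
 */
static enum copy_strategy pick_copy_strategy(const struct cpu_caps *caps)
{
	assert(caps != NULL);

	if (caps->erms_or_fsrm)
		return COPY_REP_MOVSB;
	if (caps->rep_good)
		return COPY_REP_MOVSQ;
	return COPY_PER_PAGE;
}
```

With the flags reported above, Zen 3 (rep_good only) would land on rep movsq, while the Xeon and Zen 5 machines would take rep movsb.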
--
Cheers,
David