Re: [PATCH v2 0/7] mm: speed up ZONE_DEVICE memmap initialization

From: Li Zhe

Date: Fri May 22 2026 - 03:51:32 EST

On Thu, 21 May 2026 15:20:29 -0700, akpm@xxxxxxxxxxxxxxxxxxxx wrote:

> On Thu, 21 May 2026 12:01:17 +0800 "Li Zhe" <lizhe.67@xxxxxxxxxxxxx> wrote:
>
> > memmap_init_zone_device() can spend a substantial amount of time
> > initializing large ZONE_DEVICE ranges because it repeats nearly
> > identical struct page setup for every PFN.
> >
> > This series reduces that overhead in seven steps.
>
> Cool, thanks, we all love speedups.
>
> > The first patch factors the reusable pieces out of
> > __init_zone_device_page() so later patches can share the same logic
> > without changing the existing slow path.
> >
> > The second patch adds set_page_section_from_pfn(), so generic callers
> > can update section bits from a PFN without open-coding
> > SECTION_IN_PAGE_FLAGS checks.
> >
> > The third patch adds a template-based fast path for ZONE_DEVICE head
> > pages. Instead of rebuilding the same struct page state for every PFN,
> > it prepares a reusable page template once and copies it to each
> > destination page.
> >
> > The fourth patch extends the same template-based approach to compound
> > tails, so pfns_per_compound > 1 can also benefit from the fast path.
> >
> > The fifth patch introduces memcpy_streaming() and
> > memcpy_streaming_drain() as a generic interface for write-once
> > streaming copies, with a memcpy() fallback for architectures that do
> > not provide a specialized backend.
> >
> > The sixth patch extends x86 memcpy_flushcache() small fixed-size
> > fastpaths so struct-page-sized streaming copies can stay on the inline
> > path.
> >
> > The last patch switches the zone-device template-copy path over to
> > memcpy_streaming(). It refreshes PFN-dependent fields in the reusable
> > template before each copy, keeps pageblock-aligned PFNs on regular
> > memcpy(), and drains streaming stores before later normal stores update
> > overlapping or dependent metadata.
> >
> > The optimized path is disabled when the page_ref_set tracepoint is
> > enabled, sanitized builds remain on the slow path so their
> > instrumented stores are preserved, and the fast path falls back to the
> > existing slow path if sizeof(struct page) is not an integral number of
> > u64 words.
> >
> > Testing
> > =======
> >
> > Tests were run in a VM on an Intel Ice Lake server.
> >
> > Two PMEM configurations were used:
> > - a 100 GB fsdax namespace configured with map=dev, which exercises
> > the nd_pmem rebind path (pfns_per_compound == 1)
> > - a 100 GB devdax namespace configured with align=2097152, which
> > exercises the dax_pmem rebind path (pfns_per_compound > 1)
> >
> > For each configuration, the corresponding driver was unbound and
> > rebound 30 times. Memmap initialization latency was collected from the
> > pr_debug() output of memmap_init_zone_device().
> >
> > The first bind is reported separately, and the average of subsequent
> > rebinds is used as the steady-state result.
>
> How closely does this workload resemble any real-world user workload?

Not directly. The unbind/rebind loop is mainly a controlled and
repeatable way to measure the memmap_init_zone_device() path with minimal
unrelated noise.

> > Performance
> > ===========
> >
> > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > Base(v7.1-rc3):
> > First binding: 1486 ms
> > Average of subsequent rebinds: 273.52 ms
> > With patches 1-3 applied:
> > First binding: 1422 ms
> > Average of subsequent rebinds: 245.73 ms
> > Full series:
> > First binding: 1389 ms
> > Average of subsequent rebinds: 111.08 ms
> >
> > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > Base(v7.1-rc3):
> > First binding: 1515 ms
> > Average of subsequent rebinds: 313.45 ms
> > With patches 1-4 applied:
> > First binding: 1422 ms
> > Average of subsequent rebinds: 256.56 ms
> > Full series:
> > First binding: 1294 ms
> > Average of subsequent rebinds: 110.24 ms
>
> The improvements appear to range between "modest" and "large", but what
> I'd like to understand is how frequently real-world users are using
> these operations in real-world workloads.
>
> IOW, (and this is always the bottom line), how valuable is this
> patchset to our users?

This is not a steady-state data-path optimization. Its value is in pmem
bring-up paths. In our deployment we have scenarios where multiple pmem
devices are hotplugged, so reducing this latency is useful to us in
practice.

> > mm: factor zone-device page init helpers out of
> > __init_zone_device_page
> > mm: add a set_page_section_from_pfn() helper
> > mm: add a template-based fast path for zone-device page init
> > mm: extend the template fast path to zone-device compound tails
> > string: introduce memcpy_streaming() helpers
> > x86/string: extend memcpy_flushcache() fixed-size fastpaths
> > mm: use memcpy_streaming() in zone-device template copies
> >
> > arch/x86/include/asm/string_64.h | 100 +++++++++++++---
> > include/linux/mm.h | 19 ++-
> > include/linux/string.h | 18 +++
> > mm/mm_init.c | 198 +++++++++++++++++++++++++++----
> > 4 files changed, 294 insertions(+), 41 deletions(-)
>
> I won't take any action at this stage - let's await reviewer input. If
> none is forthcoming then please remind me and I'll figure out what to
> do.
>
> The ever-present reviewer called "Sashiko" has thoughts to offer:
>
> https://sashiko.dev/#/patchset/20260521040124.10608-1-lizhe.67@xxxxxxxxxxxxx
>
> Please take a look, decide if there's useful material in there.

There is useful material there, mainly around patches 5 and 6.

The memcpy_streaming() x86 backend should be narrower, and the expanded
memcpy_flushcache() small-copy fastpath should be limited to naturally
aligned cases and should preserve forward movnti store order.
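
To make the alignment and ordering constraints concrete, here is a rough
userspace sketch. It is not the patch code and not the kernel's
memcpy_flushcache(): the helper names copy64_streaming() and
copy_streaming_drain() and the fixed 64-byte size are assumptions made
for illustration only. It issues non-temporal 8-byte stores in ascending
address order when the destination is 8-byte aligned, and otherwise
falls back to plain memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#define HAVE_MOVNTI 1
#else
#define HAVE_MOVNTI 0
#endif

/*
 * Hypothetical helper: copy exactly 64 bytes.
 * Returns 1 if the streaming (movnti) path was used, 0 if it fell
 * back to memcpy().
 */
static int copy64_streaming(void *dst, const void *src)
{
	assert(dst && src);
#if HAVE_MOVNTI
	/* Only take the streaming path for naturally aligned 8-byte stores. */
	if (((uintptr_t)dst & 7) == 0) {
		uint64_t w[8];
		long long *d = dst;
		size_t i;

		/* Load through memcpy() so the source may be unaligned. */
		memcpy(w, src, sizeof(w));

		/* Forward order: lowest address first, highest last. */
		for (i = 0; i < 8; i++)
			_mm_stream_si64(&d[i], (long long)w[i]);
		return 1;
	}
#endif
	memcpy(dst, src, 64);
	return 0;
}

/*
 * Hypothetical drain: order earlier streaming stores before any later
 * normal stores. A no-op where the streaming path is not compiled in.
 */
static void copy_streaming_drain(void)
{
#if HAVE_MOVNTI
	_mm_sfence();
#endif
}
```

A caller would invoke copy_streaming_drain() once after a batch of
copy64_streaming() calls, before any normal store that touches the same
or dependent data. On non-x86 builds the sketch degrades to memcpy().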

I'll address those points in the next revision and rerun the benchmarks.

Thanks,
Zhe