Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

From: Li Zhe

Date: Mon May 18 2026 - 05:02:26 EST

On Mon, 18 May 2026 09:23:33 +0300, rppt@xxxxxxxxxx wrote:

> Hi,
>
> On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > memmap_init_zone_device() can spend a substantial amount of time
> > initializing large ZONE_DEVICE ranges because it repeats nearly
> > identical struct page setup for every PFN.
> >
> > This series reduces that overhead in four steps.
> >
> > Testing
> > =======
> >
> > Tests were run in a VM on an Intel Ice Lake server.
> >
> > Two PMEM configurations were used:
> > - a 100 GB fsdax namespace configured with map=dev, which exercises
> > the nd_pmem rebind path (pfns_per_compound == 1)
> > - a 100 GB devdax namespace configured with align=2097152, which
> > exercises the dax_pmem rebind path (pfns_per_compound > 1)
> >
> > For each configuration, the corresponding driver was unbound and
> > rebound 30 times. Memmap initialization latency was collected from the
> > pr_debug() output of memmap_init_zone_device().
> >
> > The first bind is reported separately, and the average of subsequent
> > rebinds is used as the steady-state result.
> >
> > Performance
> > ===========
> > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > Base(v7.1-rc3):
> > First binding: 1486 ms
> > Average of subsequent rebinds: 273.52 ms
> > Full series:
> > First binding: 1272 ms
> > Average of subsequent rebinds: 104.59 ms
> >
> > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > Base(v7.1-rc3):
> > First binding: 1515 ms
> > Average of subsequent rebinds: 313.45 ms
> > Full series:
> > First binding: 1286 ms
> > Average of subsequent rebinds: 116.93 ms
>
> This is really good improvement!
>
> It would be also interesting to see how the template approach would improve
> "normal" memory map initialization.

I also experimented with this approach earlier. Unfortunately, in the
normal memory map initialization path, functions such as
deferred_free_pages() run shortly after struct page initialization,
and they both read and write members of the struct page.

Non-temporal stores via MOVNTI are primarily beneficial for streaming
write operations, where the cache lines written are not expected to be
reused by the CPU in the near future. In this case, however, data
written using MOVNTI is immediately accessed again through regular load
and store instructions. This results in an access pattern that resembles
a write-then-reuse workload rather than a pure streaming store.

Consequently, non-temporal stores do not deliver the expected reduction
in cache pollution, and using MOVNTI provides no measurable performance
benefit for this particular workload.
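
To illustrate the access pattern described above, here is a minimal
userspace sketch. It is not taken from the patches: fill_stream() and
touch_and_sum() are hypothetical names, and the long long array only
stands in for struct page fields. On x86-64 the fill uses the
_mm_stream_si64() intrinsic, which should compile to MOVNTI; on other
architectures it falls back to regular stores.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#endif

/* Hypothetical helper: fill dst[0..n) with val, bypassing the cache on x86-64. */
static void fill_stream(long long *dst, size_t n, long long val)
{
	assert(n == 0 || dst != NULL);
#if defined(__x86_64__)
	for (size_t i = 0; i < n; i++)
		_mm_stream_si64(&dst[i], val);	/* non-temporal store */
	_mm_sfence();	/* order the streaming stores before later accesses */
#else
	for (size_t i = 0; i < n; i++)
		dst[i] = val;	/* fallback: regular stores */
#endif
}

/*
 * Hypothetical consumer: read-modify-write every element right after the
 * fill. This mimics code that touches struct page members shortly after
 * initialization, so the lines just written must be brought back into
 * the cache.
 */
static long long touch_and_sum(long long *dst, size_t n)
{
	long long sum = 0;

	for (size_t i = 0; i < n; i++) {
		dst[i] += 1;
		sum += dst[i];
	}
	return sum;
}
```

Calling touch_and_sum() directly after fill_stream() on the same buffer
is the write-then-reuse case; a pure streaming workload would not touch
the buffer again soon after the fill.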

That said, a template-based approach on its own, without non-temporal
stores, can still accelerate initialization. Based on measurements from
this patchset, it should improve performance on the generic path by
roughly 10%. I would appreciate feedback on whether such an
optimization is still considered useful.
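
For readers who have not looked at the patches, the general idea of a
template-based initialization can be sketched as below. This is a
simplified userspace illustration, not the code from the series:
struct fake_page, FAKE_FLAG_RESERVED, init_per_pfn() and
init_from_template() are all hypothetical, and the real struct page
layout and per-PFN fields differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FAKE_FLAG_RESERVED 0x1ULL	/* hypothetical flag bit */

/* Hypothetical stand-in for struct page; not the kernel layout. */
struct fake_page {
	uint64_t flags;
	uint64_t pfn;		/* differs for every entry */
	void *pgmap;		/* identical across the range */
	int32_t refcount;
	int32_t mapcount;
};

/* Baseline: repeat the full setup for every PFN. */
static void init_per_pfn(struct fake_page *pages, uint64_t start_pfn,
			 size_t n, void *pgmap)
{
	assert(n == 0 || pages != NULL);
	for (size_t i = 0; i < n; i++) {
		memset(&pages[i], 0, sizeof(pages[i]));
		pages[i].flags = FAKE_FLAG_RESERVED;
		pages[i].pgmap = pgmap;
		pages[i].refcount = 1;
		pages[i].mapcount = -1;
		pages[i].pfn = start_pfn + i;
	}
}

/* Template: build the common state once, copy it, patch per-PFN fields. */
static void init_from_template(struct fake_page *pages, uint64_t start_pfn,
			       size_t n, void *pgmap)
{
	struct fake_page tmpl;

	assert(n == 0 || pages != NULL);
	memset(&tmpl, 0, sizeof(tmpl));
	tmpl.flags = FAKE_FLAG_RESERVED;
	tmpl.pgmap = pgmap;
	tmpl.refcount = 1;
	tmpl.mapcount = -1;

	for (size_t i = 0; i < n; i++) {
		memcpy(&pages[i], &tmpl, sizeof(tmpl));	/* bulk copy */
		pages[i].pfn = start_pfn + i;		/* per-PFN fixup */
	}
}
```

Both functions produce the same result; the second one replaces the
repeated field-by-field setup with one copy plus a small fixup per
entry.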

Thanks,
Zhe