Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

From: Li Zhe

Date: Wed May 20 2026 - 08:01:42 EST


On Wed, 20 May 2026 09:20:18 +0300, rppt@xxxxxxxxxx wrote:

> On Mon, May 18, 2026 at 04:57:00PM +0800, Li Zhe wrote:
> > On Mon, 18 May 2026 09:23:33 +0300, rppt@xxxxxxxxxx wrote:
> > > On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > > >
> > > > Performance
> > > > ===========
> > > > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > > > Base(v7.1-rc3):
> > > > First binding: 1486 ms
> > > > Average of subsequent rebinds: 273.52 ms
> > > > Full series:
> > > > First binding: 1272 ms
> > > > Average of subsequent rebinds: 104.59 ms
> > > >
> > > > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > > > Base(v7.1-rc3):
> > > > First binding: 1515 ms
> > > > Average of subsequent rebinds: 313.45 ms
> > > > Full series:
> > > > First binding: 1286 ms
> > > > Average of subsequent rebinds: 116.93 ms
> > >
> > > This is a really good improvement!
> > >
> > > It would also be interesting to see how the template approach would
> > > improve "normal" memory map initialization.
> >
> > I also experimented with this approach earlier. Unfortunately, in the
> > normal memory map initialization path, functions such as
> > deferred_free_pages() are invoked shortly after struct page
> > initialization, and these functions perform both read and write
> > accesses to members of the struct page.
> >
> > Non-temporal stores via MOVNTI are primarily beneficial for streaming
> > write operations, where the cache lines written are not expected to be
> > reused by the CPU in the near future. In this case, however, data
> > written using MOVNTI is immediately accessed again through regular load
> > and store instructions. This results in an access pattern that resembles
> > a write-then-reuse workload rather than a pure streaming store.
> >
> > Consequently, non-temporal stores do not deliver the expected reduction
> > in cache pollution, and using MOVNTI provides no measurable performance
> > benefit for this particular workload.
>
> We can split initialization and freeing into separate loops if there is
> an overall benefit, but this needs to be verified on other major
> architectures as well.

I agree with your point.
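
For illustration only, here is a minimal userspace sketch of what such a
split could look like. It is not kernel code: struct fake_page, init_one()
and init_then_release() are hypothetical names, and the second loop merely
stands in for the read/write accesses done by functions such as
deferred_free_pages(). On x86-64 the first loop uses MOVNTI through
_mm_stream_si64(); on other architectures it falls back to a plain struct
copy. Whether the split helps at all is exactly what would still need to
be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#endif

/* Hypothetical stand-in for struct page: eight 64-bit words. */
struct fake_page {
	uint64_t flags;
	uint64_t pfn;
	uint64_t refcount;
	uint64_t pad[5];
};

/* Copy the template into *dst, patching only the per-page pfn. */
static void init_one(struct fake_page *dst, const struct fake_page *tmpl,
		     uint64_t pfn)
{
	struct fake_page tmp = *tmpl;

	tmp.pfn = pfn;
	/* Assumption of this sketch: dst is 8-byte aligned. */
	assert(((uintptr_t)dst & 7) == 0);
#if defined(__x86_64__)
	uint64_t w[sizeof(tmp) / sizeof(uint64_t)];

	memcpy(w, &tmp, sizeof(w));
	/* Non-temporal 64-bit stores (MOVNTI), one per word. */
	for (size_t i = 0; i < sizeof(w) / sizeof(w[0]); i++)
		_mm_stream_si64((long long *)dst + i, (long long)w[i]);
#else
	*dst = tmp;	/* plain stores on other architectures */
#endif
}

/*
 * Pass 1 only writes the entries; pass 2 reads and modifies them.
 * Returns the number of entries "released" in pass 2.
 */
static size_t init_then_release(struct fake_page *map, size_t nr,
				uint64_t start_pfn)
{
	const struct fake_page tmpl = { .flags = 0x1, .refcount = 1 };
	size_t released = 0;

	for (size_t i = 0; i < nr; i++)
		init_one(&map[i], &tmpl, start_pfn + i);
#if defined(__x86_64__)
	_mm_sfence();	/* order the streaming stores before pass 2 */
#endif
	for (size_t i = 0; i < nr; i++) {
		if (map[i].refcount == 1) {
			map[i].refcount = 0;
			released++;
		}
	}
	return released;
}
```

With the two passes separated, the first loop is a pure streaming write
over the whole range, and the reuse only starts once that loop is done.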

> > That said, a template-based approach can still accelerate initialization.
> > Based on measurements from this patchset, it should improve performance
> > on the generic path by roughly 10%. I would appreciate feedback on
> > whether such an optimization is still considered useful.
>
> Improving the memory map initialization by 10% is valuable.

Thank you for your feedback. I will try the template-based optimization
on the generic path after finishing the current patchset.

Thanks,
Zhe