Re: Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

From: Chengfeng Lin

Date: Fri May 22 2026 - 05:04:28 EST

Hi David,

Thanks for the pointer. I tested the current akpm/mm mm-unstable branch at
444fc9435e57, which contains Pedro's v3 two-patch mprotect series: the
softleaf refactor and the relevant small-folio / nr_ptes == 1 changes.

I first ran a local sanity check, and then reran the same shared-dirty
full-range toggle workload on the lab machine:

  kernels:          v6.12.77, v6.19.9, akpm/mm mm-unstable 444fc9435e57
  QEMU:             direct boot
  lab guest CPUs:   QEMU_SMP=1/2/4/8/16
  lab guest memory: 14336 MiB for 1/2/4 CPU, 16384 MiB for 8 CPU,
                    32768 MiB for 16 CPU
  repetitions:      9
  order:            interleaved
  coverage:         disabled

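The primary metric is cycle_ns_per_page (lower is better). Here "cycle" means
one workload iteration, not CPU cycles. "gap closed" is the share of the
v6.12.77 -> v6.19.9 slowdown that mm-unstable recovers, i.e.
(v6.19.9 - mm-unstable) / (v6.19.9 - v6.12.77):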

CPU   v6.12.77   v6.19.9   mm-unstable   mm-unstable vs v6.19   gap closed
  1      336.1     532.0         497.0            6.6% faster        17.9%
  2      369.2     581.9         503.3           13.5% faster        36.9%
  4      355.7     587.2         524.2           10.7% faster        27.2%
  8      369.7     583.6         534.2            8.5% faster        23.1%
 16      374.8     607.1         547.8            9.8% faster        25.5%

The 1/2/4/8 CPU rows completed 9/9 runs for all three kernels. In the
16 CPU row, v6.12.77 had one QEMU failure, so I would treat that row only
as a supporting trend.

So yes, Pedro's small-folio work does reduce this synthetic shared-dirty
signal in my setup, but it does not remove most of the gap to v6.12.77.
In terms of cycle_ns_per_page, it closes roughly 18-37% of the v6.12 ->
v6.19 gap in the clean 1/2/4/8 CPU lab rows, which leaves mm-unstable about
1.36x-1.48x slower than v6.12.77 in those rows.
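
For anyone re-deriving the two right-hand columns from the three means, the
arithmetic can be sketched as below. This is only an illustration: the helper
names are hypothetical and this is not the framework code that produced the
table.

```c
/*
 * Illustrative helpers (hypothetical names, not from the evidence bundle).
 * Inputs are mean cycle_ns_per_page values; lower is better.
 */

/* Percent by which `fixed` is faster than `slow`. */
static double pct_faster(double slow, double fixed)
{
	return 100.0 * (slow - fixed) / slow;
}

/* Percent of the base -> slow gap that `fixed` recovers. */
static double pct_gap_closed(double base, double slow, double fixed)
{
	return 100.0 * (slow - fixed) / (slow - base);
}
```

With the 1 CPU row, pct_faster(532.0, 497.0) is about 6.6 and
pct_gap_closed(336.1, 532.0, 497.0) is about 17.9, matching the table.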

I also ran a separate state-shape audit, because the MADV_PAGEOUT follow-up
showed that a timing delta can be misleading if the compared kernels are not
actually operating on the same page state. For this mprotect workload, the
successful runs across v6.12.77, v6.19.9, and mm-unstable all used the same
4 KiB shared-dirty PTE mapping shape:

expected_match_ratio = 100
unexpected_results = 0
final_vmas_avg = 1
present pages before/after protect = 16384 / 16384
AnonHugePages = 0
KernelPageSize/MMUPageSize = 4 KiB / 4 KiB
THPeligible = 0

The state audit used the same 1/2/4/8/16 CPU and memory matrix, with 5 runs
per kernel. The 1/2/4/8 CPU rows completed 5/5 for all three kernels; the
16 CPU row had one v6.19.9 QEMU failure, but the successful v6.19.9 runs had
the same state-shape values.
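
As a rough illustration of where values such as KernelPageSize, MMUPageSize,
AnonHugePages and THPeligible come from, the sketch below reads one field for
the VMA covering a given address out of /proc/self/smaps. The function name is
hypothetical and this is an assumption about how such a check can be done, not
the audit tooling from the linked summaries.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the numeric value of `key` (e.g.
 * "KernelPageSize", "AnonHugePages", "THPeligible") for the VMA that
 * contains `addr`, or -1 if the VMA or the key is not found.
 * Sizes are reported in kB by the kernel; THPeligible is a plain 0/1.
 */
static long smaps_field(const void *addr, const char *key)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[1024];
	size_t klen = strlen(key);
	int in_vma = 0;
	long val = -1;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		/* VMA header lines start with "start-end" in hex. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			if (in_vma)
				break;	/* left the target VMA without a match */
			in_vma = (uintptr_t)addr >= start &&
				 (uintptr_t)addr < end;
			continue;
		}

		/* Field lines look like "Key:   <number> [kB]". */
		if (in_vma && !strncmp(line, key, klen) && line[klen] == ':') {
			if (sscanf(line + klen + 1, "%ld", &val) != 1)
				val = -1;
			break;
		}
	}

	fclose(f);
	return val;
}
```

For the mapping shape reported above, one would expect
smaps_field(buf, "KernelPageSize") to return 4 and
smaps_field(buf, "AnonHugePages") to return 0.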

I put the follow-up summaries here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/mprotect-shared-dirty-toggle/mm-unstable-lab-sanity

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/mprotect-shared-dirty-toggle/state-audit-lab

Given Lorenzo's question and the synthetic nature of this workload, I will
not treat this as a strong regression claim until I can back it with a
standalone reproducer and/or a narrower bisect. If the remaining signal is
still worth characterizing, I can prepare either one.
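
A minimal sketch of what such a standalone reproducer could look like,
assuming the workload shape from the original report (anonymous shared
mapping, prefault, full-range mprotect() toggle, write-touch). The function
name, the timing boundaries and the choice to time the write-touch as part of
the cycle are assumptions; this is not the generated mprotect_paths_storm.c.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical reproducer core: toggle a shared anonymous mapping of
 * `bytes` between PROT_READ and PROT_READ|PROT_WRITE `cycles` times.
 * Returns the mean nanoseconds per page per cycle, or -1.0 on error.
 */
static double toggle_ns_per_page(size_t bytes, unsigned int cycles)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	uint64_t total = 0;
	unsigned int i;
	size_t off;

	if (page <= 0 || bytes < (size_t)page || cycles == 0)
		return -1.0;

	buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	/* Prefault every page and leave it dirty before any toggling. */
	memset(buf, 0x5a, bytes);

	for (i = 0; i < cycles; i++) {
		uint64_t t0 = now_ns();

		if (mprotect(buf, bytes, PROT_READ) ||
		    mprotect(buf, bytes, PROT_READ | PROT_WRITE)) {
			munmap(buf, bytes);
			return -1.0;
		}

		/* Write-touch one byte per page after the protection cycle. */
		for (off = 0; off < bytes; off += (size_t)page)
			buf[off]++;

		total += now_ns() - t0;
	}

	munmap(buf, bytes);
	return (double)total / cycles / (double)(bytes / (size_t)page);
}
```

Calling toggle_ns_per_page(64UL << 20, N) from a small main() would
approximate the 64 MiB case; the iteration count N and any CPU pinning are
left open here.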

Thanks,
Chengfeng

> -----Original Message-----
> From: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>
> Sent:Tuesday, 05/19/2026 01:01:54
> To: "David Hildenbrand (Arm)" <david@xxxxxxxxxx>
> Cc: "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx, "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, "Kairui Song" <kasong@xxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx, "Pedro Falcato" <pfalcato@xxxxxxx>, ljs@xxxxxxxxxx
> Subject: Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12
>
>
> Hi David,
>
> Thanks for the pointer. I have not tested Pedro's mprotect
> micro-optimization series yet.
>
> That series does look directly relevant to this report, especially the
> change_pte_range() / small-folio path. My current data only compares
> v6.12.77 with v6.19.9, so I do not yet know whether that newer series
> fixes or reduces the slowdown.
>
> I will rerun the shared-dirty toggle workload on a kernel with that series
> applied, or on the first branch/tag where it is included, and report back
> with the same 1/2/4 CPU matrix if possible.
>
> If it removes most of the delta, I will follow up and mark the report
> accordingly.
>
> Thanks,
> Chengfeng
>
> > -----Original Message-----
> > From: "David Hildenbrand (Arm)" <david@xxxxxxxxxx>
> > Sent:Monday, 05/18/2026 23:36:27
> > To: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx
> > Cc: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "Lorenzo Stoakes" <lorenzo.stoakes@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, "Kairui Song" <kasong@xxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx, "Pedro Falcato" <pfalcato@xxxxxxx>
> > Subject: Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
> >
> > On 5/18/26 15:01, Chengfeng Lin wrote:
> > > Hi,
> > >
> > > I would like to report a userspace-visible mprotect() performance
> > > regression in a shared dirty PTE workload.
> > >
> > > The workload is intentionally narrow:
> > >
> > > - anonymous shared 64 MiB mapping
> > > - prefault before protection changes
> > > - repeatedly toggle the whole range with mprotect(PROT_READ)
> > > - restore with mprotect(PROT_READ | PROT_WRITE)
> > > - write-touch after the protection cycle
> > >
> > > This is not meant as a generic mprotect() regression report. In
> > > particular, I am not claiming that the anon/THP mprotect paths regress.
> > > The current signal is scoped to the shared-dirty full-range PTE toggle
> > > path above.
> > >
> > > The current public evidence bundle is here:
> > >
> > > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle
> > >
> > > The generated workload source used for auditing the workload semantics is
> > > here:
> > >
> > > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c
> > >
> > > The formal experiment profile is here:
> > >
> > > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments
> > >
> > > The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> > > configuration, using QEMU direct boot. The formal performance runs were
> > > clean timing runs with coverage disabled. Coverage was collected
> > > separately and is not used for the timing numbers below.
> > >
> > > Lab environment:
> > >
> > > host label: lcf
> > > host kernel: Linux 6.14.0-37-generic x86_64
> > > QEMU: qemu-system-x86_64 8.2.2
> > > container/cgroup CPU set: 0,2,4,6,8,10,12,14
> > > container/cgroup memory limit: 16106127360 bytes
> > > guest memory: QEMU_MEM_MB=14336
> > > guest CPUs: QEMU_SMP=1/2/4
> > > repetitions: 9
> > > version order: interleaved
> > > performance coverage_enabled: false
> > >
> > > Primary result, cycle_ns_per_page, lower is better:
> > >
> > > CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12   reliability
> > >   1      346.8     578.1              40.0%         1.67x   reliable
> > >   2      394.7     641.7              38.5%         1.63x   robust-only
> > >   4      381.1     624.8              39.0%         1.64x   partial, same direction
> > >
> > > The strongest current result is the 1CPU lab formal result. The 2CPU case
> > > is same-direction but robust-only in the framework classification. The
> > > 4CPU case is same-direction but partial because one QEMU run failed; the
> > > summary still has 8 successful runs for that CPU count.
> > >
> > > The current mechanism hypothesis is local to the shared-dirty PTE path.
> > > In v6.19, the measured hot path goes through the change_pte_range()
> > > batching machinery:
> > >
> > > change_pte_range()
> > > -> mprotect_folio_pte_batch()
> > > -> modify_prot_start_ptes()
> > > -> set_write_prot_commit_flush_ptes()
> > > -> prot_commit_flush_ptes()
> > >
> > > For this shared-dirty workload, follow-up batch-probe attribution showed
> > > nr_ptes=1 in the measured path. The hypothesis is that the extra folio
> > > lookup, batch-size query, helper dispatch, and commit machinery are paid
> > > per 4 KiB PTE without effective batch-size amortization in this workload.
> > > This is mechanism interpretation, not a completed culprit-commit bisect.
> > >
> > > I have not bisected the exact culprit commit yet. Separate release-level
> > > sanity checks showed v6.18.19 already in the slow range, so the current
> > > best reporting range is:
> > >
> > > #regzbot introduced: v6.12..v6.18
> > >
> > > Please let me know if a standalone reproducer, a narrower bisect, or
> > > additional raw logs would be more useful.
> >
> > Pedro recently optimized this:
> >
> > https://lore.kernel.org/all/20260402141628.3367596-1-pfalcato@xxxxxxx/
> >
> > Maybe that fixes most of the regression for you?
> >
> > --
> > Cheers,
> >
> > David