Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

From: Chengfeng Lin

Date: Mon May 18 2026 - 13:08:12 EST

Hi David,

Thanks for the pointer. I have not tested Pedro's mprotect
micro-optimization series yet.

That series does look directly relevant to this report, especially the
change_pte_range() / small-folio path. My current data only compares
v6.12.77 with v6.19.9, so I do not yet know whether that newer series
fixes or reduces the slowdown.

I will rerun the shared-dirty toggle workload on a kernel with that series
applied, or on the first branch/tag where it is included, and report back
with the same 1/2/4 CPU matrix if possible.

If it removes most of the delta, I will follow up and mark the report
accordingly.

Thanks,
Chengfeng

> -----Original Message-----
> From: "David Hildenbrand (Arm)" <david@xxxxxxxxxx>
> Sent: Monday, 05/18/2026 23:36:27
> To: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx
> Cc: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "Lorenzo Stoakes" <lorenzo.stoakes@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, "Kairui Song" <kasong@xxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx, "Pedro Falcato" <pfalcato@xxxxxxx>
> Subject: Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
>
> On 5/18/26 15:01, Chengfeng Lin wrote:
> > Hi,
> >
> > I would like to report a userspace-visible mprotect() performance
> > regression in a shared dirty PTE workload.
> >
> > The workload is intentionally narrow:
> >
> > - anonymous shared 64 MiB mapping
> > - prefault before protection changes
> > - repeatedly toggle the whole range with mprotect(PROT_READ)
> > - restore with mprotect(PROT_READ | PROT_WRITE)
> > - write-touch after the protection cycle
> >
> > This is not meant as a generic mprotect() regression report. In
> > particular, I am not claiming that the anon/THP mprotect paths regress.
> > The current signal is scoped to the shared-dirty full-range PTE toggle
> > path above.
> >
> > The current public evidence bundle is here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle
> >
> > The generated workload source used for auditing the workload semantics is
> > here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c
> >
> > The formal experiment profile is here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments
> >
> > The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> > configuration, using QEMU direct boot. The formal performance runs were
> > clean timing runs with coverage disabled. Coverage was collected
> > separately and is not used for the timing numbers below.
> >
> > Lab environment:
> >
> >     host label:                     lcf
> >     host kernel:                    Linux 6.14.0-37-generic x86_64
> >     QEMU:                           qemu-system-x86_64 8.2.2
> >     container/cgroup CPU set:       0,2,4,6,8,10,12,14
> >     container/cgroup memory limit:  16106127360 bytes
> >     guest memory:                   QEMU_MEM_MB=14336
> >     guest CPUs:                     QEMU_SMP=1/2/4
> >     repetitions:                    9
> >     version order:                  interleaved
> >     performance coverage_enabled:   false
> >
> > Primary result, cycle_ns_per_page, lower is better:
> >
> >     CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12   reliability
> >     1     346.8      578.1     40.0%              1.67x         reliable
> >     2     394.7      641.7     38.5%              1.63x         robust-only
> >     4     381.1      624.8     39.0%              1.64x         partial, same direction
> >
> > The strongest current result is the 1CPU lab formal result. The 2CPU case
> > is same-direction but robust-only in the framework classification. The
> > 4CPU case is same-direction but partial because one QEMU run failed; the
> > summary still has 8 successful runs for that CPU count.
> >
> > The current mechanism hypothesis is local to the shared-dirty PTE path.
> > In v6.19, the measured hot path goes through the change_pte_range()
> > batching machinery:
> >
> >     change_pte_range()
> >       -> mprotect_folio_pte_batch()
> >       -> modify_prot_start_ptes()
> >       -> set_write_prot_commit_flush_ptes()
> >       -> prot_commit_flush_ptes()
> >
> > For this shared-dirty workload, follow-up batch-probe attribution showed
> > nr_ptes=1 in the measured path. The hypothesis is that the extra folio
> > lookup, batch-size query, helper dispatch, and commit machinery are paid
> > per 4 KiB PTE without effective batch-size amortization in this workload.
> > This is mechanism interpretation, not a completed culprit-commit bisect.
> >
> > I have not bisected the exact culprit commit yet. Separate release-level
> > sanity checks showed v6.18.19 already in the slow range, so the current
> > best reporting range is:
> >
> > #regzbot introduced: v6.12..v6.18
> >
> > Please let me know if a standalone reproducer, a narrower bisect, or
> > additional raw logs would be more useful.
>
> Pedro recently optimized this:
>
> https://lore.kernel.org/all/20260402141628.3367596-1-pfalcato@xxxxxxx/
>
> Maybe that fixes most of the regression for you?
>
> --
> Cheers,
>
> David