Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

From: Chengfeng Lin

Date: Mon May 18 2026 - 12:52:41 EST

Hi Lorenzo,

Sorry about the stale address. I will use ljs@xxxxxxxxxx for future
kernel mails.

This is a synthetic/source-calibrated userspace micro-workload, not a
regression I observed in a production application.

The workload was generated from the mm/mprotect.c path and then narrowed
to the shared-dirty full-range PTE toggle case where the timing signal was
stable enough to report. So the intended claim is limited to "this legal
userspace mprotect pattern regressed in the test setup", not "a known real
application workload regressed".

I agree that this makes the report weaker than an application-level
regression. I sent it because the delta is large in the clean 1CPU formal
run (~1.67x slower on v6.19 vs v6.12), and the path looked plausibly tied
to the change_pte_range() batching path where the shared-dirty case did
not form an effective batch in my probe runs.
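
For reference, the ~1.67x figure and the "old-lower-vs-new" column follow
from the 1CPU numbers in the original report (346.8 and 578.1 ns per page).
The small sketch below reproduces that arithmetic. The helper names are
hypothetical, and reading "old-lower-vs-new" as (new - old) / new is an
assumption that happens to match all three rows of the table.

```c
#include <assert.h>

/* Hypothetical helper: how many times slower the new kernel is. */
static double slowdown_ratio(double old_ns, double new_ns)
{
	assert(old_ns > 0.0);
	return new_ns / old_ns;
}

/*
 * Hypothetical helper: how much lower the old number is, as a
 * percentage of the new number. This reading of "old-lower-vs-new"
 * is an assumption, not taken from the framework's source.
 */
static double old_lower_pct(double old_ns, double new_ns)
{
	assert(new_ns > 0.0);
	return 100.0 * (new_ns - old_ns) / new_ns;
}
```

With the 1CPU inputs, slowdown_ratio(346.8, 578.1) comes out just under
1.667 and old_lower_pct(346.8, 578.1) just over 40.0.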

David also pointed me at Pedro's recent mprotect micro-optimization
series. I have not tested that yet, so I will first check whether that
series already removes most of this synthetic signal. If it does, I will
follow up and mark this accordingly. If the signal remains, I can prepare
a standalone reproducer and/or try to bisect the exact culprit commit
before asking you to spend more time on it.
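
Until then, the pattern under test can be sketched roughly as below. This
is a minimal sketch written for this mail, not the generated
mprotect_paths_storm.c from the evidence bundle. The helper names, the
choice to time the write-touch together with the two mprotect() calls, and
the per-page averaging are all assumptions, so absolute numbers from it are
not comparable with the cycle_ns_per_page table.

```c
/* Sketch only: hypothetical helpers, not the generated workload. */
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Write one byte per page so every PTE is present and dirty. */
static void touch_pages(volatile unsigned char *buf, size_t len, size_t page)
{
	for (size_t off = 0; off < len; off += page)
		buf[off]++;
}

/*
 * Map an anonymous shared region, prefault it, then repeatedly toggle
 * the whole range read-only and back, write-touching after each cycle.
 * Returns average nanoseconds per page per cycle, or -1.0 on failure.
 */
static double toggle_cycle_ns_per_page(size_t len, int iters)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	uint64_t start, end;

	assert(iters > 0 && len >= page && len % page == 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	touch_pages(buf, len, page);	/* prefault before protection changes */

	start = now_ns();
	for (int i = 0; i < iters; i++) {
		if (mprotect(buf, len, PROT_READ) ||
		    mprotect(buf, len, PROT_READ | PROT_WRITE)) {
			munmap(buf, len);
			return -1.0;
		}
		touch_pages(buf, len, page);	/* write-touch after the cycle */
	}
	end = now_ns();

	munmap(buf, len);
	return (double)(end - start) / ((double)iters * (double)(len / page));
}
```

Calling toggle_cycle_ns_per_page(64UL << 20, 100) would approximate the
64 MiB mapping from the original report; the iteration count of 100 is
arbitrary.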

Sorry for the noise, and thanks for taking a look.

Thanks,
Chengfeng

> -----Original Message-----
> From: "Lorenzo Stoakes" <ljs@xxxxxxxxxx>
> Sent: 2026-05-18 23:43:10 (Monday)
> To: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>
> Cc: "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx, "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "David Hildenbrand" <david@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, "Kairui Song" <kasong@xxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx
> Subject: Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
>
> -cc wrong email
>
> One day I will get to stop nagging like this :) Or just ignore mails that go to
> the wrong place.
>
> Please use ljs@xxxxxxxxxx. I switched over a while ago. I tend to mark kernel
> mails that go to my work address read without reading them.
>
> People regularly update their emails, so it's important to re-check them when
> you send a new mail.
>
> On Mon, May 18, 2026 at 09:01:02PM +0800, Chengfeng Lin wrote:
> > Hi,
> >
> > I would like to report a userspace-visible mprotect() performance
> > regression in a shared dirty PTE workload.
> >
> > The workload is intentionally narrow:
> >
> > - anonymous shared 64 MiB mapping
> > - prefault before protection changes
> > - repeatedly toggle the whole range with mprotect(PROT_READ)
> > - restore with mprotect(PROT_READ | PROT_WRITE)
> > - write-touch after the protection cycle
> >
> > This is not meant as a generic mprotect() regression report. In
> > particular, I am not claiming that the anon/THP mprotect paths regress.
> > The current signal is scoped to the shared-dirty full-range PTE toggle
> > path above.
> >
> > The current public evidence bundle is here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle
> >
> > The generated workload source used for auditing the workload semantics is
> > here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c
> >
> > The formal experiment profile is here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments
> >
> > The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> > configuration, using QEMU direct boot. The formal performance runs were
> > clean timing runs with coverage disabled. Coverage was collected
> > separately and is not used for the timing numbers below.
> >
> > Lab environment:
> >
> > host label: lcf
> > host kernel: Linux 6.14.0-37-generic x86_64
> > QEMU: qemu-system-x86_64 8.2.2
> > container/cgroup CPU set: 0,2,4,6,8,10,12,14
> > container/cgroup memory limit: 16106127360 bytes
> > guest memory: QEMU_MEM_MB=14336
> > guest CPUs: QEMU_SMP=1/2/4
> > repetitions: 9
> > version order: interleaved
> > performance coverage_enabled: false
> >
> > Primary result, cycle_ns_per_page, lower is better:
> >
> > CPU  v6.12.77  v6.19.9  old-lower-vs-new  v6.19/v6.12  reliability
> > 1    346.8     578.1    40.0%             1.67x        reliable
> > 2    394.7     641.7    38.5%             1.63x        robust-only
> > 4    381.1     624.8    39.0%             1.64x        partial, same direction
> >
> > The strongest current result is the 1CPU lab formal result. The 2CPU case
> > is same-direction but robust-only in the framework classification. The
> > 4CPU case is same-direction but partial because one QEMU run failed; the
> > summary still has 8 successful runs for that CPU count.
> >
> > The current mechanism hypothesis is local to the shared-dirty PTE path.
> > In v6.19, the measured hot path goes through the change_pte_range()
> > batching machinery:
> >
> > change_pte_range()
> > -> mprotect_folio_pte_batch()
> > -> modify_prot_start_ptes()
> > -> set_write_prot_commit_flush_ptes()
> > -> prot_commit_flush_ptes()
> >
> > For this shared-dirty workload, follow-up batch-probe attribution showed
> > nr_ptes=1 in the measured path. The hypothesis is that the extra folio
> > lookup, batch-size query, helper dispatch, and commit machinery are paid
> > per 4 KiB PTE without effective batch-size amortization in this workload.
> > This is mechanism interpretation, not a completed culprit-commit bisect.
> >
> > I have not bisected the exact culprit commit yet. Separate release-level
> > sanity checks showed v6.18.19 already in the slow range, so the current
> > best reporting range is:
> >
> > #regzbot introduced: v6.12..v6.18
> >
> > Please let me know if a standalone reproducer, a narrower bisect, or
> > additional raw logs would be more useful.
>
> Is this really a regression you're seeing in real workloads or synthetic?
>
> Thanks, Lorenzo