[REGRESSION] mm: mprotect() shared-dirty full-range PTE toggle takes ~1.6-1.7x longer on v6.19 than v6.12

From: Chengfeng Lin

Date: Mon May 18 2026 - 09:12:45 EST

Hi,

I would like to report a userspace-visible mprotect() performance
regression in a shared-dirty PTE workload.

The workload is intentionally narrow:

- anonymous shared 64 MiB mapping
- prefault before protection changes
- repeatedly toggle the whole range with mprotect(PROT_READ)
- restore with mprotect(PROT_READ | PROT_WRITE)
- write-touch after the protection cycle
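
For readers who want the shape of the loop without opening the bundle,
the steps above can be sketched roughly as follows. This is an
illustrative reconstruction, not the mprotect_paths_storm.c source
linked below; the function names, the one-byte-per-page touch pattern,
and the choice to time protect + restore + write-touch together are all
my assumptions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Write one byte per page so every PTE is present and dirty. */
static void touch_pages(volatile unsigned char *buf, size_t len,
                        size_t page, unsigned char val)
{
    for (size_t off = 0; off < len; off += page)
        buf[off] = val;
}

/*
 * Hypothetical helper: average ns per page per protect/restore/touch
 * cycle on a shared anonymous mapping, or -1.0 on failure.
 */
static double toggle_cycle_ns_per_page(size_t len, int cycles)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *buf;
    uint64_t start, elapsed;

    if (cycles <= 0 || len < page)
        return -1.0;

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1.0;

    touch_pages(buf, len, page, 1);          /* prefault */

    start = now_ns();
    for (int i = 0; i < cycles; i++) {
        /* Whole-range toggle: drop write permission, then restore it. */
        if (mprotect(buf, len, PROT_READ) != 0 ||
            mprotect(buf, len, PROT_READ | PROT_WRITE) != 0) {
            munmap(buf, len);
            return -1.0;
        }
        touch_pages(buf, len, page, (unsigned char)(i + 2));
    }
    elapsed = now_ns() - start;

    /* The last write-touch must have landed after the restore. */
    assert(buf[0] == (unsigned char)(cycles + 1));
    munmap(buf, len);
    return (double)elapsed / ((double)cycles * (double)(len / page));
}
```

Calling toggle_cycle_ns_per_page(64UL << 20, N) from a small main()
matches the 64 MiB mapping size above; the absolute numbers from this
sketch should not be compared directly with the formal results.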

This is not meant as a generic mprotect() regression report. In
particular, I am not claiming that the anon/THP mprotect paths regress.
The current signal is scoped to the shared-dirty full-range PTE toggle
path above.

The current public evidence bundle is here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle

The generated workload source used for auditing the workload semantics is
here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c

The formal experiment profile is here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments

The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
configuration, using QEMU direct boot. The formal performance runs were
clean timing runs with coverage disabled. Coverage was collected
separately and is not used for the timing numbers below.

Lab environment:

  host label:                     lcf
  host kernel:                    Linux 6.14.0-37-generic x86_64
  QEMU:                           qemu-system-x86_64 8.2.2
  container/cgroup CPU set:       0,2,4,6,8,10,12,14
  container/cgroup memory limit:  16106127360 bytes
  guest memory:                   QEMU_MEM_MB=14336
  guest CPUs:                     QEMU_SMP=1/2/4
  repetitions:                    9
  version order:                  interleaved
  performance coverage_enabled:   false

Primary result, cycle_ns_per_page, lower is better:

  CPU  v6.12.77  v6.19.9  old-lower-vs-new  v6.19/v6.12  reliability
    1     346.8    578.1             40.0%        1.67x  reliable
    2     394.7    641.7             38.5%        1.63x  robust-only
    4     381.1    624.8             39.0%        1.64x  partial, same direction
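
For clarity on the two derived columns: the reading below reproduces the
table values from the raw cycle_ns_per_page numbers. The helper names are
mine and only illustrate the arithmetic; they are not part of the
evidence bundle.

```c
#include <assert.h>

/* "v6.19/v6.12" column: new time divided by old time. */
static double slowdown_ratio(double old_ns, double new_ns)
{
    assert(old_ns > 0.0 && new_ns > 0.0);
    return new_ns / old_ns;
}

/*
 * "old-lower-vs-new" column: how much lower the old time is,
 * expressed as a percentage of the new time.
 */
static double old_lower_pct(double old_ns, double new_ns)
{
    assert(old_ns > 0.0 && new_ns > 0.0);
    return (1.0 - old_ns / new_ns) * 100.0;
}
```

For the 1-CPU row, slowdown_ratio(346.8, 578.1) is about 1.67 and
old_lower_pct(346.8, 578.1) is about 40.0.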

The strongest current result is the 1-CPU formal lab run. The 2-CPU case
points in the same direction but is classified robust-only by the
framework. The 4-CPU case also points in the same direction but is marked
partial because one of the 9 QEMU runs failed; the summary still has
8 successful runs for that CPU count.

The current mechanism hypothesis is local to the shared-dirty PTE path.
In v6.19, the measured hot path goes through the change_pte_range()
batching machinery:

  change_pte_range()
  -> mprotect_folio_pte_batch()
  -> modify_prot_start_ptes()
  -> set_write_prot_commit_flush_ptes()
  -> prot_commit_flush_ptes()

For this shared-dirty workload, follow-up batch-probe attribution showed
nr_ptes=1 in the measured path. The hypothesis is that the extra folio
lookup, batch-size query, helper dispatch, and commit machinery are paid
once per 4 KiB PTE, with no batch to amortize them over in this workload.
This is an interpretation of the mechanism, not the result of a completed
culprit-commit bisect.
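
To make the amortization argument concrete, here is a toy cost model.
It is only a sketch of the reasoning, not kernel code and not a
measurement: per_batch_ns and per_pte_ns are hypothetical parameters,
and the function name is mine.

```c
#include <assert.h>

/*
 * Toy model: a fixed per-batch cost (folio lookup, batch-size query,
 * helper dispatch, commit) is spread over nr_ptes pages, on top of a
 * per-PTE cost that is paid regardless of batching.
 */
static double modeled_ns_per_page(double per_batch_ns, double per_pte_ns,
                                  unsigned int nr_ptes)
{
    assert(nr_ptes > 0);
    return per_batch_ns / (double)nr_ptes + per_pte_ns;
}
```

With nr_ptes=1 the whole per-batch cost lands on every page; any larger
batch would divide it by the batch size. Whether the real overhead
behaves this way is what a bisect or finer-grained profiling would need
to confirm.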

I have not bisected the exact culprit commit yet. Separate release-level
sanity checks showed v6.18.19 already in the slow range, so the current
best reporting range is:

#regzbot introduced: v6.12..v6.18

Please let me know if a standalone reproducer, a narrower bisect, or
additional raw logs would be more useful.