Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12

From: Kairui Song

Date: Mon May 18 2026 - 14:31:30 EST

On Mon, May 18, 2026 at 9:09 PM Chengfeng Lin
<23020251154299@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I would like to report a userspace-visible performance regression in a
> MADV_PAGEOUT workload.

Hi Chengfeng,

Thanks for the report. Very interesting, though I find it a bit
confusing: you are doing MADV_PAGEOUT on anon pages with no swap set
up, so nothing is actually being paged out?

> The workload is intentionally narrow:
>
> - map 16 MiB anonymous memory
> - use the default THP policy
> - run in a guest with no configured swap
> - call madvise(MADV_PAGEOUT)
> - refault/write-touch the mapping
>
> This is not meant as a generic madvise() or generic MADV_PAGEOUT
> regression report. The signal is currently scoped to the THP + no-swap +
> refault/write-touch workflow above.
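
A minimal sketch of the steps listed above, for anyone following along.
It is only a guess at the shape of the workload, not the code from the
linked repository: pageout_cycle(), now_ns() and write_touch() are
hypothetical names, and timing the madvise() call and the touch loop
with CLOCK_MONOTONIC is an assumption about how the segments are
measured.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* Linux UAPI value, for older libc headers */
#endif

/* Monotonic clock in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Write one byte per base page so every page is touched. */
static void write_touch(volatile unsigned char *p, size_t len, size_t page)
{
    for (size_t off = 0; off < len; off += page)
        p[off] = (unsigned char)off;
}

/*
 * Map len bytes of anonymous memory (default THP policy, no
 * MADV_HUGEPAGE), populate it, call MADV_PAGEOUT, then write-touch
 * it again. Returns 0 on success, -1 if mmap() fails, -2 if
 * madvise() fails. Times are totals in ns, not per page.
 */
int pageout_cycle(size_t len, int64_t *advise_ns, int64_t *touch_ns)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return -1;

    write_touch(buf, len, page);            /* initial population */

    int64_t t0 = now_ns();
    int rc = madvise(buf, len, MADV_PAGEOUT);
    int64_t t1 = now_ns();

    if (rc != 0) {
        munmap(buf, len);
        return -2;
    }

    write_touch(buf, len, page);            /* post-advise touch */
    int64_t t2 = now_ns();

    *advise_ns = t1 - t0;
    *touch_ns = t2 - t1;
    munmap(buf, len);
    return 0;
}
```

Calling pageout_cycle((size_t)16 << 20, &adv, &touch) matches the
16 MiB mapping in the report. Dividing the two totals by the number of
base pages would give per-page figures, but whether that is how
advise_ns_per_page is normalized is an assumption.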
>
> The current public evidence bundle is here:
>
> https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault
>
> The standalone workload source is here:
>
> https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/workload
>
> The formal experiment profile is here:
>
> https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/experiments
>
> The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> configuration, using QEMU direct boot. The formal performance runs were
> clean timing runs with coverage disabled. Coverage was collected
> separately and is not used for the timing numbers below.
>
> Lab environment:
>
> host label: lcf
> host kernel: Linux 6.14.0-37-generic x86_64
> QEMU: qemu-system-x86_64 8.2.2
> container/cgroup CPU set: 0,2,4,6,8,10,12,14
> container/cgroup memory limit: 16106127360 bytes
> guest memory: QEMU_MEM_MB=14336
> guest CPUs: QEMU_SMP=1/2/4
> repetitions: 9
> version order: interleaved
> performance coverage_enabled: false
>
> Primary result, cycle_ns_per_page, lower is better:

What is cycle_ns_per_page? The name is ambiguous: "cycle" reads like
CPU cycles, but the values don't match that. Or do you mean one
iteration of the test? Can you make this clearer?

> CPU  v6.12.77  v6.19.9  old-lower-vs-new  v6.19/v6.12
> 1    1900.3    3304.7   42.5%             1.74x
> 2    2107.7    3583.2   41.2%             1.70x
> 4    2154.2    3690.9   41.6%             1.71x
>
> MADV_PAGEOUT syscall/reclaim-side segment, advise_ns_per_page, lower is
> better:
>
> CPU  v6.12.77  v6.19.9  old-lower-vs-new  v6.19/v6.12
> 1    1713.2    2922.7   41.4%             1.71x
> 2    1924.7    3162.9   39.1%             1.64x
> 4    1953.1    3284.2   40.5%             1.68x
>
> The current mechanism interpretation is that the timing difference is in
> the MADV_PAGEOUT/reclaim part, not primarily in the later refault touch.
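
A quick cross-check of that attribution, using only the numbers in the
two tables above. The sketch assumes advise_ns_per_page is a
sub-interval of cycle_ns_per_page, which the report does not state
explicitly; advise_share() is a hypothetical helper name.

```c
/*
 * Fraction of the end-to-end slowdown that falls inside the
 * madvise() segment:
 *   (advise_new - advise_old) / (cycle_new - cycle_old)
 * All inputs are ns per page as listed in the two tables.
 * The caller must make sure cycle_new != cycle_old.
 */
double advise_share(double cycle_old, double cycle_new,
                    double advise_old, double advise_new)
{
    return (advise_new - advise_old) / (cycle_new - cycle_old);
}
```

For the 1-CPU rows, advise_share(1900.3, 3304.7, 1713.2, 2922.7) comes
to about 0.86; the 2-CPU and 4-CPU rows give about 0.84 and 0.87. So,
under the assumption above, most of the delta is in the advise segment,
but not all of it.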

Right. With no swap configured, the post-advise touch should not be
faulting at all, I think? IIUC it will just split the pages, which
might cause more TLB pressure afterwards. But this
split-on-alloc-failure path itself wasn't changed from 6.12 to 6.18.
If there is no swap we could skip the split, though, and maybe make
folio_alloc_swap() fail a bit faster. Skipping the split seems the
more meaningful of the two and could significantly speed up this
specific test, if that turns out to be necessary.
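
One cheap way to check whether the split is really happening would be
to snapshot the THP split counters in /proc/vmstat before and after the
madvise() call. A rough sketch follows; vmstat_read() and
vmstat_counter() are hypothetical helper names, and which counters
(for example thp_split_page or thp_split_pmd) are present depends on
the kernel version and on CONFIG_TRANSPARENT_HUGEPAGE.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter by name in a /proc/vmstat-style stream
 * ("name value" per line). Returns the value, or -1 if the name
 * is not found.
 */
long long vmstat_read(FILE *f, const char *name)
{
    char key[128];
    long long val;

    rewind(f);
    while (fscanf(f, "%127s %lld", key, &val) == 2) {
        if (strcmp(key, name) == 0)
            return val;
    }
    return -1;
}

/* Convenience wrapper that reads the live /proc/vmstat. */
long long vmstat_counter(const char *name)
{
    FILE *f = fopen("/proc/vmstat", "r");
    long long val;

    if (!f)
        return -1;
    val = vmstat_read(f, name);
    fclose(f);
    return val;
}
```

Taking vmstat_counter("thp_split_page") right before and right after
the madvise(MADV_PAGEOUT) call and comparing the difference on both
kernels would show whether the number of splits differs. The counters
are system-wide, so this only makes sense on an otherwise idle guest.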

> The path evidence points at the no-swap reclaim/swap-allocation-failure
> chain:
>
> madvise(MADV_PAGEOUT)
>   -> reclaim_pages()
>   -> shrink_folio_list()
>   -> folio_alloc_swap()
>   -> swap allocation failure path

What you posted, including that repo, seems to have only end-to-end
timing. It would be much more useful if you could collect a perf or
ftrace breakdown, something like:
https://lore.kernel.org/all/CAMgjq7BfO=dNYep4z1aS7nUAJU3bktR17gYAufx=kkLudq4dAQ@xxxxxxxxxxxxxx/

Or like this:
https://lore.kernel.org/all/20260422003412.11678-1-xueyuan.chen21@xxxxxxxxx/