Re: Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12

From: Chengfeng Lin

Date: Tue May 19 2026 - 03:34:52 EST


Hi,

Thanks for looking at this. Your comments make sense, and sorry for the
confusing wording in the original report.

You are right that, with no swap configured, this should not be described
as a real pageout/refault case. A more accurate description is a
MADV_PAGEOUT anon/THP no-swap reclaim-failure path. The later write-touch
is just part of the workload iteration; it should not be presented as
evidence that the pages were actually paged out and faulted back in.
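
One way to check that nothing is actually paged out is to ask mincore()
how many pages are still resident right after the madvise() call. The
following is a minimal sketch, not the workload from the evidence bundle:
the helper name resident_pages_after_pageout() is hypothetical, and the
fallback value for MADV_PAGEOUT assumes the generic Linux UAPI definition.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* assumption: generic Linux UAPI value */
#endif

/*
 * Hypothetical helper: map len bytes of anonymous memory, populate it,
 * call madvise(MADV_PAGEOUT), and return the number of base pages that
 * mincore() still reports as resident. Returns -1 on any error.
 */
static long resident_pages_after_pageout(size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0 || len == 0 || len % (size_t)page_size != 0)
        return -1;
    size_t npages = len / (size_t)page_size;

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    long resident = -1;
    unsigned char *vec = malloc(npages);
    if (!vec)
        goto out;

    memset(buf, 0xa5, len); /* populate the mapping before advising */

    if (madvise(buf, len, MADV_PAGEOUT) != 0)
        goto out;
    if (mincore(buf, len, vec) != 0)
        goto out;

    /* Bit 0 of each vec entry is set when the page is resident. */
    resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    assert((size_t)resident <= npages);
out:
    free(vec);
    munmap(buf, len);
    return resident;
}
```

Calling resident_pages_after_pageout(16u << 20) in the no-swap guest
should, if the description above is right, return a count equal or close
to the full page count, since reclaim cannot allocate swap for the pages.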

Also, cycle_ns_per_page is poorly named. Here "cycle" means one full
workload iteration, not CPU cycles. The metric is wall-clock nanoseconds
per page for one complete workload iteration. I will make that clearer in
future descriptions.
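
To make the two metrics concrete, here is a minimal sketch of how they
can be derived from CLOCK_MONOTONIC timestamps. It is an illustration
under assumptions, not the code from the evidence bundle: the names
time_one_iteration() and struct iter_timing are hypothetical, and the
exact boundaries of the timed window (here: madvise() plus the following
write-touch, with the initial populate excluded) are an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* assumption: generic Linux UAPI value */
#endif

struct iter_timing {
    double advise_ns_per_page; /* madvise(MADV_PAGEOUT) only */
    double cycle_ns_per_page;  /* madvise + write-touch, wall clock */
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Write one byte per page so every page in the range is touched. */
static void write_touch(volatile unsigned char *buf, size_t len, size_t step)
{
    for (size_t off = 0; off < len; off += step)
        buf[off] = (unsigned char)off;
}

/* Hypothetical helper: time one iteration; returns 0 on success, -1 on error. */
static int time_one_iteration(size_t len, struct iter_timing *out)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0 || len == 0 || len % (size_t)page_size != 0)
        return -1;
    size_t npages = len / (size_t)page_size;

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    write_touch(buf, len, (size_t)page_size); /* populate, not timed */

    uint64_t t0 = now_ns();
    int rc = madvise(buf, len, MADV_PAGEOUT);
    uint64_t t1 = now_ns();
    write_touch(buf, len, (size_t)page_size);
    uint64_t t2 = now_ns();

    munmap(buf, len);
    if (rc != 0)
        return -1;

    assert(t1 >= t0 && t2 >= t1); /* CLOCK_MONOTONIC does not go backwards */
    out->advise_ns_per_page = (double)(t1 - t0) / (double)npages;
    out->cycle_ns_per_page = (double)(t2 - t0) / (double)npages;
    return 0;
}
```

With this split, cycle_ns_per_page is always at least
advise_ns_per_page, and the difference between the two is the cost of
the later write-touch.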

I agree that the end-to-end timing by itself is not enough to show where
the time is spent. I will collect a perf or ftrace breakdown comparing
v6.12.77 and v6.19.9, and follow up with the results on this thread.

Thanks,
Chengfeng



> -----Original Message-----
> From: "Kairui Song" <ryncsn@xxxxxxxxx>
> Sent: Tuesday, 05/19/2026 02:14:41
> To: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>
> Cc: "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx, "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "Lorenzo Stoakes" <lorenzo.stoakes@xxxxxxxxxx>, "David Hildenbrand" <david@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx
> Subject: Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
>
> On Mon, May 18, 2026 at 9:09 PM Chengfeng Lin
> <23020251154299@xxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I would like to report a userspace-visible performance regression in a
> > MADV_PAGEOUT workload.
>
> Hi Chengfeng,
>
> Thanks for the report. Very interesting, but it looks a bit confusing.
> You are doing PAGEOUT of anon pages with no swap set up, so nothing is
> actually being paged out?
>
> > The workload is intentionally narrow:
> >
> > - map 16 MiB anonymous memory
> > - use the default THP policy
> > - run in a guest with no configured swap
> > - call madvise(MADV_PAGEOUT)
> > - refault/write-touch the mapping
> >
> > This is not meant as a generic madvise() or generic MADV_PAGEOUT
> > regression report. The signal is currently scoped to the THP + no-swap +
> > refault/write-touch workflow above.
> >
> > The current public evidence bundle is here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault
> >
> > The standalone workload source is here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/workload
> >
> > The formal experiment profile is here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/experiments
> >
> > The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> > configuration, using QEMU direct boot. The formal performance runs were
> > clean timing runs with coverage disabled. Coverage was collected
> > separately and is not used for the timing numbers below.
> >
> > Lab environment:
> >
> > host label: lcf
> > host kernel: Linux 6.14.0-37-generic x86_64
> > QEMU: qemu-system-x86_64 8.2.2
> > container/cgroup CPU set: 0,2,4,6,8,10,12,14
> > container/cgroup memory limit: 16106127360 bytes
> > guest memory: QEMU_MEM_MB=14336
> > guest CPUs: QEMU_SMP=1/2/4
> > repetitions: 9
> > version order: interleaved
> > performance coverage_enabled: false
> >
> > Primary result, cycle_ns_per_page, lower is better:
>
> What is cycle_ns_per_page? The name is ambiguous. "cycle" reads like
> CPU cycles but the value doesn't match. Or do you mean iteration? Can
> you make this clearer?
>
> > CPU  v6.12.77  v6.19.9  old-lower-vs-new  v6.19/v6.12
> > 1    1900.3    3304.7   42.5%             1.74x
> > 2    2107.7    3583.2   41.2%             1.70x
> > 4    2154.2    3690.9   41.6%             1.71x
> >
> > MADV_PAGEOUT syscall/reclaim-side segment, advise_ns_per_page, lower is
> > better:
> >
> > CPU  v6.12.77  v6.19.9  old-lower-vs-new  v6.19/v6.12
> > 1    1713.2    2922.7   41.4%             1.71x
> > 2    1924.7    3162.9   39.1%             1.64x
> > 4    1953.1    3284.2   40.5%             1.68x
> >
> > The current mechanism interpretation is that the timing difference is in
> > the MADV_PAGEOUT/reclaim part, not primarily in the later refault touch.
>
> Right, with no swap configured, the post-advise should not be faulting
> at all I think? It will just split the pages IIUC, however, that might
> cause more TLB pressure. But this split-on-alloc-failure path itself
> wasn't changed from 6.12 to 6.18. If there is no swap, we can skip the
> split though, and maybe make folio_alloc_swap fail a bit faster.
> Skipping the split seems more meaningful and could significantly speed
> up this specific test, if this is necessary.
>
> > The path evidence points at the no-swap reclaim/swap-allocation-failure
> > chain:
> >
> > madvise(MADV_PAGEOUT)
> > -> reclaim_pages()
> > -> shrink_folio_list()
> > -> folio_alloc_swap()
> > -> swap allocation failure path
>
> What you posted, including that repo, seems to have only end-to-end
> timing. It would be much more useful if you collected a perf or ftrace
> breakdown, something like:
> https://lore.kernel.org/all/CAMgjq7BfO=dNYep4z1aS7nUAJU3bktR17gYAufx=kkLudq4dAQ@xxxxxxxxxxxxxx/
>
> Or like this:
> https://lore.kernel.org/all/20260422003412.11678-1-xueyuan.chen21@xxxxxxxxx/