Re: Re: Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12

From: Chengfeng Lin

Date: Fri May 22 2026 - 05:15:44 EST

Hi,

I followed up on the MADV_PAGEOUT/no-swap report with local and lab
ftrace/smaps checks.

First, to correct the original report: with no swap configured, this should
not be described as a real pageout/refault workload. A more accurate
description is a MADV_PAGEOUT anon/THP no-swap reclaim-failure path. The
later write-touch is only part of the workload iteration, not evidence that
the pages were actually paged out and faulted back in.

Also, cycle_ns_per_page is wall-clock ns/page for one full workload
iteration. It is not a CPU-cycle counter.
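
As a minimal sketch of that definition (not the workload's actual source): the helper names `ns_per_page` and `time_iteration` are hypothetical, and dividing by 4096-byte base pages is an assumption about how the metric is normalized.

```python
import time


def ns_per_page(elapsed_ns: int, length_bytes: int, page_size: int = 4096) -> float:
    """Wall-clock nanoseconds per page for one full workload iteration.

    page_size defaults to 4096 bytes; that default is an assumption here,
    not something taken from the workload source.
    """
    pages = length_bytes // page_size
    if pages <= 0:
        raise ValueError("mapping must cover at least one page")
    return elapsed_ns / pages


def time_iteration(iteration, length_bytes: int, page_size: int = 4096) -> float:
    """Run one iteration (any callable) and return its wall-clock ns/page."""
    start = time.perf_counter_ns()  # monotonic wall-clock timer, not CPU cycles
    iteration()
    elapsed = time.perf_counter_ns() - start
    return ns_per_page(elapsed, length_bytes, page_size)
```

With a 16 MiB mapping this divides the elapsed wall-clock time by 4096 base pages, so the result is a time per page and does not count CPU cycles.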

The new lab attribution checks covered:

- QEMU_SMP=1/2/4 with QEMU_MEM_MB=14336
- QEMU_SMP=8 with QEMU_MEM_MB=16384
- QEMU_SMP=16 with QEMU_MEM_MB=32768
- THP modes: default, hugepage, nohugepage

These are ftrace/smaps attribution runs, not clean timing runs.

The important result is that, in the default and hugepage modes, v6.12.77 and
v6.19.9 were not running with the same actual THP backing state:

- v6.12.77 had AnonHugePages=0 kB
- v6.19.9 had AnonHugePages=16384 kB
- v6.19.9 hit split_folio_to_list()
- v6.12.77 did not hit split_folio_to_list()
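
The AnonHugePages values above come from smaps. A minimal sketch of how such a value can be extracted is shown here; the function names are hypothetical, and it assumes the usual `AnonHugePages: <n> kB` line format of /proc/<pid>/smaps.

```python
import re

# Matches lines such as "AnonHugePages:     16384 kB" in smaps output.
_ANON_HUGE = re.compile(r"^AnonHugePages:\s+(\d+)\s+kB\s*$", re.MULTILINE)


def anon_huge_kb(smaps_text: str) -> int:
    """Sum AnonHugePages (in kB) over every mapping in an smaps dump."""
    return sum(int(value) for value in _ANON_HUGE.findall(smaps_text))


def read_smaps(pid: str = "self") -> str:
    """Read /proc/<pid>/smaps (Linux only); pid may be "self" or a number."""
    with open(f"/proc/{pid}/smaps", "r", encoding="ascii", errors="replace") as f:
        return f.read()
```

Calling `anon_huge_kb(read_smaps())` right before the madvise() step is one way to record the actual THP backing state instead of inferring it from the configured THP mode.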

In the nohugepage case, both kernels had AnonHugePages=0 kB and neither hit
split_folio_to_list(), but the timing advantage of v6.12.77 was not stable
there.
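
The same-state check described above can be sketched as follows. This is illustrative only: the helper names are hypothetical, and the parser assumes function-tracer text output in which each entry looks like `<timestamp>: <function> <-<caller>`.

```python
import re


def count_function_hits(trace_text: str, func: str) -> int:
    """Count function-tracer entries for `func` in ftrace text output.

    Header lines starting with '#' are skipped. The name must be followed
    by whitespace and '<-', so longer names sharing the prefix do not match.
    """
    pattern = re.compile(r":\s+" + re.escape(func) + r"\s+<-")
    return sum(
        1
        for line in trace_text.splitlines()
        if not line.startswith("#") and pattern.search(line)
    )


def same_thp_state(a_anon_huge_kb: int, a_split_hits: int,
                   b_anon_huge_kb: int, b_split_hits: int) -> bool:
    """True if two runs had equal AnonHugePages and agree on whether a split occurred."""
    return (a_anon_huge_kb == b_anon_huge_kb
            and (a_split_hits > 0) == (b_split_hits > 0))
```

With the values reported above, the default and hugepage runs give `same_thp_state(0, 0, 16384, n)` with n > 0, which is False, while the nohugepage runs give `same_thp_state(0, 0, 0, 0)`, which is True.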

So I no longer think the original timing table should be treated as a
same-state THP performance regression. The more accurate conclusion is that
the test exposed a no-swap MADV_PAGEOUT anon/THP reclaim split/failure-path
case in v6.19.9, but the compared kernels were not operating on the same
actual THP backing state.

The cleaned-up evidence notes are here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/madvise-pageout-thp-noswap-refault

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/madvise-pageout-thp-noswap-refault/attribution

If this path is still interesting as an optimization topic, I can keep looking
at whether the no-swap THP split/failure path can fail faster or skip the split.
But I do not want to overstate the original report as a same-state regression.

Thanks,
Chengfeng

> -----Original Message-----
> From: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>
> Sent:Tuesday, 05/19/2026 15:28:00
> To: "Kairui Song" <ryncsn@xxxxxxxxx>
> Cc: "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx, "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "David Hildenbrand" <david@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx, ljs@xxxxxxxxxx
> Subject: Re: Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
>
> Hi,
>
> Thanks for looking at this. Your comments make sense, and sorry for the
> confusing wording in the original report.
>
> You are right that, with no swap configured, this should not be described
> as a real pageout/refault case. A more accurate description is a
> MADV_PAGEOUT anon/THP no-swap reclaim-failure path. The later write-touch
> is just part of the workload iteration; it should not be presented as
> evidence that the pages were actually paged out and faulted back in.
>
> Also, cycle_ns_per_page is poorly named. Here "cycle" means one full
> workload iteration, not CPU cycles. The metric is wall-clock nanoseconds
> per page for one complete workload iteration. I will make that clearer in
> future descriptions.
>
> I agree that the end-to-end timing by itself is not enough to show where
> the time is spent. I will collect a perf or ftrace breakdown comparing
> v6.12.77 and v6.19.9, and follow up with the results on this thread.
>
> Thanks,
> Chengfeng
>
>
>
> > -----Original Message-----
> > From: "Kairui Song" <ryncsn@xxxxxxxxx>
> > Sent:Tuesday, 05/19/2026 02:14:41
> > To: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>
> > Cc: "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx, "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "Lorenzo Stoakes" <lorenzo.stoakes@xxxxxxxxxx>, "David Hildenbrand" <david@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx
> > Subject: Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
> >
> > On Mon, May 18, 2026 at 9:09 PM Chengfeng Lin
> > <23020251154299@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > I would like to report a userspace-visible performance regression in a
> > > MADV_PAGEOUT workload.
> >
> > Hi Chengfeng,
> >
> > Thanks for the report. Very interesting, but it looks a bit confusing.
> > You are doing PAGEOUT of anon pages without swap setup, so nothing is
> > actually being paged out?
> >
> > > The workload is intentionally narrow:
> > >
> > > - map 16 MiB anonymous memory
> > > - use the default THP policy
> > > - run in a guest with no configured swap
> > > - call madvise(MADV_PAGEOUT)
> > > - refault/write-touch the mapping
> > >
> > > This is not meant as a generic madvise() or generic MADV_PAGEOUT
> > > regression report. The signal is currently scoped to the THP + no-swap +
> > > refault/write-touch workflow above.
> > >
> > > The current public evidence bundle is here:
> > >
> > > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault
> > >
> > > The standalone workload source is here:
> > >
> > > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/workload
> > >
> > > The formal experiment profile is here:
> > >
> > > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/experiments
> > >
> > > The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> > > configuration, using QEMU direct boot. The formal performance runs were
> > > clean timing runs with coverage disabled. Coverage was collected
> > > separately and is not used for the timing numbers below.
> > >
> > > Lab environment:
> > >
> > > host label: lcf
> > > host kernel: Linux 6.14.0-37-generic x86_64
> > > QEMU: qemu-system-x86_64 8.2.2
> > > container/cgroup CPU set: 0,2,4,6,8,10,12,14
> > > container/cgroup memory limit: 16106127360 bytes
> > > guest memory: QEMU_MEM_MB=14336
> > > guest CPUs: QEMU_SMP=1/2/4
> > > repetitions: 9
> > > version order: interleaved
> > > performance coverage_enabled: false
> > >
> > > Primary result, cycle_ns_per_page, lower is better:
> >
> > What is cycle_ns_per_page? The name is ambiguous. "cycle" reads like
> > CPU cycles but the value doesn't match. Or you mean iteration? Can you
> > make this clearer?
> >
> > > CPU  v6.12.77  v6.19.9  old-lower-vs-new  v6.19/v6.12
> > >   1    1900.3   3304.7             42.5%        1.74x
> > >   2    2107.7   3583.2             41.2%        1.70x
> > >   4    2154.2   3690.9             41.6%        1.71x
> > >
> > > MADV_PAGEOUT syscall/reclaim-side segment, advise_ns_per_page, lower is
> > > better:
> > >
> > > CPU  v6.12.77  v6.19.9  old-lower-vs-new  v6.19/v6.12
> > >   1    1713.2   2922.7             41.4%        1.71x
> > >   2    1924.7   3162.9             39.1%        1.64x
> > >   4    1953.1   3284.2             40.5%        1.68x
> > >
> > > The current mechanism interpretation is that the timing difference is in
> > > the MADV_PAGEOUT/reclaim part, not primarily in the later refault touch.
> >
> > Right, with no swap configured, the post-advise should not be faulting
> > at all I think? It will just split the pages IIUC, however, that might
> > cause more TLB pressure. But this split-on-alloc-failure path itself
> > wasn't changed from 6.12 to 6.18. If there is no swap, we can skip the
> > split though, and maybe make folio_alloc_swap fail a bit faster.
> > Skipping the split seems more meaningful and could significantly speed
> > up this specific test, if this is necessary.
> >
> > > The path evidence points at the no-swap reclaim/swap-allocation-failure
> > > chain:
> > >
> > > madvise(MADV_PAGEOUT)
> > >   -> reclaim_pages()
> > >     -> shrink_folio_list()
> > >       -> folio_alloc_swap()
> > >         -> swap allocation failure path
> >
> > What you posted, including that repo, seems to have only end-to-end timing.
> > It would be much more useful if you collected a perf or ftrace breakdown,
> > something like:
> > https://lore.kernel.org/all/CAMgjq7BfO=dNYep4z1aS7nUAJU3bktR17gYAufx=kkLudq4dAQ@xxxxxxxxxxxxxx/
> >
> > Or like this:
> > https://lore.kernel.org/all/20260422003412.11678-1-xueyuan.chen21@xxxxxxxxx/