Re: [PATCH v5 0/5] KSM: performance optimizations for rmap_walk_ksm

From: Andrew Morton

Date: Tue May 19 2026 - 13:04:49 EST


On Tue, 19 May 2026 22:05:36 +0800 (CST) <xu.xin16@xxxxxxxxxx> wrote:

> From: xu xin <xu.xin16@xxxxxxxxxx>
>
> When available memory is extremely tight and KSM pages are swapped
> out, or when memory is heavily fragmented and THP triggers memory
> compaction, the kernel invokes rmap_walk_ksm() to perform reverse
> mapping. However, we observed that this function becomes
> particularly time-consuming when a large number of VMAs (e.g., 20,000)
> share the same anon_vma. Debug tracing showed that most of the
> latency is spent in anon_vma_interval_tree_foreach, which leads to an
> excessively long hold time on the anon_vma lock (500ms or more);
> upper-layer applications waiting for that lock are then blocked for
> extended periods.
>
> This series fixes a severe KSM reverse-mapping performance problem
> that can freeze applications for hundreds of milliseconds under
> memory pressure, especially when many unrelated VMAs share a
> single anon_vma.

That would be good to fix.

> Two key highlights:
>
> 1. Lock hold time drops from >500ms to <2ms
> - In our benchmark (20,000 VMAs sharing an anon_vma), the worst-case
> anon_vma lock hold time during a KSM rmap walk went from 705ms
> down to 1.67ms, with an average of 1.44ms.

How real-worldish is that benchmark?

How much effect are our users likely to see from this patchset in their
real-world workloads?

> 2. Real user impact
> - The anon_vma lock is also acquired by page faults, reclaim,
> migration, compaction, mlock, exit_mmap, and cgroup accounting.
>
> - A long hold due to inefficient rmap walks stalls application
> threads, causing latency spikes, reduced throughput, or even
> container timeouts.
>
> - The problem occurs even without fork() – VMA splitting (e.g.,
> via mprotect or madvise over time) can create tens of thousands
> of VMAs all attached to the same anon_vma.
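A minimal userspace sketch of the VMA-splitting scenario above, assuming Linux and glibc: it faults in a private anonymous mapping (so an anon_vma gets attached), then mprotect()s every other page read-only, turning one VMA into one VMA per page, and counts the result via /proc/self/maps. The helper names split_into_vmas() and count_vmas_in_range() are hypothetical and are not part of the series' selftests or benchmark. That all pieces keep pointing at the same anon_vma follows from how the kernel clones the anon_vma on a VMA split; userspace cannot observe it directly.

```c
#define _GNU_SOURCE		/* getline(), MAP_ANONYMOUS */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count VMAs in /proc/self/maps overlapping [lo, hi); -1 on error. */
static long count_vmas_in_range(unsigned long lo, unsigned long hi)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char *line = NULL;
	size_t cap = 0;
	long n = 0;

	if (!f)
		return -1;
	while (getline(&line, &cap, f) != -1) {
		unsigned long start, end;

		/* Each line starts with "start-end" in hex. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
		    start < hi && end > lo)
			n++;
	}
	free(line);
	fclose(f);
	return n;
}

/*
 * Hypothetical helper: map npages of private anonymous memory, fault
 * every page in (this allocates the anon_vma), then make every other
 * page read-only. Alternating protections prevent re-merging, so each
 * page ends up in its own VMA. Returns the mapping, or NULL on failure.
 */
static char *split_into_vmas(size_t npages)
{
	size_t pg = (size_t)sysconf(_SC_PAGESIZE);
	char *p;

	assert(npages > 0);
	p = mmap(NULL, npages * pg, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	memset(p, 1, npages * pg);	/* fault in every page */
	for (size_t i = 1; i < npages; i += 2) {
		if (mprotect(p + i * pg, pg, PROT_READ)) {
			munmap(p, npages * pg);
			return NULL;
		}
	}
	return p;
}
```

Calling split_into_vmas(20000) would produce a VMA count comparable to the benchmark above from a single process without fork(), as long as it stays below vm.max_map_count (65530 by default).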
>
> ...
>
> Changes in v5:
> - Patch 1: replaced local_clock() with tracepoints – no overhead
> when tracepoints are disabled.

Thanks for that change.

> - Patch 3: switched from vm_pgoff (unstable after VMA split) to a
> linear page offset.
> - Patch 4: adapted to the linear page offset; added user-impact
> description (real workloads, lock contention examples,
> VMA splitting scenario).
> - Patch 5: simplified to a single process with 32 pages (instead
> of multi-process), as suggested by David.
>
> MAINTAINERS | 3 +

I don't recall seeing any discussion about you becoming an rmap
M:aintainer, perhaps I missed it. Thanks for the interest, but it
probably would be better to propose this as a standalone patch,
separated from this series.


> include/trace/events/rmap.h | 73 ++++
> mm/ksm.c | 48 ++-
> mm/rmap.c | 9 +
> tools/testing/rmap/Makefile | 11 +
> tools/testing/rmap/rmap_benchmark.c | 529 +++++++++++++++++++++++++++
> tools/testing/selftests/mm/rmap.c | 76 ++++
> tools/testing/selftests/mm/vm_util.c | 38 ++
> tools/testing/selftests/mm/vm_util.h | 2 +
> 9 files changed, 781 insertions(+), 8 deletions(-)
> create mode 100644 include/trace/events/rmap.h
> create mode 100644 tools/testing/rmap/Makefile
> create mode 100644 tools/testing/rmap/rmap_benchmark.c

AI review was only partial, for unclear reasons:

https://sashiko.dev/#/patchset/20260519220536792dMIKRMurt3vZ5lXC5pwh8@xxxxxxxxxx

Please take a look, see if there's anything useful there?