[PATCH v5 0/5] KSM: performance optimizations for rmap_walk_ksm
From: xu.xin16
Date: Tue May 19 2026 - 10:14:41 EST
From: xu xin <xu.xin16@xxxxxxxxxx>
When available memory is extremely tight, causing KSM pages to be swapped
out, or when there is significant memory fragmentation and THP triggers
memory compaction, the system will invoke the rmap_walk_ksm function to
perform reverse mapping. However, we observed that this function becomes
particularly time-consuming when a large number of VMAs (e.g., 20,000)
share the same anon_vma. Through debug trace analysis, we found that most
of the latency occurs within anon_vma_interval_tree_foreach, leading to an
excessively long hold time on the anon_vma lock (even reaching 500ms or
more), which in turn causes upper-layer applications (waiting for the
anon_vma lock) to be blocked for extended periods.
This series fixes a severe KSM reverse-mapping performance problem
that can freeze applications for hundreds of milliseconds under
memory pressure, especially when a large number of unrelated VMAs
share a single anon_vma.
Two key highlights:
1. Lock hold time drops from >500ms to <2ms
- In our benchmark (20,000 VMAs sharing an anon_vma), worst-case
anon_vma lock hold time during KSM rmap walk went from 705ms
down to 1.67ms (max) and 1.44ms (avg).
2. Real user impact
- The anon_vma lock is also acquired by page faults, reclaim,
migration, compaction, mlock, exit_mmap, and cgroup accounting.
- A long hold due to inefficient rmap walks stalls application
threads, causing latency spikes, reduced throughput, or even
container timeouts.
- The problem occurs even without fork() – VMA splitting (e.g.,
via mprotect or madvise over time) can create tens of thousands
of VMAs all attached to the same anon_vma.
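The VMA-splitting scenario above can be reproduced from user space. The
following is a minimal sketch, not part of this series: it maps anonymous
memory, faults it in, and then makes every other page read-only with
mprotect(), which forces the kernel to split the mapping. The helper names
count_vmas() and split_by_mprotect() are hypothetical, and the sketch assumes
a Linux system with /proc/self/maps available. That the resulting VMAs stay
attached to the original anon_vma is the behaviour described above; the
program only observes the VMA count.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count this process's mappings: one line per VMA in /proc/self/maps. */
static long count_vmas(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	long n = 0;
	int c;

	if (!f)
		return -1;
	while ((c = fgetc(f)) != EOF)
		if (c == '\n')
			n++;
	fclose(f);
	return n;
}

/*
 * Map npages of anonymous memory, fault them in, then make every other
 * page read-only. Returns the growth in VMA count, or -1 on error.
 */
static long split_by_mprotect(size_t npages)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)psz;
	long before, after;
	size_t i;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Touch every page so the mapping has anonymous pages behind it. */
	memset(p, 1, len);
	before = count_vmas();

	/* Alternating protections prevent neighbouring VMAs from merging. */
	for (i = 1; i < npages; i += 2) {
		if (mprotect(p + i * (size_t)psz, (size_t)psz, PROT_READ)) {
			munmap(p, len);
			return -1;
		}
	}
	after = count_vmas();
	munmap(p, len);

	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

Calling split_by_mprotect(64) is expected to report roughly one extra VMA
per page boundary. Scaling npages up (subject to vm.max_map_count) approaches
the tens of thousands of VMAs discussed above, without any fork().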
Patch summary:
==============
patch 1/5: mm/rmap: add tracepoint for rmap_walk
- Zero overhead when disabled; enables offline latency analysis.
patch 2/5: tools/testing: add rmap benchmark
- Measures KSM/anon/file rmap walks.
patch 3/5: ksm: add pgoff into ksm_rmap_item
- Stores linear page offset (not vm_pgoff) using a union.
- Cleared on failure paths including break_cow().
patch 4/5: ksm: optimize rmap_walk_ksm by passing a suitable range
- Uses stored pgoff to narrow interval tree search.
- Reduces iterations from >22k to ~3; lock hold 705ms -> 1.67ms.
- Includes detailed user-impact description (suggested by Andrew).
patch 5/5: ksm: add mremap selftests for ksm_rmap_walk
- Single-process, 32 pages; covers mremap + KSM + migration.
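The idea behind patches 3/5 and 4/5 can be illustrated with a small
user-space model. This is a sketch under stated assumptions, not the code
from the series: struct vma_sketch, SKETCH_PAGE_SHIFT and all helper names
are hypothetical, a linear scan stands in for the anon_vma interval tree, and
the unoptimized walk is modelled as a query over [0, ULONG_MAX]. The linear
page offset is computed as vm_pgoff plus the page distance from vm_start,
which stays the same when a VMA is split.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Hypothetical stand-in for the fields of a VMA that matter here. */
struct vma_sketch {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page offset of vm_start */
};

/* Linear page offset of addr inside vma; unchanged by VMA splits. */
static unsigned long linear_pgoff(const struct vma_sketch *vma,
				  unsigned long addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return vma->vm_pgoff +
	       ((addr - vma->vm_start) >> SKETCH_PAGE_SHIFT);
}

/* Page offset of the last page covered by vma. */
static unsigned long vma_last_pgoff(const struct vma_sketch *vma)
{
	return vma->vm_pgoff +
	       ((vma->vm_end - vma->vm_start) >> SKETCH_PAGE_SHIFT) - 1;
}

/*
 * Count the VMAs whose page-offset interval overlaps [first, last].
 * The result models how many entries an interval-tree query over that
 * range would hand to the rmap walk.
 */
static size_t count_visited(const struct vma_sketch *vmas, size_t n,
			    unsigned long first, unsigned long last)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (vmas[i].vm_pgoff <= last &&
		    vma_last_pgoff(&vmas[i]) >= first)
			hits++;
	return hits;
}

/* Build n back-to-back single-page VMAs, as after heavy splitting. */
static void fill_split_vmas(struct vma_sketch *vmas, size_t n,
			    unsigned long base)
{
	size_t i;

	for (i = 0; i < n; i++) {
		vmas[i].vm_start = base + (i << SKETCH_PAGE_SHIFT);
		vmas[i].vm_end = vmas[i].vm_start + (1UL << SKETCH_PAGE_SHIFT);
		vmas[i].vm_pgoff = vmas[i].vm_start >> SKETCH_PAGE_SHIFT;
	}
}
```

In this model, a query over [0, ~0UL] visits every VMA, while a query over
[pgoff, pgoff] for a single order-0 page visits only the VMAs that actually
cover that offset. The iteration counts reported above (>22k down to ~3) come
from the real benchmark, not from this sketch.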
---
Changes in v5:
- Patch 1: replaced local_clock() with tracepoints – no overhead
when tracepoints are disabled.
- Patch 3: switched from vm_pgoff (unstable after VMA split) to a
linear page offset.
- Patch 4: adapted to the linear page offset; added user-impact
description (real workloads, lock contention examples,
VMA splitting scenario).
- Patch 5: simplified to a single process with 32 pages (instead
of multi-process), as suggested by David.
Changes in v4:
- Add a tracepoint for rmap_walk
- Provide a testbench for rmap_walk
- Add vm_pgoff field in ksm_rmap_item
- Use vm_pgoff instead of address >> PAGE_SHIFT (Suggested by David and Lorenzo)
Changes in v3:
- Fix some typos in commit description
- Replace 'pgoff_start' and 'pgoff_end' with 'pgoff'.
Changes in v2:
- Use const variables to initialize 'addr', 'pgoff_start' and 'pgoff_end'
- Let pgoff_end = pgoff_start, since KSM folios are always order-0 (Suggested by David)
xu xin (5):
mm/rmap: add tracepoint for rmap_walk
tools/testing: add rmap walk latency benchmark for KSM, anonymous and
file pages
ksm: add pgoff into ksm_rmap_item
ksm: Optimize rmap_walk_ksm by passing a suitable address range
ksm: add mremap selftests for ksm_rmap_walk
MAINTAINERS | 3 +
include/trace/events/rmap.h | 73 ++++
mm/ksm.c | 48 ++-
mm/rmap.c | 9 +
tools/testing/rmap/Makefile | 11 +
tools/testing/rmap/rmap_benchmark.c | 529 +++++++++++++++++++++++++++
tools/testing/selftests/mm/rmap.c | 76 ++++
tools/testing/selftests/mm/vm_util.c | 38 ++
tools/testing/selftests/mm/vm_util.h | 2 +
9 files changed, 781 insertions(+), 8 deletions(-)
create mode 100644 include/trace/events/rmap.h
create mode 100644 tools/testing/rmap/Makefile
create mode 100644 tools/testing/rmap/rmap_benchmark.c
--
2.25.1