[PATCH v3 0/2] mm/filemap: tighten mmap_miss hit accounting

From: fujunjie

Date: Mon Apr 27 2026 - 21:57:37 EST

This is v3 of the mmap_miss hit-accounting change. v1 was sent as an
RFC. The accounting logic is unchanged from v2, but patch 1 now keeps
the workingset mmap_miss comment near the new accounting block as
Matthew requested.

- patch 1 limits fault-around hit accounting to the faulting address;
- patch 2 stops FAULT_FLAG_TRIED retries from decrementing mmap_miss.

Patch 1 also follows Jan's implementation suggestion: the helper
functions no longer propagate a mmap_miss variable, and
filemap_map_pages() updates file->f_ra.mmap_miss based on whether the
helper mapped the actual faulting address.

mmap_miss is increased when synchronous mmap readahead is needed, and
decreased when filemap_map_pages() maps folios that are already in the
page cache. The decrease side can over-credit hits in two cases:

- fault-around installs nearby PTEs even though the fault only proves
  that the faulting address was accessed;
- after synchronous mmap readahead returns VM_FAULT_RETRY, the retry
  can find the folio brought in by the same miss and immediately
  cancel that miss.
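
As a rough illustration of the two over-credit cases, here is a toy
user-space model, not kernel code. It assumes a miss threshold of 100
(named after MMAP_LOTSAMISS in mm/filemap.c), a fault-around width of
16 base pages, no page-cache eviction, and that every page in the
fault-around range is already cached. simulate() and its parameters are
hypothetical names used only for this sketch.

```python
# Toy model of mmap_miss accounting. This is NOT kernel code and does not
# reproduce the measured numbers; it only shows the shape of the effect.
MMAP_LOTSAMISS = 100      # assumed threshold, named after mm/filemap.c
FAULT_AROUND_PAGES = 16   # assumed fault-around width in base pages


def simulate(accesses, ra_pages, patched):
    """Return the number of pages read for a list of page indexes.

    patched=False: a hit is credited once per fault-around page, and the
                   retry after a synchronous read is credited as a hit.
    patched=True:  a hit is credited once, and retries are not credited.
    """
    cached = set()   # page cache without eviction (simplification)
    mmap_miss = 0
    pages_read = 0
    credit = 1 if patched else FAULT_AROUND_PAGES

    for index in accesses:
        if index in cached:
            # Page-cache hit at the faulting address.
            mmap_miss = max(0, mmap_miss - credit)
            continue

        # Miss: synchronous readahead path.
        mmap_miss += 1
        if mmap_miss > MMAP_LOTSAMISS:
            window = range(index, index + 1)          # single page only
        else:
            start = max(0, index - ra_pages // 2)     # read-around window
            window = range(start, start + ra_pages)
        fresh = [page for page in window if page not in cached]
        pages_read += len(fresh)
        cached.update(fresh)

        if not patched:
            # The retry finds the folio just read and cancels the miss.
            mmap_miss = max(0, mmap_miss - credit)

    return pages_read


# Sparse pattern whose stride is larger than the read-around window.
accesses = [i * 4099 for i in range(200)]
baseline_pages = simulate(accesses, 2048, patched=False)
patched_pages = simulate(accesses, 2048, patched=True)
print(baseline_pages, patched_pages)
```

In this model the baseline never accumulates misses, so every access
reads a full window, while the patched accounting falls back to
single-page reads once the threshold is crossed.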

Current evidence comes from a local KVM/data-disk microbenchmark using
mmap_miss_probe, with an 8 GiB guest, 2 vCPUs, 8192 KiB read_ahead_kb,
cold page cache before each run, 1% of the file accessed, and medians of
3 runs.

mmap_miss_probe mmap()s a prepared file with MADV_NORMAL and then
touches one byte at selected base-page offsets. The access order is
random, sequential, or a fixed page stride. The harness drops caches
before each run and samples /proc/vmstat around that access loop.
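
A minimal sketch of how the three access orders could be generated is
shown below. The source of mmap_miss_probe is not part of this mail, so
page_offsets(), its parameters, the seed handling, and in particular the
wrap-around of the stride pattern at end of file are assumptions made
for illustration only.

```python
import random

PAGE_SIZE = 4096  # base page size assumed for this sketch


def page_offsets(file_bytes, percent, order, stride=1, seed=0):
    """Return byte offsets of the base pages to touch.

    order is "sequential", "stride" or "random"; percent is the share of
    the file's pages to access. Wrapping the stride pattern modulo the
    file size is an assumption, not a documented probe behaviour.
    """
    npages = file_bytes // PAGE_SIZE
    count = npages * percent // 100

    if order == "sequential":
        indexes = list(range(count))
    elif order == "stride":
        indexes = [(i * stride) % npages for i in range(count)]
    elif order == "random":
        indexes = random.Random(seed).sample(range(npages), count)
    else:
        raise ValueError(f"unknown order: {order}")

    return [index * PAGE_SIZE for index in indexes]


# Small example: a 400-page file, 1% of pages, fixed stride of 1021 pages.
example = page_offsets(400 * PAGE_SIZE, 1, "stride", stride=1021)
print([offset // PAGE_SIZE for offset in example])
```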

The 20 GiB case below uses a file larger than the memory of the 8 GiB
guest; no separate memory hog was used. The 4 GiB case uses the same
8 GiB guest, but the file fits in memory.

Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.

Each result is "pgpgin GiB / elapsed seconds". "pgpgin GiB" is the
delta of the guest /proc/vmstat pgpgin counter, converted from KiB to
GiB; it is used here as an approximate block input counter, not as
resident memory or exact application IO. "Elapsed seconds" is the
wall-clock runtime of the whole mmap_miss_probe access pass, not
per-access latency.
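
The "pgpgin GiB" conversion described above can be sketched as follows.
The helper names are hypothetical and this is not the harness itself; it
only assumes the "name value" line layout of /proc/vmstat and that
pgpgin is reported in KiB, as stated above.

```python
def read_pgpgin_kib(vmstat_text):
    """Extract the pgpgin counter (KiB) from /proc/vmstat-style text."""
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "pgpgin":
            return int(value)
    raise KeyError("pgpgin not found")


def pgpgin_delta_gib(before_text, after_text):
    """Return the pgpgin delta between two samples, converted to GiB."""
    delta_kib = read_pgpgin_kib(after_text) - read_pgpgin_kib(before_text)
    return delta_kib / (1024 * 1024)


# Made-up sample text, only to show the conversion.
before = "pgpgin 1000\npgpgout 5\n"
after = "pgpgin 1049576\npgpgout 9\n"
print(pgpgin_delta_gib(before, after))
```

On a live system the two samples would be the contents of /proc/vmstat
read immediately before and after the access loop.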

For the 20 GiB larger-than-memory case:

workload    before                 after
random      223.377 GiB/101.293s   1.010 GiB/4.790s
stride1021  204.214 GiB/97.557s    204.208 GiB/108.086s
stride2053  409.584 GiB/193.700s   0.970 GiB/3.685s
stride4099  406.452 GiB/134.241s   0.975 GiB/3.499s
sequential  0.212 GiB/0.050s       0.212 GiB/0.057s

For the 4 GiB fit-in-memory case:

workload    before             after
random      3.987 GiB/1.960s   0.980 GiB/1.221s
stride1021  4.002 GiB/1.838s   4.002 GiB/1.851s
stride2053  3.991 GiB/1.835s   0.811 GiB/0.985s
stride4099  4.001 GiB/1.836s   0.819 GiB/1.037s
sequential  0.056 GiB/0.013s   0.056 GiB/0.018s

The 20 GiB setup also has an ablation. P1 is only the faulting-address
hit accounting change. P2-only is only the FAULT_FLAG_TRIED retry
filter. P1+P2 is the combined accounting change:

workload    variant   result
random      baseline  223.377 GiB/101.293s
random      P1        223.268 GiB/98.481s
random      P2-only   223.257 GiB/100.091s
random      P1+P2     1.010 GiB/4.790s
stride2053  baseline  409.584 GiB/193.700s
stride2053  P1        409.584 GiB/197.645s
stride2053  P2-only   15.722 GiB/5.485s
stride2053  P1+P2     0.970 GiB/3.685s
sequential  baseline  0.212 GiB/0.050s
sequential  P1        0.212 GiB/0.046s
sequential  P2-only   0.212 GiB/0.050s
sequential  P1+P2     0.212 GiB/0.057s

After the v2 implementation refactor, only the final P1+P2 shape was
rerun in the same setup. The numbers stayed in line with the v1 P1+P2
rows above:

workload    larger-than-memory case   fit-in-memory case
            20 GiB file, 1% access    4 GiB file, 1% access
random      1.010 GiB/4.383s          0.980 GiB/1.088s
stride1021  204.216 GiB/105.601s      4.001 GiB/1.783s
stride2053  0.970 GiB/3.760s          0.810 GiB/0.908s
stride4099  0.975 GiB/3.410s          0.818 GiB/0.870s
sequential  0.212 GiB/0.060s          0.056 GiB/0.016s

This does not claim to solve every sparse pattern. The stride1021 rows
are intentionally shown as a boundary: with 8192 KiB read_ahead_kb,
file->f_ra.ra_pages is 2048 base pages, and synchronous mmap
read-around uses a 2048-page window centered around the fault, roughly
[index - 1024, index + 1023]. stride1021 is 1021 * 4 KiB = 4084 KiB,
so the next access lands inside the previous read-around window. About
every other access can then be a real faulting-address page-cache hit,
while each access in the other half can read about 8 MiB. With about
52k accesses in the 20 GiB/1% run, roughly 26k misses times 8 MiB is
about 205 GiB, in line with the observed 204 GiB.
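
The stride1021 estimate can be reproduced with a few lines of
arithmetic. The variable names are hypothetical, and the "every other
access misses" split is the approximation described above rather than a
measured ratio.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

file_bytes = 20 * GIB
page_bytes = 4 * KIB
read_ahead_bytes = 8192 * KIB      # read_ahead_kb = 8192
stride_pages = 1021

# 8192 KiB of 4 KiB pages gives the 2048-page read-around window.
ra_pages = read_ahead_bytes // page_bytes
# The window is roughly [index - 1024, index + 1023].
window_ahead = ra_pages // 2 - 1

# 1% of the file's base pages are touched.
accesses = (file_bytes // page_bytes) // 100
# Approximation: every other access misses and reads one full window.
misses = accesses // 2
estimated_gib = misses * read_ahead_bytes / GIB

print(ra_pages, window_ahead, accesses, round(estimated_gib, 1))
```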

---
v3:
- move the workingset mmap_miss comment to the new accounting block in
  filemap_map_pages().
- no new performance run; v3 only moves a comment and does not change
  executable code from v2.

v2: https://lore.kernel.org/r/tencent_4EDE373816615C46CFD48A6EF3B61E232308@xxxxxx
v1: https://lore.kernel.org/r/tencent_3F158B17AE85E73945C5F97D8F8A918F9B07@xxxxxx

v2 changes:
- split the original patch into two patches;
- move mmap_miss updating back into filemap_map_pages();
- drop the mmap_miss argument from filemap_map_order0_folio() and
  filemap_map_folio_range().

fujunjie (2):
  mm/filemap: count only the faulting address as a mmap hit
  mm/filemap: do not count FAULT_FLAG_TRIED retries as mmap hits

mm/filemap.c | 63 ++++++++++++++++++++++++++++++------------------------------
1 file changed, 32 insertions(+), 31 deletions(-)

base-commit: 1b55f8358e35a67bf3969339ea7b86988af92f66
--
2.34.1