[PATCH v2] mm/mglru: use folio_mark_accessed to replace folio_set_active
From: Barry Song (Xiaomi)
Date: Mon May 25 2026 - 08:32:56 EST
MGLRU gives high priority to folios mapped in page tables. As a result,
folio_set_active() is invoked for all folios read during page faults. In
practice, however, readahead can bring in many folios that are never
accessed via page tables.
A previous attempt by Lei Liu proposed introducing a separate LRU for
readahead[1] to make readahead pages easier to reclaim, but that approach
is likely over-engineered.
Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"),
folios with PG_active were always placed in the youngest generation,
leading to over-protection and increased refaults. After that commit,
PG_active folios are placed in the second youngest generation, which is
still too optimistic given the presence of readahead. In contrast, the
classic active/inactive scheme is more conservative.
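As a rough illustration of the placement described above, the sketch
below models the final line of lru_gen_folio_seq() in userspace.
MIN_NR_GENS and MAX_NR_GENS use the values from include/linux/mmzone.h
(2 and 4); model_seq() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Values taken from include/linux/mmzone.h; treat them as assumptions. */
#define MIN_NR_GENS 2UL
#define MAX_NR_GENS 4UL

/*
 * Hypothetical userspace model of the last line of lru_gen_folio_seq():
 *   max(max_seq - gen + 1, min_seq)
 * gen == 1 selects the youngest generation; larger values select older
 * generations, clamped so the result never falls below min_seq.
 */
static unsigned long model_seq(unsigned long max_seq, unsigned long min_seq,
			       unsigned long gen)
{
	unsigned long seq;

	assert(gen >= 1 && gen <= MAX_NR_GENS);
	assert(max_seq + 1 >= gen);	/* avoid unsigned wrap-around */

	seq = max_seq - gen + 1;
	return seq > min_seq ? seq : min_seq;
}
```

With max_seq = 10 and min_seq = 7, gen = MIN_NR_GENS yields 9 (the second
youngest generation) and gen = MAX_NR_GENS yields 7 (the oldest).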
This patch switches to folio_mark_accessed(). If
folio_check_references() later detects referenced PTEs, the folio
is promoted based on the referenced flag set by
folio_mark_accessed(). WORKINGSET_ACTIVATE accounting and
lru_gen_folio_seq() are adjusted accordingly; for example, anon
folios are no longer protected unconditionally.
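The fault-path decision this patch adds to folio_add_lru() can be
sketched in userspace as follows. struct model_folio and
model_add_lru_in_fault() are hypothetical names, and the sketch assumes
that the first folio_mark_accessed() call on an MGLRU folio only sets
PG_referenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the folio flags touched by this patch. */
struct model_folio {
	bool workingset;	/* PG_workingset */
	bool referenced;	/* PG_referenced */
	bool active;		/* PG_active */
};

/*
 * Models the branch taken in folio_add_lru() when lru_gen_in_fault()
 * is true. Assumption: folio_mark_accessed() on a folio that is not
 * yet referenced only sets the referenced flag.
 */
static void model_add_lru_in_fault(struct model_folio *f)
{
	assert(f != NULL);

	if (f->workingset)
		f->active = true;	/* refaulted working set: activate */
	else if (!f->referenced)
		f->referenced = true;	/* stands in for folio_mark_accessed() */
}
```

A freshly faulted folio therefore ends up referenced but not active,
while a folio already marked workingset is activated directly.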
The following uses a simple model to demonstrate why the current code is
not ideal. It runs fio-3.42 in a memcg, reading a file in a strided
pattern—4KB every 64KB—to simulate prefaulted pages that may not be
accessed.
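To put the strided pattern in numbers, the sketch below estimates how
much of the file is actually read. It assumes fio touches only the first
zonesize bytes of each zonerange window; strided_touched_bytes() is a
hypothetical helper written for this illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: bytes actually read by a strided job that reads
 * `zonesize` bytes at the start of every `zonerange`-byte window.
 * Only whole windows are counted.
 */
static unsigned long long strided_touched_bytes(unsigned long long file_size,
						unsigned long long zonesize,
						unsigned long long zonerange)
{
	assert(zonerange > 0 && zonesize <= zonerange);
	return file_size / zonerange * zonesize;
}
```

For a 600M file with zonesize=4K and zonerange=64K this gives 39321600
bytes (37.5 MiB), well below the 400M memcg limit. Under that
assumption, the pages that are really accessed fit in memory as long as
their untouched neighbours are not protected.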
#!/bin/bash
CG_NAME="mglru_verify_test"
CG_PATH="/sys/fs/cgroup/$CG_NAME"
MEM_LIMIT="400M"
HOT_SIZE="600M"
# 1. Environment Setup
sudo rmdir "$CG_PATH" 2>/dev/null
sudo mkdir -p "$CG_PATH"
sudo chown -R $USER:$USER "$CG_PATH"
echo "$MEM_LIMIT" > "$CG_PATH/memory.max"
# 2. Prepare Data Files
dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
# 3. Start Workload (Working Set)
(
echo $BASHPID > "$CG_PATH/cgroup.procs"
exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
--zonemode=strided --zonesize=4K --zonerange=64K \
--time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
--fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
) &
WORKLOAD_PID=$!
# 4. Waiting for hot data to warm up
sleep 30
BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
# 5. Run the workload for 60 seconds
sleep 60
# 6. Report refault and IO bandwidth
FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
echo "File Refault Delta is $FINAL_D_FILE"
kill $WORKLOAD_PID 2>/dev/null
sleep 2
grep -E "READ|WRITE" fio.stats \
| awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
rm -f hot_data.bin fio.stats
Without the patch, we observed 12883855 file refaults and a very low
bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy hot
positions, continuously pushing out the real working set and causing
incorrect reclaim. With the patch, we observed 0 refaults and bandwidth
increased to 5078 MiB/s.
Running the same test on x86:
w/o patch:
File Refault Delta is 3240029
READ: bw=13.2MiB/s io=1183MiB
w/ patch:
File Refault Delta is 0
READ: bw=7708MiB/s io=676GiB
On x86, a kernel build inside a memcg with a 1GB memory limit, using
20 threads, shows:
w/o patch:
real 1m50.764s
user 25m32.305s
sys 4m0.012s
pswpin: 1333245
pswpout: 4366443
pgpgin: 6962592
pgpgout: 17780712
swpout_zero: 1019603
swpin_zero: 14764
refault_file: 287794
refault_anon: 1347963
w/ patch:
real 1m48.915s
user 25m31.261s
sys 3m43.685s
pswpin: 915629
pswpout: 3207173
pgpgin: 5249268
pgpgout: 13154492
swpout_zero: 816100
swpin_zero: 15676
refault_file: 257271
refault_anon: 931259
active/inactive LRU:
real 1m49.928s
user 25m28.196s
sys 3m40.740s
pswpin: 463452
pswpout: 2309119
pgpgin: 4438856
pgpgout: 9568628
swpout_zero: 743704
swpin_zero: 7244
refault_file: 562555
refault_anon: 470694
Lance and Xueyuan made a huge contribution to this patch through testing.
[1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@xxxxxxxx/
Signed-off-by: Barry Song (Xiaomi) <baohua@xxxxxxxxxx>
Tested-by: Lance Yang <lance.yang@xxxxxxxxx>
Tested-by: Xueyuan Chen <xueyuan.chen21@xxxxxxxxx>
Cc: Pedro Falcato <pfalcato@xxxxxxx>
Cc: Kairui Song <kasong@xxxxxxxxxxx>
Cc: Qi Zheng <qi.zheng@xxxxxxxxx>
Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
Cc: wangzicheng <wangzicheng@xxxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Lei Liu <liulei.rjpt@xxxxxxxx>
Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
Cc: Yuanchu Xie <yuanchu@xxxxxxxxxx>
Cc: Wei Xu <weixugc@xxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
---
-v2:
* Fix WORKINGSET_ACTIVATE accounting: workingset folios are now set
  active during refault;
* Avoid unconditionally protecting anon folios in lru_gen_folio_seq();
* Also adjust how the workingset flag is set in folio_check_references().
-v1:
https://lore.kernel.org/linux-mm/20260418120233.7162-1-baohua@xxxxxxxxxx/
-rfc was:
[PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@xxxxxxxxx/
include/linux/mm_inline.h | 7 +++----
mm/swap.c | 8 ++++++--
mm/vmscan.c | 3 ++-
mm/workingset.c | 8 ++++----
4 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index a171070e15f0..c637e679a450 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -242,12 +242,11 @@ static inline unsigned long lru_gen_folio_seq(const struct lruvec *lruvec,
gen = MIN_NR_GENS - folio_test_workingset(folio);
else if (reclaiming)
gen = MAX_NR_GENS;
- else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
- (folio_test_reclaim(folio) &&
- (folio_test_dirty(folio) || folio_test_writeback(folio))))
+ else if (folio_test_reclaim(folio) &&
+ (folio_test_dirty(folio) || folio_test_writeback(folio)))
gen = MIN_NR_GENS;
else
- gen = MAX_NR_GENS - folio_test_workingset(folio);
+ gen = MAX_NR_GENS - (folio_test_workingset(folio) || folio_test_referenced(folio));
return max(READ_ONCE(lrugen->max_seq) - gen + 1, READ_ONCE(lrugen->min_seq[type]));
}
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..f320f93d60df 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -511,8 +511,12 @@ void folio_add_lru(struct folio *folio)
/* see the comment in lru_gen_folio_seq() */
if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
- lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
- folio_set_active(folio);
+ lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) {
+ if (folio_test_workingset(folio))
+ folio_set_active(folio);
+ else if (!folio_test_referenced(folio))
+ folio_mark_accessed(folio);
+ }
folio_batch_add_and_move(folio, lru_add);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e452cb043d46..48355f10542b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -848,7 +848,8 @@ static bool lru_gen_set_refs(struct folio *folio)
return false;
}
- set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
+ if (folio_test_active(folio))
+ set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
return true;
}
#else
diff --git a/mm/workingset.c b/mm/workingset.c
index 07e6836d0502..2f0c08aa8623 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -319,11 +319,11 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
- /* see folio_add_lru() where folio_set_active() will be called */
- if (lru_gen_in_fault())
- mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
-
if (workingset) {
+ if (lru_gen_in_fault()) {
+ folio_set_active(folio);
+ mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+ }
folio_set_workingset(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
} else
--
2.34.1