Re: [PATCH v5 00/21] Virtual Swap Space

From: Kairui Song

Date: Mon Mar 23 2026 - 06:18:42 EST


On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> This patch series is based on 6.19. There are a couple more
> swap-related changes in mainline that I would need to coordinate
> with, but I still want to send this out as an update for the
> regressions reported by Kairui Song in [15]. It's probably easier
> to just build this thing rather than dig through that series of
> emails to get the fix patch :)
>
> Changelog:
> * v4 -> v5:
> * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> and use guard(rcu) in vswap_cpu_dead
> (reported by Peter Zijlstra [17]).
> * v3 -> v4:
> * Fix poor swap free batching behavior to alleviate a regression
> (reported by Kairui Song).

I tested v5 (including the batched-free hotfix) and am still seeing
significant regressions in both sequential and concurrent swap
workloads.

Thanks for the update; I can see it's a lot of thoughtful work. I had
actually already run some tests with the hotfix you previously posted
on top of v3. I didn't update the results because, very unfortunately,
I still see a major performance regression even with a very simple
setup.

BTW there seems to be a simpler way to reproduce it; just use memhog:
sudo mkswap /dev/pmem0; sudo swapon /dev/pmem0; time memhog 48G; sudo swapoff -a

Before:
(I'm using the fish shell on that test machine, so this is fish's time output format):
________________________________________________________
Executed in 20.80 secs fish external
usr time 5.14 secs 0.00 millis 5.14 secs
sys time 15.65 secs 1.17 millis 15.65 secs
________________________________________________________
Executed in 21.69 secs fish external
usr time 5.31 secs 725.00 micros 5.31 secs
sys time 16.36 secs 579.00 micros 16.36 secs
________________________________________________________
Executed in 21.86 secs fish external
usr time 5.39 secs 1.02 millis 5.39 secs
sys time 16.46 secs 0.27 millis 16.46 secs

After:
________________________________________________________
Executed in 30.77 secs fish external
usr time 5.16 secs 767.00 micros 5.16 secs
sys time 25.59 secs 580.00 micros 25.59 secs
________________________________________________________
Executed in 37.47 secs fish external
usr time 5.48 secs 0.00 micros 5.48 secs
sys time 31.98 secs 674.00 micros 31.98 secs
________________________________________________________
Executed in 31.34 secs fish external
usr time 5.22 secs 0.00 millis 5.22 secs
sys time 26.09 secs 1.30 millis 26.09 secs

It's obviously a lot slower.

pmem may seem rare, but SSDs are fast at sequential I/O, and memhog
fills memory with identical pages; a backend like ZRAM has extremely
low overhead for same-filled pages. Results with ZRAM are very
similar, and many production workloads have massive amounts of
same-filled memory.

For example, on the Android phone I'm using right now:
# cat /sys/block/zram0/mm_stat
4283899904 1317373036 1370259456 0 1475977216 116457 1991851
87273 1793760
That's ~450M of same-filled pages in ZRAM; we may see more on some
server workloads. And I'm seeing similar memhog results with ZRAM;
pmem is just easier to set up, less noisy, and also simulates
high-speed storage.
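To put a number on that ~450M figure: the same_pages field (the sixth
column of mm_stat, per the zram admin-guide documentation) can be
converted to bytes like this, assuming 4 KiB pages:

```python
# Decode zram's mm_stat line (values copied from the output above).
# Column layout per Documentation/admin-guide/blockdev/zram.rst:
# orig_data_size compr_data_size mem_used_total mem_limit
# mem_used_max same_pages pages_compacted huge_pages ...
mm_stat = "4283899904 1317373036 1370259456 0 1475977216 116457 1991851 87273 1793760"
fields = mm_stat.split()

PAGE_SIZE = 4096
same_pages = int(fields[5])              # pages stored as same-filled
samefill_bytes = same_pages * PAGE_SIZE

print(f"{samefill_bytes / (1 << 20):.0f} MiB of same-filled memory")
```

which works out to about 455 MiB on this phone.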

I also re-ran the previous usemem matrix, which looks better than v3
but is still pretty bad:
Test: usemem --init-time -O -n 1 56G, 16G mem, 48G swap, avg of 8 runs.
Before:
Throughput (Sum): 528.98 MB/s, Throughput (Mean): 526.11 MB/s, Free Latency: 3037932.89 usecs
After:
Throughput (Sum): 453.74 MB/s, Throughput (Mean): 454.88 MB/s, Free Latency: 5001144.50 usecs
(throughput ~14% lower, free latency ~65% higher)

I'm not sure why our results differ so much; perhaps different LRU
settings, memory pressure ratios, or THP/mTHP configs? My exact config
is in the attachment, which also includes the full log and info, with
all debug options disabled to stay close to production. I ran it 8
times and attached just the first result log; they're all similar
anyway, and my test framework reboots the machine after each run to
reduce any potential noise.

And the above tests only cover sequential performance; the concurrent
ones look worse:
Test: usemem --init-time -O -R -n 32 622M, 16G mem, 48G swap, avg of 8 runs.
Before:
Throughput (Sum): 5467.51 MB/s, Throughput (Mean): 170.04 MB/s, Free Latency: 28648.65 usecs
After:
Throughput (Sum): 4914.86 MB/s, Throughput (Mean): 152.74 MB/s, Free Latency: 67789.81 usecs
(throughput ~10% lower, free latency ~2.4x higher)
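For transparency, the percentage deltas for both matrices come
straight from the numbers quoted above; a trivial check:

```python
# Relative change (percent) between the before/after runs quoted above.
def pct(before, after):
    return (after - before) / before * 100

# Sequential matrix: throughput sum (MB/s) and free latency (usecs)
print(f"{pct(528.98, 453.74):+.0f}% throughput, "
      f"{pct(3037932.9, 5001144.5):+.0f}% free latency")

# Concurrent matrix
print(f"{pct(5467.51, 4914.86):+.0f}% throughput, "
      f"{pct(28648.65, 67789.81):+.0f}% free latency")
```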

And I double-checked that I'm testing your latest v5 commit here:
commit 9114ebedb82089ebd3519854964c73d3959b10c0 (HEAD -> upstream/vswap)
Author: Nhat Pham <nphamcs@xxxxxxxxx>
Date: Fri Mar 20 12:27:35 2026 -0700

vswap: batch contiguous vswap free calls

In vswap_free(), we release and reacquire the cluster lock for every
single entry, even for non-disk-swap backends where the lock drop is
unnecessary. Batch consecutive free operations to avoid this overhead.

Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>

The two kernels being tested:
/boot/vmlinuz-6.19.0.orig-g05f7e89ab973
/boot/vmlinuz-6.19.0.ptch-g9114ebedb820



The tests above were done on an EPYC 7K62. I also set up an Intel
8255C with freshly installed upstream Fedora, using Fedora's kernel
config. So far the results match; the gap seems smaller but is still
>20% slower in many cases, so this is a common problem:

3 test runs on the 8255C using freshly installed Fedora and the Fedora kernel config:
taskset -c 3 /usr/local/bin/usemem --init-time -O -n 1 112G
(That's a large two-node machine, so I pin the thread to CPU 3 for stability.)

Before:
135291469824 bytes / 124326887 usecs = 1062687 KB/s
2157355 usecs to free memory
135291469824 bytes / 123930024 usecs = 1066090 KB/s
2244083 usecs to free memory
135291469824 bytes / 123484528 usecs = 1069936 KB/s
2268364 usecs to free memory

After:
135291469824 bytes / 127073712 usecs = 1039716 KB/s
3050394 usecs to free memory
135291469824 bytes / 130724757 usecs = 1010677 KB/s
3064270 usecs to free memory
135291469824 bytes / 127248347 usecs = 1038289 KB/s
3035986 usecs to free memory

And besides these known cases, my main concern is still that a
mandatory virtual layer seems just wrong; it changes how swap works in
many ways. Storage folks have been trying to bypass kernel layers for
decades, because abstraction layers come with overhead; that's common
knowledge. Swap lives right at the intersection of storage and mm and
has to stay inside the kernel, so we really want the kernel path to be
as flat and direct as possible.

I'm also worried this risks undoing all the recent and upcoming work
on reducing memory usage and improving performance. We've been trying
to shrink per-entry overhead (I'm already nervous about the current
8-byte per-entry cost, and hope we'll soon get down to <1–3 bytes).
The series mentions 24 bytes of overhead, but once I account for the
reverse mapping, it looks like >32 bytes per entry.

The intermediate large XArray layer also worries me, as the swap space
is now very large. The virtual size could grow without limit: e.g. a 1
TB swap space would need a 4-layer radix tree, increasing global
contention (int(1024 * 1024 / 2) >> 6 >> 6 >> 6 == 2), and the vswap
space could be even larger if fragmentation happens. That's the exact
problem the old sub-address_space design for SWAP was created to
solve. We only eliminated that complexity a few months ago, and this
approach seems like it would have to bring a similar structure back to
reduce contention.
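The height arithmetic above can be sketched as follows. This is a
rough model assuming 64-entry XArray nodes (XA_CHUNK_SHIFT == 6, so
each level consumes 6 index bits); the 1024 * 1024 / 2 entry count
corresponds to 1 TB tracked at 2 MB granularity, and the same space at
4 KB page granularity needs one more level:

```python
import math

# Rough model: levels of a 64-way radix tree needed to index
# nr_entries contiguous entries (6 index bits consumed per level).
def xarray_height(nr_entries):
    if nr_entries <= 1:
        return 1
    return math.ceil((nr_entries - 1).bit_length() / 6)

print(xarray_height(1024 * 1024 // 2))   # 1 TB at 2 MB granularity -> 4 levels
print(xarray_height((1 << 40) // 4096))  # 1 TB at 4 KB granularity -> 5 levels
```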

And on swapoff support: minor anonymous faults during busy periods are
indeed critical for some workloads, and being able to swapoff cleanly
is still very useful for both performance and troubleshooting. You
would need to touch many things to resolve a minor fault.

For reference, I've been exploring an approach that keeps the virtual
layer runtime-optional, which avoids these overheads for workloads
that don't need virtualization:
https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-0-104795d19815@xxxxxxxxxxx/

Attachment: config-n-log.tar
Description: application/tar