Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Kairui Song

Date: Mon Jun 01 2026 - 03:35:06 EST

On Thu, May 28, 2026 at 02:29:24PM +0800, Nhat Pham wrote:
> Based on: mm-unstable @ 444fc9435e57 + swap-table phase IV v5 [2].
>
> I manually adapted Kairui's ghost device implementation (from [4])
> for my vswap device. I've credited him as Co-developed-by on Patch I
> since a substantial portion of the dynamic-cluster infrastructure is
> his (I did propose the idea of using xarray/radix tree for dynamic
> swap clusters allocation and management though :P).
>
> From here on out, for simplicity, I will refer to swap table phase IV
> as "P4", and the older v6 virtual swap space implementation as "v6".
>

...

>
> This series reimplements the virtual swap space concept (see [1])
> on top of Kairui Song's swap table infrastructure, on top of [2]
> and in accordance with his proposal in [3]. The proposal's idea
> is interesting, so I decided to give it a shot myself. I'm still not
> 100% sure that this is bug-proof, but hey, it compiles, and has
> not crashed in my simple stress testing :)
>
> The prototype here is feature-complete relative to the swap-table P4
> baseline — swapout, swapin, freeing, swapoff, zswap writeback, zswap
> shrinker, memcg charging, and THP swapin all work for
> both vswap and direct-physical entries — and satisfies all three
> requirements above: no backend coupling (zswap/zero entries hold no
> physical slot), dynamic swap space (clusters allocated on demand via
> xarray, no static provisioning), and efficient backend transfer
> (in-place vtable updates, no PTE/rmap walking).
>
> II. Design
>
> With vswap, pages are assigned virtual swap entries on a ghost device
> with no backing storage. These entries are backed by zswap, zero pages,
> or (lazily) physical swap slots. Physical backing is allocated only
> when needed — on zswap writeback or reclaim writeout, after the rmap
> step.
>
> Compared to the standalone v6 implementation [1], which introduces a
> 24-byte per-entry swap descriptor and its own cluster allocator, this
> edition uses swap_table infrastructure, and share a lot of the allocator
> logic. Per-slot metadata is stored in a tag-encoded virtual_table
> (atomic_long_t, 8 bytes per slot), and physical clusters store
> Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> the virtual cluster.
>
> Here are some data layout diagrams:
>
> Case 1: vswap entry (virtualized)
>
>   PTE                  swap_cluster_info_dynamic
>   vswap_entry          +-------------------------+
>   (swp_entry_t) ------>| swap_cluster_info (ci)  |
>                        | +--------------------+  |
>                        | | swap_table         |  |
>                        | |   PFN / Shadow     |  |
>                        | | memcg_table        |  |
>                        | | count,flags,order  |  |
>                        | | lock, list         |  |
>                        | +--------------------+  |
>                        |                         |
>                        | virtual_table           |
>                        | +--------------------+  |
>                        | | NONE               |  |
>                        | | PHYS               |  |
>                        | | ZERO               |  |
>                        | | ZSWAP(entry*)      |  |
>                        | | FOLIO(folio*)      |  |
>                        | +--------------------+  |
>                        +-------------------------+
>                                     |
>                                     | PHYS resolves to
>                                     v
>                        PHYSICAL CLUSTER (swap_cluster_info)
>                        +--------------------------+
>                        | swap_table per-slot:     |
>                        |   NULL   - free          |
>                        |   PFN    - cached folio  |
>                        |   Shadow - swapped out   |
>                        |   Pointer- vswap rmap    |
>                        |   Bad    - unusable      |
>                        |                          |
>                        | Vswap-backing slot:      |
>                        |   Pointer(C|swp_entry_t) |
>                        |   rmap back to vswap     |
>                        +--------------------------+
>
> Case 2: direct-mapped physical entry (no vswap)
>
>   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
>   phys_entry           +--------------------------+
>   (swp_entry_t) ------>| swap_table per-slot:     |
>                        |   NULL   - free          |
>                        |   PFN    - cached folio  |
>                        |   Shadow - swapped out   |
>                        |   Bad    - unusable      |
>                        +--------------------------+
>
> struct swap_cluster_info_dynamic {
>         struct swap_cluster_info ci;    /* swap_table, lock, etc. */
>         unsigned int index;             /* position in xarray */
>         struct rcu_head rcu;            /* kfree_rcu deferred free */
>         atomic_long_t *virtual_table;   /* backend info, 8 B/slot */
> };
>
> Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> swap_cluster_info struct with a virtual_table array that stores the
> backend information for each virtual swap entry in the cluster. Each
> entry is tag-encoded in the low 3 bits to indicate backend types:
>
>   NONE:  |----- 0000 ------|000|  free / unbacked
>   PHYS:  |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
>   ZERO:  |----- 0000 ------|010|  zero-filled page
>   ZSWAP: |--- zswap_entry* |011|  compressed in zswap
>   FOLIO: |--- folio* ------|100|  in-memory folio

Thanks for trying this approach!

For the format part, I don't think PHYS needs that many bits, so
with a slight adjustment to the format the vswap device could share
mostly the same format as an ordinary device.

For example, a typical modern system doesn't have an address space
larger than 52 bits. (Even with the full 64 bits used for addressing,
shifting by 12 leaves 52.) Plus 5 for the type, you get 57, so you can
have a marker for PHYS that works as long as it is shorter than
1000000, and it can be shared by all table formats since it doesn't
conflict with anything. You could also use a few extra bits so a single
swap space can be 8 times larger than RAM, and since we can have
multiple swap types I think that should be more than enough?
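
To spell out the arithmetic, here is a rough userspace sketch. The
macro and function names are made up for illustration, and the 4 KiB
page size and the 3 extra offset bits are assumptions, not something
taken from the series:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define PAGE_SHIFT_ASSUMED	12	/* 4 KiB pages */
#define ADDR_BITS_WORST_CASE	64	/* full 64-bit addressing */
#define SWAP_TYPE_BITS		5	/* as in the quoted PHYS layout */

/* Bits needed to index every page-sized slot of the address space. */
static inline unsigned int offset_bits(void)
{
	return ADDR_BITS_WORST_CASE - PAGE_SHIFT_ASSUMED;	/* 52 */
}

/* Bits used by type + offset, plus optional extra offset bits. */
static inline unsigned int phys_payload_bits(unsigned int extra)
{
	return offset_bits() + SWAP_TYPE_BITS + extra;
}

/* Bits left over in a 64-bit table entry for the PHYS marker. */
static inline unsigned int phys_marker_bits(unsigned int extra)
{
	return 64 - phys_payload_bits(extra);
}
```

With no extra bits the payload is 57 bits and 7 bits remain for the
marker. Spending 3 more bits on the offset (8x larger swap space)
still leaves 4 bits.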

Then you have Shadow back at 001, with the zero bit in the shadow. The
only special one is Zswap, which will be 100 now, and that's exactly
the reserved pointer format in the current swap table format: on seeing
si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)

Folio / PFN can still be 010 as in the current swap table format.

Then everything seems clean and aligned, with no more special handling
needed for vswap. There are details to sort out, but it should work.
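
Roughly, the lookup side could then look like the sketch below. This
is a simplified userspace illustration: the TB_* names, the exact
3-bit tag match and the si_is_vswap flag are all hypothetical, and the
real swap table encoding has more detail (swap count bits, the PHYS
marker, etc.) that is left out here:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 3-bit tags following the layout suggested above. */
#define TB_TAG_MASK	0x7UL
#define TB_TAG_SHADOW	0x1UL	/* 001: shadow, carries the zero bit */
#define TB_TAG_PFN	0x2UL	/* 010: folio / PFN */
#define TB_TAG_POINTER	0x4UL	/* 100: reserved pointer format */

enum tb_kind { TB_NULL, TB_SHADOW, TB_PFN, TB_ZSWAP, TB_POINTER, TB_OTHER };

/*
 * Classify one table entry. The only vswap-specific branch is the
 * pointer tag: on a vswap device it is taken to mean a zswap entry.
 */
static inline enum tb_kind tb_classify(unsigned long swp_tb, bool si_is_vswap)
{
	if (!swp_tb)
		return TB_NULL;

	switch (swp_tb & TB_TAG_MASK) {
	case TB_TAG_SHADOW:
		return TB_SHADOW;
	case TB_TAG_PFN:
		return TB_PFN;
	case TB_TAG_POINTER:
		return si_is_vswap ? TB_ZSWAP : TB_POINTER;
	default:
		return TB_OTHER;	/* e.g. a PHYS marker, not modelled */
	}
}
```

The point is only that the device flag plus the existing pointer tag
is enough to tell zswap apart, so no vswap-only tag is needed.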

> - Pointer-tagged swap_table on physical clusters for rmap (physical
> -> virtual) lookup.

Or reuse the PHYS format (rename it maybe), since pointing back to
vswap is also pointing to a si.

> III. Follow-ups:
>
> In no particular order (and most of which can be done as follow-up
> patch series rather than shoving everything in the initial landing):
>
> - More thorough stress testing is very much needed.
>
> - Performance benchmarks to make sure I don't accidentally regress
> the vswap-less case, and that the vswap's case performance is
> good. I suspect I will have to port a lot of the
> optimizations I implemented in v6 over here - some of the
> inefficiencies are inherent in any swap virtualization, and
> would require the same fix (for e.g the MRU cluster caching
> for faster cluster lookup - see [8] and [9]).

This could be improved by a per-si percpu cluster. Both YoungJun's
tiering and Baoquan's previous swap ops work mentioned this is needed,
and now vswap also needs it. If vswap is also a si, then it will
make use of this too.

YoungJun posted this a few months ago:
https://lore.kernel.org/linux-mm/20260131125454.3187546-5-youngjun.park@xxxxxxx/

The concern is that some lock contention could get heavier, though
maybe that's just a hypothetical problem.

>
> - Runtime enable/disable of the vswap device. To be honest, I don't
> know if there is a value in this. My preference is vswap can be
> optimized to the point that any overhead is negligible. Failing that,
> maybe we can come up with some simple heuristics that automatically
> decides for users?
>
> In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
> boot, and CONFIG_VSWAP=n means the vswap device is never created. This
> *might* be enough just on its own.
>
> Is a runtime knob (sysfs or sysctl) worth the complexity beyond
> these heuristics? I'm not sure yet. Maintaining both cases

I checked the code and I think it's not hard to do. Patch 1 already
handles the metadata dynamically, so everything will still just work
even if you remove vswap at runtime. The rest of the patches need
adaptation but might not end up being complex, if the other comments
here are considered.

For patch 2, a few routines like vswap_can_swapin_thp seem unneeded,
or should be moved to __swap_cache_alloc? The VSWAP_FOLIO check is the
same as the swap cache folio check, which is already covered. Same for
the zero check, and for VSWAP_NONE, which is the same as the swap
count check I think. That way we not only save a lot of code, we also
no longer need to treat vswap specially.

If you keep the format compatible with what we already have, as the
earlier comment mentions, a large portion of this part might be
unneeded.

> at runtime also has overhead for checking as well, and some of the
> checks are not cheap :)

I also noticed the newly introduced swap_read_folio_phys in patch 3.
This can actually be done using Baoquan's swapops idea, which is now
part of Christoph's swap batching:

https://lore.kernel.org/linux-mm/20260528124559.2566481-9-hch@xxxxxx/#r

That series focuses on batching and better performance, but swapops
was also proposed as a way to solve the virtual layer. It makes it
possible to have vswap as one kind of swapops, which Chris has talked
a lot about:

https://lore.kernel.org/linux-mm/aZiFvzlBJiYBUDre@MiWiFi-R3L-srv/

Following this, we could have something like:

const struct swap_ops swap_vswap_ops = {
	.submit_write = swap_vswap_submit_write,
	.submit_read = swap_vswap_submit_read,
};

Then move the folio_realloc_swap into swap_vswap_submit_write.

Merging of IO might be moved to the lower physical level for vswap.
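
To make the shape of that concrete, here is a userspace mock. Every
type and function in it is a stand-in I made up (mock_folio,
mock_swap_ops, mock_folio_realloc_swap, ...), not the kernel API or
the signatures from either series, and it assumes folio_realloc_swap
is the step that assigns the physical backing:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins only; none of this is the real kernel API. */
struct mock_folio {
	long phys_slot;	/* -1 until physical backing is assigned */
	int writes;	/* writes that reached the "physical device" */
};

struct mock_swap_ops {
	int (*submit_write)(struct mock_folio *folio);
	int (*submit_read)(struct mock_folio *folio);
};

static long mock_next_free_slot;

/* Stand-in for folio_realloc_swap(): assign a physical slot lazily. */
static int mock_folio_realloc_swap(struct mock_folio *folio)
{
	if (folio->phys_slot < 0)
		folio->phys_slot = mock_next_free_slot++;
	return 0;
}

static int mock_phys_submit_write(struct mock_folio *folio)
{
	if (folio->phys_slot < 0)
		return -1;	/* no backing slot to write to */
	folio->writes++;
	return 0;
}

static int mock_phys_submit_read(struct mock_folio *folio)
{
	return folio->phys_slot < 0 ? -1 : 0;
}

static const struct mock_swap_ops mock_phys_ops = {
	.submit_write = mock_phys_submit_write,
	.submit_read = mock_phys_submit_read,
};

/* vswap write: get physical backing first, then hand off to phys ops. */
static int mock_vswap_submit_write(struct mock_folio *folio)
{
	int err = mock_folio_realloc_swap(folio);

	if (err)
		return err;
	return mock_phys_ops.submit_write(folio);
}

static int mock_vswap_submit_read(struct mock_folio *folio)
{
	return mock_phys_ops.submit_read(folio);
}

static const struct mock_swap_ops mock_vswap_ops = {
	.submit_write = mock_vswap_submit_write,
	.submit_read = mock_vswap_submit_read,
};
```

Callers only ever see the ops table, so the lazy physical allocation
stays inside the vswap write path.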

Another gain is that the memory usage and CPU overhead will be
lower with only one layer. I've recently been trying to offload the
swap data plane off the CPU so the CPU won't touch the data at all;
the overhead will then come purely from swap itself, plus some mm
overhead. Things like that, and the IO optimization above, could make
the swap subsystem more and more performance sensitive, so we have
cases that need only one layer.

>
> - Defer per-cluster memcg_table and zeromap allocation on physical
> clusters. A physical swap cluster backing vswap entries only do
> not really need their memcg_table, but the current design forces
> us to allocate it anyway. This is a waste of memory, and is an
> overhead regression compared to my older design on the zswap-only
> case, which Johannes has pointed out multiple times (see [6]),
> and is one of the biggest reasons why I have not been satisfied
> with this approach thus far. It honestly is a bit of a
> deal-breaker...
>
> That said, I think I might be able to allocate them on demand, i.e
> only when the first direct-mapped slot is allocated on that cluster.
> That will give us the best of BOTH worlds, for both the vswap and
> directly-mapped physical swap cases. No promises, but I will try
> (if this approach is good enough for all parties).

The zero map is not really a problem when it's just an inlined bit, I
think. For memcg table allocation, on demand seems like a good idea,
and actually we are not far from there. I tried to generalize the
alloc-then-retry-sleep-alloc in swap_alloc_table, but it's still not
generic enough I guess. The good news is that the allocation of the
table is already kind of on demand; we just need to split the
detection of these two kinds of table.

Meanwhile, I also remember we once discussed splitting the
accounting for vswap / physical swap? If we went with that approach we
wouldn't need to treat memcg_table specially.

> - Widen swap_info_struct->max to unsigned long. The vswap device's
> max is currently clamped to ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
> (~16 TiB) to fit in unsigned int. 16 TiB is small for vswap,
> especially when we're getting increasingly big machines memory-wise.

This should be very easy to do: just replace unsigned int with
unsigned long. A lot of places to touch though :)
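
For reference, the current limit works out as below. This is a
userspace sketch that assumes a 32-bit unsigned int, 4 KiB pages and
SWAPFILE_CLUSTER == 512; the *_ASSUMED macros and helper names are
mine:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumed configuration, for illustration only. */
#define SWAPFILE_CLUSTER_ASSUMED	512U
#define PAGE_SIZE_ASSUMED		4096ULL

/* Power-of-two align-down, same idea as the kernel's ALIGN_DOWN(). */
#define ALIGN_DOWN_POW2(x, a)	((x) & ~((a) - 1))

/* Largest cluster-aligned slot count that fits in unsigned int. */
static inline unsigned int vswap_max_slots_uint(void)
{
	return ALIGN_DOWN_POW2(UINT_MAX, SWAPFILE_CLUSTER_ASSUMED);
}

/* The same limit expressed in bytes. */
static inline uint64_t vswap_max_bytes_uint(void)
{
	return (uint64_t)vswap_max_slots_uint() * PAGE_SIZE_ASSUMED;
}
```

That is 2^32 - 512 slots, i.e. 16 TiB minus one 2 MiB cluster, which
matches the ~16 TiB figure quoted above.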