Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Nhat Pham

Date: Mon Jun 01 2026 - 12:39:21 EST

On Mon, Jun 1, 2026 at 8:56 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > On Thu, May 28, 2026 at 02:29:24PM +0800, Nhat Pham wrote:
> > > Based on: mm-unstable @ 444fc9435e57 + swap-table phase IV v5 [2].
> > >
> > > I manually adapted Kairui's ghost device implementation (from [4])
> > > for my vswap device. I've credited him as Co-developed-by on Patch I
> > > since a substantial portion of the dynamic-cluster infrastructure is
> > > his (I did propose the idea of using xarray/radix tree for dynamic
> > > swap clusters allocation and management though :P).
> > >
> > > >From here on out, for simplicity, I will refer to swap table phase IV
> > > as "P4", and the older v6 virtual swap space implementation as "v6".
> > >
> >
> > ...
> >
> > >
> > > This series reimplements the virtual swap space concept (see [1])
> > > on top of Kairui Song's swap table infrastructure, on top of [2]
> > > and in accordance with his proposal in [3]. The proposal's idea
> > > is interesting, so I decided to give it a shot myself. I'm still not
> > > 100% sure that this is bug-proof, but hey, it compiles, and has
> > > not crashed in my simple stress testing :)
> > >
> > > The prototype here is feature-complete relative to the swap-table P4
> > > baseline — swapout, swapin, freeing, swapoff, zswap writeback, zswap
> > > shrinker, memcg charging, and THP swapin all work for
> > > both vswap and direct-physical entries — and satisfies all three
> > > requirements above: no backend coupling (zswap/zero entries hold no
> > > physical slot), dynamic swap space (clusters allocated on demand via
> > > xarray, no static provisioning), and efficient backend transfer
> > > (in-place vtable updates, no PTE/rmap walking).
> > >
> > > II. Design
> > >
> > > With vswap, pages are assigned virtual swap entries on a ghost device
> > > with no backing storage. These entries are backed by zswap, zero pages,
> > > or (lazily) physical swap slots. Physical backing is allocated only
> > > when needed — on zswap writeback or reclaim writeout, after the rmap
> > > step.
> > >
> > > Compared to the standalone v6 implementation [1], which introduces a
> > > 24-byte per-entry swap descriptor and its own cluster allocator, this
> > > edition uses swap_table infrastructure, and share a lot of the allocator
> > > logic. Per-slot metadata is stored in a tag-encoded virtual_table
> > > (atomic_long_t, 8 bytes per slot), and physical clusters store
> > > Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> > > the virtual cluster.
> > >
> > > Here are some data layout diagrams:
> > >
> > > Case 1: vswap entry (virtualized)
> > >
> > > PTE swap_cluster_info_dynamic
> > > vswap_entry +-------------------------+
> > > (swp_entry_t) ------>| swap_cluster_info (ci) |
> > > | +--------------------+ |
> > > | | swap_table | |
> > > | | PFN / Shadow | |
> > > | | memcg_table | |
> > > | | count,flags,order | |
> > > | | lock, list | |
> > > | +--------------------+ |
> > > | |
> > > | virtual_table |
> > > | +--------------------+ |
> > > | | NONE | |
> > > | | PHYS | |
> > > | | ZERO | |
> > > | | ZSWAP(entry*) | |
> > > | | FOLIO(folio*) | |
> > > | +--------------------+ |
> > > +-------------------------+
> > > |
> > > | PHYS resolves to
> > > v
> > > PHYSICAL CLUSTER (swap_cluster_info)
> > > +--------------------------+
> > > | swap_table per-slot: |
> > > | NULL - free |
> > > | PFN - cached folio |
> > > | Shadow - swapped out |
> > > | Pointer- vswap rmap |
> > > | Bad - unusable |
> > > | |
> > > | Vswap-backing slot: |
> > > | Pointer(C|swp_entry_t) |
> > > | rmap back to vswap |
> > > +--------------------------+
> > >
> > > Case 2: direct-mapped physical entry (no vswap)
> > >
> > > PTE PHYSICAL CLUSTER (swap_cluster_info)
> > > phys_entry +--------------------------+
> > > (swp_entry_t) ------>| swap_table per-slot: |
> > > | NULL - free |
> > > | PFN - cached folio |
> > > | Shadow - swapped out |
> > > | Bad - unusable |
> > > +--------------------------+
> > >
> > > struct swap_cluster_info_dynamic {
> > > struct swap_cluster_info ci; /* swap_table, lock, etc. */
> > > unsigned int index; /* position in xarray */
> > > struct rcu_head rcu; /* kfree_rcu deferred free */
> > > atomic_long_t *virtual_table; /* backend info, 8 B/slot */
> > > };
> > >
> > > Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> > > swap_cluster_info struct with a virtual_table array that stores the
> > > backend information for each virtual swap entry in the cluster. Each
> > > entry is tag-encoded in the low 3 bits to indicate backend types:
> > >
> > > NONE: |----- 0000 ------|000| free / unbacked
> > > PHYS: |-- (type:5,off:N)|001| on a physical swapfile (shifted)
> > > ZERO: |----- 0000 ------|010| zero-filled page
> > > ZSWAP: |--- zswap_entry* |011| compressed in zswap
> > > FOLIO: |--- folio* ------|100| in-memory folio
> >
> > Thanks for trying this approach!
>
> Thanks for the suggestions. I hope going forward we have sth concrete
> to tinker with, rather than abstractions :P
>
> >
> > For the format part, PHYS don't need that much bits I think,
> > so by slightly adjust the format vswap device could be share
> > mostly the same format with ordinary device.
> >
> > For example typical modern system don't have a address space larger
> > than 52 bit. (Even with full 64 bits used for addressing, shift it
> > by 12 we get 52). Plus 5 for type, you get 57, so you can have a
> > marker that should work as long as it shorter than 1000000 for PHYS,
> > and shared for all table format since it's not in conflict with
> > anything. You have also use a few extra bits so a single swap space
> > can be 8 times larger than RAM space, and since we can help
> > multiple swap type I think that should be far than enough?
> >
> > Then you have Shadow back at 001, and zero bit in shadow. The only
> > special one is Zswap, which will be 100 now, and that's exactly the
> > reserved pointer format in current swap table format, on seeing
> > si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)
>
> Are you suggesting we merge the virtual table with main swap table?
>
> Man, I'd love to do this. There is a problem though - we have a case
> where we occupy both backing physical swap AND swap cache. Do you
> think we can fit both the physical swap slot handle and the swap cache
> PFN into the same slot in virtual table? Maybe with some expanding...?
>
> Another option is we can be a bit smart about it - if a virtual swap
> entry is in swap cache AND occupies physical swap slot, then put the
> folio at the physical swap's table, use folio->swap as the rmap.
>
> (I think you recommend this approach somewhere but for the life of me
> I can't find the reference - apologies if I'm putting words into your
> mouth :))
>
> But this is a bit more complicated - extra care is needed for rmap
> handling at the physical swap layer, and swap cache handling at the
> virtual swap layer. Maybe a follow-up? :)
>
> >
> > Folio / PFN can still be 010 as in the current swap table format.
> >
> > Then everything seems clean and aligned, no more special handling
> > for vswap needed, there are detailed to sort out, but it should work.
> >
> > > - Pointer-tagged swap_table on physical clusters for rmap (physical
> > > -> virtual) lookup.
> >
> > Or reuse the PHYS format (rename it maybe) since point back to vswap
> > is also pointing to a si.
>
> Noted. I'm just doing the simplest thing right now - working
> prototype. I mean, we have enough bits :)
>
> >
> > > III. Follow-ups:
> > >
> > > In no particular order (and most of which can be done as follow-up
> > > patch series rather than shoving everything in the initial landing):
> > >
> > > - More thorough stress testing is very much needed.
> > >
> > > - Performance benchmarks to make sure I don't accidentally regress
> > > the vswap-less case, and that the vswap's case performance is
> > > good. I suspect I will have to port a lot of the
> > > optimizations I implemented in v6 over here - some of the
> > > inefficiencies are inherent in any swap virtualization, and
> > > would require the same fix (for e.g the MRU cluster caching
> > > for faster cluster lookup - see [8] and [9]).
> >
> > This could be imporved by per-si percpu cluster. Both YoungJun's
> > tiering and Baoquan's previous swap ops mentioned this is needed,
> > and now vswap also need that. If the vswap is also a si, then it will
> > make use of this too.

Oh and the MRU cluster caching I mentioned here is not the allocation
caching. It's the lookup caching, basically to avoid doing the
xa_load() to look up clusters for consecutive swap operations on the
same vswap cluster (which is the common case with vswap). For v6, it
massively reduces this indirection lookup overhead. Performance-wise
it's an absolute winner, just more complexity (because I need to
handle reference counting carefully).

I also just realized we'll induce the indirection overhead on
allocation here too, even if the cached cluster still have slots for
allocation, because we look up the cluster (which is basically free
for static swap device, but not free for vswap devices). Might need to
take care of that to maintain vswap performance (but it will then
diverge from your existing code...).