Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Nhat Pham

Date: Mon Jun 01 2026 - 14:11:05 EST

On Mon, Jun 1, 2026 at 10:45 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Mon, Jun 1, 2026 at 11:57 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >
> > > For the format part, PHYS don't need that much bits I think,
> > > so by slightly adjust the format vswap device could be share
> > > mostly the same format with ordinary device.
> > >
> > > For example typical modern system don't have a address space larger
> > > than 52 bit. (Even with full 64 bits used for addressing, shift it
> > > by 12 we get 52). Plus 5 for type, you get 57, so you can have a
> > > marker that should work as long as it shorter than 1000000 for PHYS,
> > > and shared for all table format since it's not in conflict with
> > > anything. You have also use a few extra bits so a single swap space
> > > can be 8 times larger than RAM space, and since we can help
> > > multiple swap type I think that should be far than enough?
> > >
> > > Then you have Shadow back at 001, and zero bit in shadow. The only
> > > special one is Zswap, which will be 100 now, and that's exactly the
> > > reserved pointer format in current swap table format, on seeing
> > > si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)
> >
> > Are you suggesting we merge the virtual table with main swap table?
> >
> > Man, I'd love to do this. There is a problem though - we have a case
> > where we occupy both backing physical swap AND swap cache. Do you
> > think we can fit both the physical swap slot handle and the swap cache
> > PFN into the same slot in virtual table? Maybe with some expanding...?
>
> I don't really get why we would need to do that? If you put the PFN
> info in the virtual / upper layer, then the count info, locking, and
> all swap IO synchronization (via folio lock), dup (current protected
> by ci lock / folio lock), and allocation (folio_alloc_swap), are all
> handled in this layer.
>
> The physical / lower layer will just hold a reverse entry on
> folio_realloc_swap, or no entry at all (no physical layer used, zswap,
> or after swap allocation but before IO) right?
>
> Looking up the actual folio from the physical layer will be a bit
> slower since it needs to resolve the reverse entry, but the only place
> we need to do that is things like migrate, compaction (none of them
> exist yet) which seems totally fine?

All of this is correct, but consider swaping in a vswap entry backed
by pswap. There are cases where you still want to maintain the pswap
slots around backing vswap entry, while having the swap cache folio as
well.

For e.g, at swap in time, we add the folio into the swap cache. First
of all, we need to hold on to the physical swap slot for IO step. But
even after IO succeeds, there are cases where you would still like to
keep physical swap slots around (for e.g, to avoid swapping out again
if the folio is only speculatively fetched).

So you have to make sure we have space for both the physical swap
slot, and the swap cache folio's PFN at the same time for each vswap
entry. So we still need the vtable extension (well maybe the other
approach I mentioned could work, but I'm not 100% sure).

>
> > > This could be imporved by per-si percpu cluster. Both YoungJun's
> > > tiering and Baoquan's previous swap ops mentioned this is needed,
> > > and now vswap also need that. If the vswap is also a si, then it will
> > > make use of this too.
> >
> > Yeah I made the same recommendation when I review swap tier last week:
> >
> > https://lore.kernel.org/all/CAKEwX=N2XcMHN1jatppOk6wnmz-Shab5XMtTtzgYOzRvU_6YFw@xxxxxxxxxxxxxx/
> >
> > I like it, but yeah it will be complicated. That said, I think not
> > fixing the fast path for tiering/vswap will seriously restrict their
> > usefulness. We don't want to go back to the old swap allocator days :)
>
> Thanks. Not too complicated, actually our internal kernel
> implementation still using si->percpu cluster, and use a counter for
> the rotation and each order have a counter :P, it's a bit ugly but
> works fine. It still serves pretty well just like the global percpu
> cluster, YoungJun's previous per ci percpu cluster also still provides
> the fast path, many ways to do that.

Sounds like something that should be upstreamed? ;)

I was concerned that you might deem the overhead too high (so I came
up with the per-tier percpu caching), but honestly I like it a lot.

>
> >
> > >
> > > YoungJun posted this a few month before:
> > > https://lore.kernel.org/linux-mm/20260131125454.3187546-5-youngjun.park@xxxxxxx/
> > >
> > > The concern is that some locking contention could be heavier, or maybe
> > > that's just a hypothetical problem though.
> >
> > I don't think it's hypothetical. At least with vswap, it's very easy
> > to get into a state where the shared per-cpu cache gets invalidated
> > constantly if phys swap and vswap allocation alternates (which is
> > actually very possible under heavy memory pressure), hammering the
> > slow paths...
>
> I mean if the per-cpu cache is moved to si level, then whoever enters
> the allocation path of a si will almost always get a stable percpu
> cache to use, even if the last used si changes.
>
> We might better have a more explicit per si fast path for that. The
> plist rotation should better be done in a different (will be even
> better if lockless) way.

Exactly! That's what I'm thinking too. Basically I'm special-casing
for vswap here, but if you're happy with generalizing this for every
si I'm happy with it.

>
> >
> > >
> > > >
> > > > - Runtime enable/disable of the vswap device. To be honest, I don't
> > > > know if there is a value in this. My preference is vswap can be
> > > > optimized to the point that any overhead is negligible. Failing that,
> > > > maybe we can come up with some simple heuristics that automatically
> > > > decides for users?
> > > >
> > > > In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
> > > > boot, and CONFIG_VSWAP=n means the vswap device is never created. This
> > > > *might* be enough just on its own.
> > > >
> > > > Is a runtime knob (sysfs or sysctl) worth the complexity beyond
> > > > these heuristics? I'm not sure yet. Maintaining both cases
> > >
> > > I checked the code and I think it's not hard to do, patch 1 already
> > > handling the meta data dynamically, everything will still just work
> > > even if you remove vswap at runtime. The rest of patches need adaption
> > > but might not end up being complex, it other comments here
> > > are considered.
> >
> > Yeah, it's not terribily hard to do. I'm more wondering if it's worth
> > the effort, both for the implementer and the user :)
> >
> > As I said here, if we want vswap, just enable it at boot time and get
> > a vast (but dynamic) device. We can make it optional per-cgroup
> > through Youngjun's interface, and that would be good enough?
> >
> > >
> > > For patch 2, a few routines like vswap_can_swapin_thp seems not
> > > needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
> > > same as swap cache folio check, which is already covered. Same for
> > > zero checking, and VSWAP_NONE which is same as swap count check
> > > I think. That way we not only save a lot of code, we also no
> > > longer need to treat vswap specially.
> >
> > Unfortunately, I think a lot of this complexity is still needed. Vswap
> > adds a new layer, which means new complications :)
> >
> > For instance, I think you still need vswap_can_swapin_thp. It
> > basically enforces that the backend must be something
> > swap_read_folio() can handle. That means:
> >
> > 1. No zswap.
> >
> > 2. No mixed backend.
>
> If mixed backend means phys vs zero vs zswap, then we already have
> part of that covered with the current swap cache except for the phys
> part (zswap part seems very doable with fujunjie's work).
> swap_cache_alloc_folio will ensure there is no mixed zerobit, it can
> be easily extended to ensure there is no mixed zswap as well
> (according to what I've learned from fujunjie's code). Similar logic
> for phys detection I think.

Yeah it's basically generalizing that check, and handle the case where
we can have indirection.

I mean I can open-code it, but it has to be there :) And I figure it
might be useful to check this opportunistically (at swap_pte_batch,
even if it's not guaranteed to be correct down the line) before we
even attempt to allocate a large folio etc. to avoid large folio
allocation.

>
> > > For memcg table allocation, on demand seems a good idea, and actually
> > > we are not far from there, I tried to generalize the
> > > alloc-then-retry-sleep-alloc in swap_alloc_table but still not generic
> > > enough I guess.. Good new is the allocation of the table is already
> > > kind of ondemand, just need to split the detection of these two kind
> > > of table.
> >
> > I have a prototype of this, but I have not tested, so I do not want to
> > send it out. :)
> >
> > TLDR is - I still want to record the memcg for vswap (just not charge
> > it towards the counter). So we still need memcg_table at both level,
> > generally - just not allocating until needed (basically if a physical
> > swap slot in the cluster is directly mapped into PTE). You can kinda
> > tell, since you pass the folio into the allocation path - with some
> > care you can distinguish between:
> >
> > 1. Virtual swap, or directly maped physical swap -> need memcg_table
> >
> > 2. Physical swap, backing vswap -> does not memcg_table.
> >
> > Another alternative is you can defer this allocation until the point
> > where you have to do the charging action. But then you have to be
> > careful with failure handling, and need to backoff ya di da di da.
> > Funsies.
> >
> > I think I did a mixed of these 2 strategies. Anyway, I'll include the
> > patch in v2 (if folks like this approach).
> >
> > >
> > > Mean while I also remember we once discussed about splitting the
> > > accounting for vswap / physical swap? If we went that approach we
> > > don't need to treat memcg_table specially.
> >
> > For the charging behavior, I already have a patch for it actually in
> > this series (just not the dynamic allocation of the memcg_table field
> > yet).
> >
> > Basically:
> >
> > 1. For vswap entry, not backed by phys swap: record swap memcg, hold
> > reference to pin the memcg, but not charging towards swap.current.
>
> Maybe you don't need to record memcg here since folio->memcg already
> have that info?
>
> I previously had a patch:
> https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-7-104795d19815@xxxxxxxxxxx/
>
> The defers the recording of memcg, the behavior is almost identical to
> before, but charging & recording should be cleaner and you don't need
> to record memcg at allocation time hence maybe reduce the possibility
> of pinning a memcg. I didn't include that in P4 just to reduce LOC,
> maybe can be resent or included.

That works-ish when the folio is sitll in swap cache, but say if it's
vswap backed by zswap (and the swap cache folio has been reclaimed),
you need a place to store the memcg, no?

Just seems cleaner to centralize this info at vswap layer when it is
presented, for now anyway, rather than juggling this on a per-backend
basis.

>
> > 2. For phys swap backing vswap: charging towards swap.current, but
> > does not record the memcg in its memcg_table, nor does it hold
> > reference to memcg (its vswap entry holds the reference already)
> >
> > 2. For phys swap directly mapped to PTE: charges, records, and holds reference.
> >
> > The motivation here is I do not want vswap entry to shares the same
> > limit as phys swap counter. If we think of it as "infinite" or
> > "dynamic", it should not be capped at all, but even if it is charged,
> > it should be something separate.
>
> Good to know, it's not too messy to make everything dynamic, but if
> our ultimate goal is to account for them separately, maybe we can save
> some effort here.

It's not terrible, TBH.