Re: [PATCH v5 00/21] Virtual Swap Space

From: Kairui Song

Date: Fri Apr 24 2026 - 14:59:08 EST

On Sat, Apr 25, 2026 at 2:08 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
>
> On Thu, Apr 23, 2026 at 9:16 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > Yosry Ahmed <yosry@xxxxxxxxxx> wrote on Fri, Apr 24, 2026 at 04:48:
> > > > Using a swapfile does have its benefits, though. For example, the
> > > > virtual layer could act as an ordinary tier following YoungJun's
> > > > design:
> > > > https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@xxxxxxx/
> > >
> > > Hmm I didn't look too closely at this but I don't understand how
> > > making it a swapfile helps with tiering? If anything, I think it makes
> > > tiering more difficult. For tiering to work, we need an
> > > abstraction/redirection layer, such that we don't need to update the
> > > page tables (or shmem pagecache) if we demote/promote pages. That is
> > > exactly the use case for a virtual swap layer. The page tables point
> > > at a virtual swap ID and the backend could change transparently (e.g.
> > > for zswap writeback, or tiering).
> > >
> > > If we make the virtual layer a swapfile, how do we demote/promote
> > > without updating page tables?
> > >
> > > IOW, I think the whole reason we want a virtual layer is to separate
> > > the backends, which would facilitate tiering. If the virtual layer is
> > > itself a swapfile, wouldn't it become one of the tiers?
> >
> > That's exactly what I hoped, virtual layer being part of the tier.
> > Tier could be set up per task / cgroup. So is the virtual tier.
>
> Just to clarify. I don't think virtual swap should be one of the
> tiers. I think it should be the mechanism through which we implement
> tiering (see above). I am not sure if that's what you meant.

YoungJun's swap tier has been working pretty well without the virtual part:
https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@xxxxxxx/

> >
> > A standalone implementation of the virtual layer is heavier than
> > one that is a swapfile. Actually I think at this point, the word
> > "swapfile" itself is misleading. We may rename it to "swap mapping" or
> > something. A swap mapping could be physical or virtual. A virtual
> > mapping can realloc from physical ones (redirect), and swapoff of
> > physical ones just reads their data into the virtual mapping's swap cache.
>
> I don't understand this part, please clarify. In my mind, all
> references to swap entries from outside backend code should refer to a
> virtual swap ID, which could be pointing to physical swap or zswap or
> something else.

For example, just reserve a type (e.g. type 0) as the virtual type?
("type" is really a bad name, though.)

Then that swap file (or swap mapping) will be

I was trying that based on this:
https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-15-104795d19815@xxxxxxxxxxx/

It seems to work and the only thing we need is actually just something
like this one in VSS:
https://lore.kernel.org/linux-mm/20260320192735.748051-15-nphamcs@xxxxxxxxx/

This part:
+ /* fall back to physical swap device */
+ if (!vswap_alloc_swap_slot(folio)) {

We do a folio_realloc_swap if folio->swap has type 0.

Which means, if there is no virtual device / mapping / file / space
(I'm not sure how to name it at this point :) ), the ordinary swap
routine is still there, untouched.

If there is one, and it's being used, then it is still the ordinary
swap routine with one extra allocation (and the extra allocation
strictly follows YoungJun's tier rule). That is the same as VSS, but
everything is reused. From a user or high-level interface perspective,
this can be designed to be no different from VSS, with a few
bonuses: it is optional per memcg / task / at runtime, has zero
overhead if not enabled, and reuses all the infra.
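
To make the flow above concrete, here is a minimal userspace sketch of
the idea, not kernel code. Every name in it (the sketch_* types and
helpers, the bump allocator) is hypothetical and only models "type 0 is
virtual, and a virtual entry gets one extra physical allocation at
writeout"; the real folio_realloc_swap and the tier rules are not
reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: type 0 is reserved as the virtual type, as proposed above. */
#define SKETCH_VIRTUAL_TYPE 0u

/* Hypothetical stand-in for a swap entry: (type, offset). */
struct sketch_swp_entry {
	uint32_t type;
	uint32_t offset;
};

/* Hypothetical folio: 'swap' is what page tables see, 'backing' is where data goes. */
struct sketch_folio {
	struct sketch_swp_entry swap;
	struct sketch_swp_entry backing;
	bool has_backing;
};

/* Hypothetical physical device with a trivial bump allocator. */
struct sketch_phys_dev {
	uint32_t type;      /* must not be SKETCH_VIRTUAL_TYPE */
	uint32_t next_free;
	uint32_t nr_slots;
};

static bool sketch_phys_alloc(struct sketch_phys_dev *dev,
			      struct sketch_swp_entry *out)
{
	if (dev->next_free >= dev->nr_slots)
		return false; /* device full */
	out->type = dev->type;
	out->offset = dev->next_free++;
	return true;
}

/* Returns true if the folio has a physical slot to be written to. */
static bool sketch_prepare_writeout(struct sketch_folio *folio,
				    struct sketch_phys_dev *dev)
{
	if (folio->swap.type != SKETCH_VIRTUAL_TYPE) {
		/* Ordinary routine: the entry already names a physical slot. */
		folio->backing = folio->swap;
		folio->has_backing = true;
		return true;
	}
	/* Virtual entry: one extra allocation; folio->swap stays unchanged. */
	if (!sketch_phys_alloc(dev, &folio->backing))
		return false;
	folio->has_backing = true;
	return true;
}
```

In this model a folio whose entry has a non-zero type never touches the
extra allocation, which is the "zero overhead if not enabled" property,
while a type 0 folio keeps its entry stable and only its backing slot
changes.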

BTW this deferred allocation (in VSS or dynamic swap mapping, it's a
similar thing) is actually a bit concerning to me as well. It changes
the common swapout routine and may be worth reconsidering (e.g.
activate_locked_split and mTHP stats are now ignored?). Being optional
for now also seems safer.

> I *think* what you're saying is that we should make that optional, but
> I don't see how this would work. If a page table is pointing at a swap
> slot in a swapfile, we cannot do tiering or zswap writeback or
> anything dynamic without updating page tables. So even if the system
> starts off with one swapfile, we cannot assume we won't add more and
> set up tiering (or enable zswap) after that, right?
>
> I guess we'll keep the swap table in the swapfile and then we'll have
> it point to a different backend, but I really don't like this design.
> It's unnecessarily complicated in my opinion. Page tables will either
> refer to a virtual swap ID or a physical swap slot.

Or in other words, they are all just swap entries, and the swap layer
handles things internally.

> I think we can simply have swap tables representing the virtual swap
> space and pointing at the backend directly, whether or not we have
> zswap or tiering set up or not. Is the overhead really that bad?

Right... I mean with two layers you will likely have >16 bytes of
overhead and a double lookup. And I have been thinking about cutting
down the memory usage to 3 bytes. And you can't make the lower /
physical layer just a bitmap if you want a reverse mapping, and so far
many things do require that. If we make the reverse mapping optional
it might be more complicated than the thing we discussed.
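
For a rough sense of scale, the arithmetic can be sketched as below.
This assumes the byte counts are per swap slot and that a slot is one
4 KiB page; both are assumptions of the sketch, and the helper name is
made up. 16 bytes is used as the lower bound of the ">16 bytes" above.

```c
#include <stdint.h>

/* Assumption: one swap slot per 4 KiB page. */
#define SKETCH_PAGE_SIZE 4096ull

/* Metadata bytes needed to track 'swap_bytes' of swap space. */
static uint64_t sketch_swap_metadata_bytes(uint64_t swap_bytes,
					   uint64_t bytes_per_slot)
{
	uint64_t nr_slots = swap_bytes / SKETCH_PAGE_SIZE;

	return nr_slots * bytes_per_slot;
}
```

Under those assumptions, 1 TiB of swap is 2^28 slots, so 16 bytes per
slot costs 4 GiB of metadata while 3 bytes per slot costs 768 MiB.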

I don't think the thing I described above is that complicated, after
reading all the code and solutions so far. Maybe some better
abstraction can help?

I've seen some vendors doing swap using UFFD just to cut down the
overhead, or using a highly customized backend solution for swap, so I
was hoping the kernel part could be as minimal as possible.