Re: [RFC PATCH v2 0/9] mm: support zswap-backed large folio swapin

From: Nhat Pham

Date: Fri May 29 2026 - 14:08:45 EST

On Fri, May 29, 2026 at 5:17 AM fujunjie <fujunjie1@xxxxxx> wrote:
>
> Hi,
>
> This RFC explores large-folio swapin for ranges that are still fully backed
> by zswap.
>
> Large swapin is currently disabled once zswap is in the picture. Anonymous
> faults stop considering large orders after zswap has ever been enabled,
> shmem does the same, and zswap_load() refuses large swapcache folios. That
> keeps mixed zswap/disk cases safe, but it also loses the dense case where
> every slot in an aligned 64K range is still resident in zswap.
>
> The series keeps the policy in common swapin code:
>
> - zswap reports backend facts and provides the large-folio load helper.
> - swapin_sync() filters candidate orders by backend range.
> - all-disk and zeromap ranges keep the existing Kairui large-swapin path.
> - mixed zswap/disk ranges stay order-0.
> - all-zswap ranges may use a 64K folio after locality admission.
> - anon provides locality evidence from VMA hints and PTE young density.
> - shmem starts with explicit VMA-hint evidence only.
> - swap readahead uses its existing VMA/cluster window as locality
> evidence; it does not also run the anon PTE-young rule.
>
> The backend range probe is only a snapshot. If the backend changes after a
> fresh large swapcache folio is allocated, the common path drops that folio
> and falls back to order-0. zswap_load() can also return -EAGAIN for the
> same retry path. If a late fault retry keeps the large folio in swapcache
> instead of deleting it, the cgroup v1 memsw swap owner is committed before
> returning.
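
For reference, the admission flow described above can be written down as
a small userspace C model. This is only a sketch under stated
assumptions: the enum names, classify_range(), pick_swapin_order() and
recheck_zswap_order() are made up for illustration and are not functions
from this series, and the no-zswap case is simplified to "keep the
candidate order" in place of the existing large-swapin checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-slot backend states; not the kernel's types. */
enum slot_backend { SLOT_ZSWAP, SLOT_DISK, SLOT_ZEROMAP };

enum range_kind { RANGE_ALL_ZSWAP, RANGE_NO_ZSWAP, RANGE_MIXED };

/* Classify an aligned range of nr slots from a one-shot snapshot. */
enum range_kind classify_range(const enum slot_backend *slots, size_t nr)
{
	size_t nr_zswap = 0;

	assert(nr > 0);
	for (size_t i = 0; i < nr; i++)
		if (slots[i] == SLOT_ZSWAP)
			nr_zswap++;

	if (nr_zswap == nr)
		return RANGE_ALL_ZSWAP;
	if (nr_zswap == 0)
		return RANGE_NO_ZSWAP;
	return RANGE_MIXED;
}

/* Pick the swapin order for one candidate order (0 means order-0). */
unsigned int pick_swapin_order(const enum slot_backend *slots,
			       unsigned int order, bool locality_ok)
{
	switch (classify_range(slots, (size_t)1 << order)) {
	case RANGE_ALL_ZSWAP:
		/* 64K cap (order 4 with 4K pages) plus locality admission. */
		return (locality_ok && order <= 4) ? order : 0;
	case RANGE_NO_ZSWAP:
		/* Simplification: defer to the existing large-swapin path. */
		return order;
	case RANGE_MIXED:
	default:
		/* Mixed zswap/disk ranges stay order-0. */
		return 0;
	}
}

/*
 * The probe is only a snapshot: after the folio is allocated, re-check
 * an all-zswap choice and fall back to order-0 if the backend changed.
 */
unsigned int recheck_zswap_order(const enum slot_backend *slots_now,
				 unsigned int chosen)
{
	if (chosen == 0)
		return 0;
	if (classify_range(slots_now, (size_t)1 << chosen) != RANGE_ALL_ZSWAP)
		return 0;
	return chosen;
}
```

In this model a caller would run recheck_zswap_order() against a fresh
snapshot once the large swapcache folio exists, which corresponds to the
drop-and-retry-at-order-0 fallback in the quoted paragraph.
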
>
> This is mTHP/large-folio swapin. The mappings installed by do_swap_page()
> are still PTE mappings, not PMD mappings. The expected win is fewer faults,
> batched PTE/rmap work, and preserving the large folio across zswapin
> instead of rebuilding the working set as order-0 pages.
>
> Prior art: Usama Arif posted a related RFC on 2024-10-18:
>
> mm: zswap: add support for zswapin of large folios
> https://lore.kernel.org/linux-mm/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
>
> This RFC keeps the same broad goal, but moves admission into common swapin
> code. zswap does not decide the policy. Mixed zswap/disk ranges are
> rejected before large IO, and the first cap is 64K.
>
> This is a rewrite of the RFC posted on 2026-05-08:
>
> [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
> https://lore.kernel.org/linux-mm/tencent_8B437BE4F586C162950BF71954316C1EDB05@xxxxxx/
>
> The v1 series was anonymous-only and kept too much of the policy near the
> anon fault and zswap paths. This version is rebuilt on top of Kairui Song's
> common swapin infrastructure. It keeps admission in common swapin code,
> rejects mixed zswap/disk large ranges, and adds separate locality producers
> for anon, shmem and swap readahead.
>
> Performance and behavior
> ========================
>
> The A/B tables are 10-run measurements. Elapsed values are seconds,
> shown as mean +/- sample standard deviation. "phase" or "refault" is the
> measured refault subphase. "zswpin" counts zswap loads. "pswpin" counts
> swap-ins from the real swap device; pswpin=0 means the refaults were served
> by zswap even when a disk swap device was configured. "RFC 64K" is the mean
> number of successful 64K swapins.
>
> The numbers below show where the large path is used and where it is
> rejected.
>
> zram-backed zswap microbench, 64K mTHP, 8G guest:
>
> +-----------------+----------------+----------------+--------+--------+--------+----------+
> | workload        | base elapsed   | RFC elapsed    | delta  | phase  | zswpin | RFC 64K  |
> +-----------------+----------------+----------------+--------+--------+--------+----------+
> | usama_1g        | 11.260+/-0.235 | 10.301+/-0.140 | -8.5%  | -22.2% | 1.000x | 16381.1  |
> | nohint_seq64    | 4.398+/-0.085  | 4.025+/-0.022  | -8.5%  | -21.1% | 1.000x | 6221.1   |
> | seqhint_seq64   | 4.283+/-0.060  | 3.948+/-0.062  | -7.8%  | -20.6% | 1.000x | 6223.5   |
> | stride64_sparse | 3.095+/-0.051  | 3.086+/-0.025  | -0.3%  | +5.8%  | 1.002x | 1.0      |
> | random64_sparse | 3.095+/-0.046  | 3.076+/-0.016  | -0.6%  | +0.7%  | 1.001x | 0.0      |
> | random64_full   | 4.423+/-0.067  | 4.405+/-0.018  | -0.4%  | +0.1%  | 1.000x | 0.0      |
> +-----------------+----------------+----------------+--------+--------+--------+----------+
>
> The usama_1g row follows the shape of the 2024 RFC benchmark: allocate 1G,
> fill it with compressible per-page data, reclaim it through memory.reclaim,
> then time the full integrity-check refault. The seq64 rows use a 512M
> target and 768M of pressure. "sparse" touches one 4K page per 64K region, while
> "full" touches every 4K page. "seqhint" uses MADV_SEQUENTIAL; "nohint" does
> not.
>
> Virtio-block swap device present, zswap enabled:
>
> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
> | workload        | base elapsed  | RFC elapsed   | delta  | refault | pswpin | zswpin | RFC 64K |
> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
> | seq64           | 4.399+/-0.100 | 4.279+/-0.216 | -2.7%  | -10.5%  | 0      | 1.000x | 3110.7  |
> | stride64_sparse | 3.103+/-0.047 | 3.119+/-0.086 | +0.5%  | +6.2%   | 0      | 0.999x | 0.0     |
> | random64_sparse | 3.142+/-0.112 | 3.097+/-0.030 | -1.4%  | -2.2%   | 0      | 0.999x | 0.1     |
> | random64_full   | 4.473+/-0.147 | 4.445+/-0.088 | -0.6%  | +0.9%   | 0      | 1.000x | 0.4     |
> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
>
> This run uses a real block swap device, but the refaulted data stayed in
> zswap. It covers the all-zswap hit path with disk swap configured, not disk
> read IO.
>
> Virtio-block pressure/mixed run, zswap max_pool_percent=1,
> low-compressibility full fill:
>
> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
> | workload                      | base elapsed  | RFC elapsed   | delta  | refault | pswpin base/RFC | RFC zswpin | RFC 64K | fallback |
> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
> | seq64_full_pressure           | 5.908+/-0.195 | 5.790+/-0.235 | -2.0%  | +3.0%   | 90258/99038     | 20327      | 0.0     | 3730     |
> | random64_sparse_full_pressure | 5.104+/-0.069 | 5.068+/-0.090 | -0.7%  | -9.1%   | 6201/6196       | 1297       | 0.0     | 0        |
> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
>
> This run reaches the disk-backed path: pswpin is non-zero in both base and
> RFC. It is mainly fallback coverage. The RFC does not install 64K folios
> for these disk/mixed-heavy ranges.

OK, the results above look good. Basically, if the access pattern has no
spatial locality, we don't do THP zswapin. Nice.
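
To spell out the behaviour the tables suggest, here is a minimal
userspace C sketch of an anon locality gate. Everything in it is an
assumption for illustration: the enum, anon_locality_ok(), the idea that
a sequential hint admits on its own, and the min_young threshold are not
taken from the patches; only the "random hint vetoes, otherwise look at
PTE young density" shape comes from the cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the VMA access hint (e.g. set via madvise()). */
enum vma_hint { HINT_NONE, HINT_SEQUENTIAL, HINT_RANDOM };

/*
 * Decide whether a candidate range shows enough spatial locality.
 * pte_young[i] is true when the i-th PTE the producer looks at is young.
 * min_young is a made-up tunable, not a value from the series.
 */
bool anon_locality_ok(enum vma_hint hint, const bool *pte_young,
		      size_t nr, size_t min_young)
{
	size_t young = 0;

	assert(nr > 0 && min_young <= nr);

	/* An explicit random hint vetoes large swapin outright. */
	if (hint == HINT_RANDOM)
		return false;
	/* Assumption: an explicit sequential hint is sufficient evidence. */
	if (hint == HINT_SEQUENTIAL)
		return true;

	/* No hint: fall back to the density of young PTEs. */
	for (size_t i = 0; i < nr; i++)
		if (pte_young[i])
			young++;
	return young >= min_young;
}
```

With a gate of this shape, sparse and random workloads fail the density
check and stay order-0, which is what the near-zero "RFC 64K" counts in
the sparse and random rows look like.
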

>
> Policy matrix, virtio-block swap device present:
>
> +------------------------------+----+------+--------+--------+-------+----------+
> | case                         | pc | hint | pswpin | zswpin | zswpwb| 64K in   |
> +------------------------------+----+------+--------+--------+-------+----------+
> | pc0_seq                      | 0  | none | 0      | 99559  | 0     | 0        |
> | pc3_seq                      | 3  | none | 0      | 99498  | 0     | 0        |
> | pc4_seq                      | 4  | none | 0      | 99512  | 0     | 3109     |
> | pc5_seq                      | 5  | none | 0      | 99657  | 0     | 3113     |
> | hint_none_random_sparse      | 5  | none | 0      | 6265   | 0     | 0        |
> | hint_random_seq              | 5  | rand | 0      | 99488  | 0     | 0        |
> | mixed_seq_full               | 5  | none | 97725  | 20147  | 84    | 569      |
> | mixed_random_sparse_full     | 5  | none | 6230   | 1302   | 0     | 0        |
> +------------------------------+----+------+--------+--------+-------+----------+
>
> The pc rows show the readahead-window gate. The hint_random_seq row shows
> the explicit random hint veto. The mixed rows use a small zswap pool to
> force disk/zswap split backing; most mixed ranges are rejected, while any
> remaining 64K successes were all-zswap at the time of the fault.
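
The pc rows are consistent with a gate that admits a large folio only
when the readahead window covers the whole folio: assuming "pc" is
vm.page-cluster and 4K pages, a 64K folio is 16 pages, pc=3 gives an
8-page window and pc=4 gives 16. That reading is inferred from the table,
not from the patches, and readahead_window_allows() below is a
hypothetical helper that only restates it in C.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the readahead-window gate as inferred from the pc rows:
 * the window is 2^page_cluster pages and must cover 2^order pages.
 */
bool readahead_window_allows(unsigned int page_cluster, unsigned int order)
{
	unsigned long window_pages, folio_pages;

	/* Keep the shifts well defined for any unsigned long width. */
	assert(page_cluster < 32 && order < 32);

	window_pages = 1UL << page_cluster;
	folio_pages = 1UL << order;
	return window_pages >= folio_pages;
}
```

For order 4 this returns false at pc=0 and pc=3 and true at pc=4 and
pc=5, matching where "64K in" becomes non-zero in the matrix.
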
>
> Kbuild pressure, zram swap, 384M memcg:
>
> +----------------------+----------+----------+--------+--------+----------+
> | setup                | base     | RFC      | delta  | zswpin | RFC 64K  |
> +----------------------+----------+----------+--------+--------+----------+
> | zram swap, 384M memcg| 2060.323 | 2047.516 | -0.6%  | 0.991x | 2797     |
> +----------------------+----------+----------+--------+--------+----------+
>
> This is a single-run zram pressure smoke. It did not show Kbuild
> regression, and the RFC run installed 64K zswap-backed folios. The result
> should not be read as a tuned-performance claim.
>
> Kbuild pressure, virtio-block swap device, 512M memcg:
>
> +-------------------------+----------+----------+--------+--------+----------+
> | setup                   | base     | RFC      | delta  | pswpin | RFC 64K  |
> +-------------------------+----------+----------+--------+--------+----------+
> | disk swap, 512M memcg   | 1420.671 | 1409.263 | -0.8%  | 0      | 7497     |
> +-------------------------+----------+----------+--------+--------+----------+
>
> This is a single-run pressure smoke. The disk-swap Kbuild run also stayed
> on the all-zswap hit path, so it is pressure coverage with a disk swap device
> present rather than proof of disk-read large swapin.

Why only a single run?

>
> Shmem smoke, tmpfs huge=always, 64K shmem mTHP:
>
> +----------------------------+---------------+---------+-------------+----------+
> | case                       | refault hint  | touched | 64K shmem   | 64K in   |
> +----------------------------+---------------+---------+-------------+----------+
> | nohint_seq                 | none          | 65536   | 4096        | 0        |
> | seq_refault_hint           | sequential    | 65536   | 4096        | 4096     |
> | random_refault_hint_sparse | random        | 4096    | 4096        | 0        |
> +----------------------------+---------------+---------+-------------+----------+
>
> That matches the current shmem producer: explicit sequential refault hints
> allow large zswap swapin; no hint and random hints do not.
>
> What this RFC does not establish
> ================================
>
> The 64K cap is deliberate, but it is not tuned. The anon PTE-young rule is
> only anon evidence. Shmem has the framework and explicit VMA hints in this
> RFC, not a page-cache locality producer. For larger orders, the anon
> producer should probably use bounded sampling instead of walking every PTE
> in a 1M or larger candidate range. The mixed-backend tests cover fallback
> behavior, but this series does not add mixed zswap/disk large IO.

The mixed IO can be deferred, but I think we should figure out a rule
to extend this hint to arbitrarily sized ranges, and preferably to shmem
too.
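
One possible shape for such a rule, following the bounded-sampling idea
in the quoted paragraph, is sketched below as userspace C. It is an
illustration only: sampled_locality_ok(), max_samples and min_pct are
hypothetical names and tunables, and evenly spaced sampling is just one
choice among several.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Bounded locality check for a range of nr pages: look at no more than
 * max_samples evenly spaced entries instead of walking every PTE, and
 * admit the range when at least min_pct percent of the samples are young.
 */
bool sampled_locality_ok(const bool *pte_young, size_t nr,
			 size_t max_samples, unsigned int min_pct)
{
	size_t samples = nr < max_samples ? nr : max_samples;
	size_t young = 0;

	assert(nr > 0 && max_samples > 0 && min_pct <= 100);

	for (size_t i = 0; i < samples; i++) {
		/* Evenly spaced index; always < nr because i < samples. */
		size_t idx = i * nr / samples;

		if (pte_young[idx])
			young++;
	}

	/* Integer form of young / samples >= min_pct / 100. */
	return young * 100 >= (size_t)min_pct * samples;
}
```

The cost is then bounded by max_samples regardless of the candidate
order, at the price of being fooled by strided access that happens to
line up with the sample positions, so the sample placement would need
some thought.
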