Re: [PATCH v4 05/12] mm, swap: unify large folio allocation

From: Kairui Song

Date: Sat May 16 2026 - 14:27:37 EST

On Fri, May 15, 2026 at 6:11 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@xxxxxxxxxx> wrote:
>
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> Now that direct large order allocation is supported in the swap cache,
> both anon and shmem can use it instead of implementing their own methods.
> This unifies the fallback and swap cache check, which also reduces the
> TOCTOU race window of swap cache state: previously, high order swapin
> required checking swap cache states first, then allocating and falling
> back separately. Now all these steps happen in the same compact loop.
>
> Order fallback and statistics are also unified, callers just need to
> check and pass the acceptable order bitmask.
>
> There is basically no behavior change. This only makes things more
> unified and prepares for later commits. Cgroup and zero map checks can
> also be moved into the compact loop, further reducing race windows and
> redundancy
>
> Acked-by: Chris Li <chrisl@xxxxxxxxxx>
> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> ---
> mm/memory.c | 77 ++++++------------------------
> mm/shmem.c | 95 ++++++++++---------------------------
> mm/swap.h | 30 ++----------
> mm/swap_state.c | 143 ++++++++++----------------------------------------------
> mm/swapfile.c | 3 +-
> 5 files changed, 68 insertions(+), 280 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 6edb23b41bac..e3edc0c20e34 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -159,7 +159,7 @@ static unsigned long shmem_default_max_inodes(void)
>
> static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
> - struct vm_area_struct *vma, vm_fault_t *fault_type);
> + struct vm_fault *vmf, vm_fault_t *fault_type);
>
> static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
> {
> @@ -2017,68 +2017,25 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> }
>
> static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> - struct vm_area_struct *vma, pgoff_t index,
> + struct vm_fault *vmf, pgoff_t index,
> swp_entry_t entry, int order, gfp_t gfp)
> {
> + pgoff_t ilx;
> + struct folio *folio;
> + struct mempolicy *mpol;
> + /* Always allow order 0 so swap won't fail under pressure. */
> + unsigned long orders = BIT(order) | BIT(0);
> struct shmem_inode_info *info = SHMEM_I(inode);
> - struct folio *new, *swapcache;
> - int nr_pages = 1 << order;
> - gfp_t alloc_gfp = gfp;
> -
> - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> - if (WARN_ON_ONCE(order))
> - return ERR_PTR(-EINVAL);
> - } else if (order) {
> - /*
> - * If uffd is active for the vma, we need per-page fault
> - * fidelity to maintain the uffd semantics, then fallback
> - * to swapin order-0 folio, as well as for zswap case.
> - * Any existing sub folio in the swap cache also blocks
> - * mTHP swapin.
> - */
> - if ((vma && unlikely(userfaultfd_armed(vma))) ||
> - !zswap_never_enabled() ||
> - non_swapcache_batch(entry, nr_pages) != nr_pages)
> - goto fallback;
>
> - alloc_gfp = thp_shmem_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> - }
> -retry:
> - new = shmem_alloc_folio(alloc_gfp, order, info, index);
> - if (!new) {
> - new = ERR_PTR(-ENOMEM);
> - goto fallback;
> - }
> + if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) ||
> + !zswap_never_enabled())
> + orders = BIT(0);
>
> - if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
> - alloc_gfp, entry)) {
> - folio_put(new);
> - new = ERR_PTR(-ENOMEM);
> - goto fallback;
> - }
> + mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
> + folio = swapin_sync(entry, gfp, orders, vmf, mpol, ilx);
> + mpol_cond_put(mpol);
>
> - swapcache = swapin_folio(entry, new);
> - if (swapcache != new) {
> - folio_put(new);
> - if (!swapcache) {
> - /*
> - * The new folio is charged already, swapin can
> - * only fail due to another raced swapin.
> - */
> - new = ERR_PTR(-EEXIST);
> - goto fallback;
> - }
> - }
> - return swapcache;
> -fallback:
> - /* Order 0 swapin failed, nothing to fallback to, abort */
> - if (!order)
> - return new;
> - entry.val += index - round_down(index, nr_pages);
> - alloc_gfp = gfp;
> - nr_pages = 1;
> - order = 0;
> - goto retry;
> + return folio;
> }
>

Sashiko reported a problem on this:

When shmem_swap_alloc_folio() computes the interleave index (ilx) for
MPOL_INTERLEAVE and MPOL_WEIGHTED_INTERLEAVE NUMA policies, it passes the
original large swap entry order to shmem_get_pgoff_policy().
If the allocation falls back to smaller orders (like order-0) inside
swap_cache_alloc_folio(), will this ilx be reused for all those fallback
allocations?
Since the calculation of ilx incorporates the original order, reusing the
same interleave index for all 512 fallback pages of a 2MB swap entry might
force them all onto the exact same NUMA node. Does this defeat the intended
page-by-page interleaving policy and potentially cause memory bandwidth
bottlenecks?

===

I initially thought this was trivial. ilx is already somewhat broken
if we are doing fallback. shmem_get_pgoff_policy() computes ilx =
i_ino + (index >> order). The shift makes sense if all folios are of
the same order: an unshifted ilx = i_ino + index would give index %
nnodes == 0 for every THP folio on power-of-2 node counts, since THP
indexes are aligned to the folio size, so shifting by order ensures
interleave is still effective.

However, once a file is backed with mixed-order folios due to fallback
or other cases, the shift becomes order-dependent and the ilx mapping
is no longer monotonic. The calculated interleave is already skewed.
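
To make the two points above concrete, here is a minimal userspace sketch
of the arithmetic. It is not the kernel code: the sketch_* helpers are
hypothetical names, and picking the node as ilx % nnodes is a simplifying
assumption (the real policy selects from a nodemask).

```c
#include <stdint.h>

/*
 * Simplified model of the computation described above:
 * ilx = i_ino + (index >> order). Hypothetical helper, not kernel code.
 */
static uint64_t sketch_ilx(uint64_t i_ino, uint64_t index, unsigned int order)
{
	return i_ino + (index >> order);
}

/*
 * Assumption: the interleave policy picks node (ilx % nnodes). The real
 * policy walks a nodemask, so this is only an approximation.
 */
static unsigned int sketch_node(uint64_t ilx, unsigned int nnodes)
{
	return (unsigned int)(ilx % nnodes);
}

/* Node chosen for the folio covering @index when allocated at @order. */
static unsigned int sketch_folio_node(uint64_t i_ino, uint64_t index,
				      unsigned int order, unsigned int nnodes)
{
	return sketch_node(sketch_ilx(i_ino, index, order), nnodes);
}
```

With i_ino = 0 and 4 nodes, order-9 folios at index 512, 1024 and 1536
map to nodes 1, 2 and 3 in this model, while passing order 0 for the same
indexes maps all of them to node 0. For mixed orders, an order-0 folio at
index 511 gets ilx = 511, but the order-9 folio at index 512 gets ilx = 1,
so ilx goes backwards as the index grows.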

It deserves a separate look, as that's a pre-existing, separate problem.
I think I'd better not change that here, as it might cause confusion.
The problem can be solved (or ignored?) later as it's not critical; the
ilx is just a hint anyway. For now I'll just squash in the following
change to keep the behavior identical to before (hoisting the fallback
into shmem, just like before):

diff --git a/mm/shmem.c b/mm/shmem.c
index e3edc0c20e34..4427661ab2ee 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2023,19 +2023,26 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
pgoff_t ilx;
struct folio *folio;
struct mempolicy *mpol;
- /* Always allow order 0 so swap won't fail under pressure. */
- unsigned long orders = BIT(order) | BIT(0);
struct shmem_inode_info *info = SHMEM_I(inode);

if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) ||
!zswap_never_enabled())
- orders = BIT(0);
+ order = 0;

+again:
mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
- folio = swapin_sync(entry, gfp, orders, vmf, mpol, ilx);
+ folio = swapin_sync(entry, gfp, BIT(order), vmf, mpol, ilx);
mpol_cond_put(mpol);

- return folio;
+ if (!IS_ERR(folio))
+ return folio;
+
+ if (order) {
+ order = 0;
+ goto again;
+ }
+
+ return NULL;
}

/*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 946ec4ae9ae1..ce4e8c39ed12 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -652,7 +652,7 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
* if needed. @entry is rounded down if @orders allow large allocation.
*
* Context: Caller must ensure @entry is valid and pin the swap device with refcount.
- * Return: Returns the folio on success, NULL if failed.
+ * Return: Returns the folio on success, error code if failed.
*/
struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orders,
struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
@@ -667,7 +667,7 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orders,
} while (IS_ERR(folio) && PTR_ERR(folio) == -EEXIST);

if (IS_ERR(folio))
- return NULL;
+ return folio;

swap_read_folio(folio, NULL);
return folio;