Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release

From: Yosry Ahmed

Date: Fri May 08 2026 - 16:13:15 EST


On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing many swap
> entries. This has been reported to significantly delay memory reclamation
> during Android's low-memory killing, especially when multiple processes
> are terminated to free memory, with slot_free() accounting for more than
> 80% of the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the design
> itself is fairly complex.
>
> When anon folios and swap entries are mixed within a process, reclaiming
> anon folios from killed processes helps return memory to the system as
> quickly as possible, so that newly launched applications can satisfy
> their memory demands. It is not ideal for swap freeing to block anon
> folio freeing. On the other hand, swap freeing can still return memory
> to the system, although at a slower rate due to memory compression.
>
> This series introduces a callback-based deferred free framework in
> zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> define what gets buffered and how it gets drained. The entire free
> path including caller-side bookkeeping (slot_free, zswap_entry_free)
> is deferred to a background worker.

How much of the speedup comes from avoiding the per-class lock,
free_zspage(), other work in zswap, etc.?

I ask because I think the design here is still fairly complex. I don't
like how zswap and zram register callbacks into zsmalloc to do their
own freeing work, and how they fill the buffers on behalf of zsmalloc,
which seems like a layering violation.

I wonder how much of the speedup we get by just deferring
free_zspage()? That part can be done much more simply by just putting
the pages on a per-class list and having an async worker or a kthread
consume them and batch-free them. If the rest of zs_free() is also
expensive, we can do the deferred freeing at that level, although it
would be more complicated, as we would need a fixed-size buffer to
store the pending frees and a way to handle running out of space.
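
To make the simpler alternative concrete, here is a rough userspace
sketch in C of deferring only the zspage release: the hot path pushes
an empty zspage onto a per-class list, and a worker later detaches the
whole list and frees it in one batch. Every name here (zspage_stub,
size_class_stub, defer_free_zspage, drain_deferred) is a hypothetical
stand-in and not zsmalloc code, and locking and the worker scheduling
are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a zspage; only the list linkage matters. */
struct zspage_stub {
	struct zspage_stub *next;
};

/* Hypothetical stand-in for a size class with a deferred-free list. */
struct size_class_stub {
	struct zspage_stub *deferred;	/* empty zspages awaiting release */
	size_t nr_deferred;		/* length of the deferred list */
	size_t nr_freed;		/* zspages released so far */
};

/*
 * Hot path: an O(1) list push replaces the inline release.
 * Assumption: a real version would need locking around the list;
 * it is omitted in this single-threaded sketch.
 */
static void defer_free_zspage(struct size_class_stub *cls,
			      struct zspage_stub *zspage)
{
	assert(zspage != NULL);
	zspage->next = cls->deferred;
	cls->deferred = zspage;
	cls->nr_deferred++;
}

/*
 * Worker side: detach the whole list first, then release every entry
 * in one batch. free() stands in for the real page release.
 * Returns the number of zspages released by this call.
 */
static size_t drain_deferred(struct size_class_stub *cls)
{
	struct zspage_stub *list = cls->deferred;
	size_t freed = 0;

	cls->deferred = NULL;
	cls->nr_deferred = 0;

	while (list) {
		struct zspage_stub *next = list->next;

		free(list);
		list = next;
		freed++;
	}

	cls->nr_freed += freed;
	return freed;
}
```

Because the list is unbounded and linked through the zspage itself,
this variant needs no buffer sizing and no synchronous fallback, which
is where the extra complexity of deferring all of zs_free() would come
from.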

A breakdown of where the slowdown is coming from would be helpful to
understand what to focus on.

>
> Implementation:
> - Each CPU owns a single-page buffer. The hot path writes a value
>   via the push callback with preemption disabled (no locks).
> - When the buffer fills, it is swapped with a fresh page from a
>   pre-allocated page pool. The full page is queued to a WQ_UNBOUND
>   worker for drain.
> - The drain callback performs the actual expensive work (zs_free,
>   slot_free, zswap_entry_free, etc.) in batch, off the hot path.
> - If no free page is available, the caller falls back to synchronous
>   processing.
>
> The speedup comes from moving expensive swap slot freeing off the
> munmap hot path into a background worker, so that intact anonymous
> folios are released back to the system without blocking. The worker
> drains at a slower rate since compressed objects are small and freeing
> a single handle may not release an entire page until the zspage is
> fully empty.
>
> Performance results (Raspberry Pi 4B, ARM64, 8GB RAM):
>
> Test 1: munmap latency for 256MB swap-filled VMA (zram backend)
>
> mode       Base       Patched    Speedup
> single     61.82ms    8.62ms     7.17x
> multi 2p   94.75ms    54.11ms    1.75x
> multi 3p   154.64ms   104.83ms   1.48x
>
> Test 2: munmap latency for different sizes (zram, single process)
>
> Size     Base       Patched    Speedup
> 64MB     14.11ms    2.18ms     6.47x
> 128MB    29.45ms    4.48ms     6.57x
> 192MB    43.85ms    6.62ms     6.62x
> 256MB    57.01ms    9.08ms     6.28x
> 512MB    115.13ms   55.58ms    2.07x
> 1024MB   229.66ms   153.28ms   1.50x
>
> Test 3: munmap latency for 256MB swap-filled VMA (zswap backend)
>
> mode       Base       Patched    Speedup
> single     152.14ms   51.26ms    2.97x
> multi 2p   186.56ms   105.42ms   1.77x
> multi 3p   205.83ms   153.32ms   1.34x
>
> Test 4: munmap latency for different sizes (zswap, single process)
>
> Size     Base       Patched    Speedup
> 64MB     37.83ms    13.26ms    2.85x
> 128MB    75.11ms    26.73ms    2.81x
> 256MB    150.78ms   52.97ms    2.85x
> 512MB    303.04ms   130.38ms   2.32x
> 1024MB   599.95ms   287.10ms   2.09x
>
> [1] https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@xxxxxxxx/
> [2] https://lore.kernel.org/all/20250909065349.574894-1-liulei.rjpt@xxxxxxxx/
> [3] https://lore.kernel.org/linux-mm/20260412060450.15813-1-baohua@xxxxxxxxxx/
>
> Changes since v2:
> - Use per-cpu single-page buffers instead of a global list; the hot
>   path only writes into the local CPU's buffer with preemption disabled
> - Add a page pool for buffer rotation: when the current buffer is full,
>   swap it with a free page from the pool and queue the full page for
>   drain
> - Introduce push/drain callback ops so that zram and zswap can each
>   define their own element size and drain logic (zram stores u32 slot
>   indices, zswap stores unsigned long handles)
> - Drop the lock optimization patches it will be submitted separately
>   as part of a dedicated zsmalloc lock contention series
> - Link to v2: https://lore.kernel.org/all/20260421121616.3298845-1-haowenchao@xxxxxxxxxx/
>
> Barry Song (1):
>   zram: use zsmalloc deferred free callback for async slot free
>
> Wenchao Hao (3):
>   mm/zsmalloc: introduce deferred free framework with callback ops
>   mm/zswap: use zsmalloc deferred free callback for async invalidate
>   zram: batch clear flags in slot_free with single write
>
> drivers/block/zram/zram_drv.c | 44 ++++++-
> drivers/block/zram/zram_drv.h | 6 +
> include/linux/zsmalloc.h | 16 +++
> mm/zsmalloc.c | 208 +++++++++++++++++++++++++++++++++-
> mm/zswap.c | 38 ++++++-
> 5 files changed, 306 insertions(+), 6 deletions(-)
>
> --
> 2.34.1
>