Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Yosry Ahmed

Date: Mon Apr 27 2026 - 14:18:05 EST

On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> >
> > Swap freeing can be expensive when unmapping a VMA containing
> > many swap entries. This has been reported to significantly
> > delay memory reclamation during Android's low-memory killing,
> > especially when multiple processes are terminated to free
> > memory, with slot_free() accounting for more than 80% of
> > the total cost of freeing swap entries.
> >
> > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > to asynchronously collect and free swap entries [1][2], but the
> > design itself is fairly complex.
> >
> Hi Nhat, Kairui, Barry, Xueyuan,
>
> Thanks for the review. I agree with the direction and have some ideas for
> an alternative approach.
>
> My approach: first eliminate pool->lock from zs_free() itself, then defer
> free to per-cpu buffers with a lockless handoff, and finally reduce
> class->lock overhead during drain by exploiting natural class locality.
> Achieving both per-cpu and per-class is difficult, so the class->lock
> optimization is a compromise — but one that works well in practice.
>
> 1. Encode class_idx in obj to eliminate pool->lock
>
> OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> for obj_idx, leaving 14 spare bits.
> We can split OBJ_INDEX into class_idx + obj_idx:
>
> obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
>
> OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> (8 bits for 4K pages, 9 for 64K).
> Since class_idx is invariant across migration (only PFN changes), zs_free()
> can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> stable PFN. No pool->lock needed.
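
To make the proposed layout concrete, here is a rough userspace sketch of
the [PFN | class_idx | obj_idx] split. It is illustrative only, not code
from the patch: the 8/16 split of the 24 index bits, the helper names
(obj_pack, obj_to_class_idx, obj_migrate, ...), and the omission of any
tag bits are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration (arm64-like, 4K pages). */
#define OBJ_INDEX_BITS 24u                               /* low bits below the PFN */
#define OBJ_CLASS_BITS 8u                                /* assumed 4K-page case */
#define OBJ_IDX_BITS   (OBJ_INDEX_BITS - OBJ_CLASS_BITS) /* remainder for obj_idx */

#define OBJ_INDEX_MASK ((UINT64_C(1) << OBJ_INDEX_BITS) - 1)
#define OBJ_CLASS_MASK ((UINT64_C(1) << OBJ_CLASS_BITS) - 1)
#define OBJ_IDX_MASK   ((UINT64_C(1) << OBJ_IDX_BITS) - 1)

/* Hypothetical helper: build obj as [PFN | class_idx | obj_idx]. */
static inline uint64_t obj_pack(uint64_t pfn, unsigned class_idx,
				unsigned obj_idx)
{
	assert(class_idx <= OBJ_CLASS_MASK);
	assert(obj_idx <= OBJ_IDX_MASK);
	return (pfn << OBJ_INDEX_BITS) |
	       ((uint64_t)class_idx << OBJ_IDX_BITS) |
	       (uint64_t)obj_idx;
}

/* class_idx lives in bits that migration never rewrites. */
static inline unsigned obj_to_class_idx(uint64_t obj)
{
	return (unsigned)((obj >> OBJ_IDX_BITS) & OBJ_CLASS_MASK);
}

static inline unsigned obj_to_obj_idx(uint64_t obj)
{
	return (unsigned)(obj & OBJ_IDX_MASK);
}

static inline uint64_t obj_to_pfn(uint64_t obj)
{
	return obj >> OBJ_INDEX_BITS;
}

/* Model of migration: only the PFN field changes. */
static inline uint64_t obj_migrate(uint64_t obj, uint64_t new_pfn)
{
	return (new_pfn << OBJ_INDEX_BITS) | (obj & OBJ_INDEX_MASK);
}
```

The point of the sketch is that obj_to_class_idx() returns the same value
before and after obj_migrate(), which is what would let zs_free() pick
class->lock without taking pool->lock first. The PFN still has to be
re-read once class->lock is held.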

How much of the benefit do we get with just these locking improvements
without having to defer any of the freeing work?

As others have pointed out, I don't want to just defer expensive work
without understanding why it's expensive and what limitations prevent
improving it without deferring.