Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Hao Jia

Date: Mon Jun 01 2026 - 07:10:08 EST

On 2026/5/30 09:24, Yosry Ahmed wrote:

On Tue, May 26, 2026 at 07:45:58PM +0800, Hao Jia wrote:

From: Hao Jia <jiahao1@xxxxxxxxxxx>

The zswap background writeback worker shrink_worker() uses a global
cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
across the online memcgs under root_mem_cgroup.

Proactive writeback also wants a similar per-memcg cursor that is
scoped to the specified memcg, so that repeated invocations against
the same memcg make forward progress across its descendant memcgs
instead of restarting from the first child memcg each time.

Is this a problem in practice?

Is the concern the overhead of scanning memcgs repeatedly, or lack of
fairness? I wonder if we should just do writeback in batches from all
memcgs, similar to how reclaim does it, then evaluate at the end if we
need to start over?

Not using a per-cgroup cursor will cause issues for "repeated small-budget calls" cases. For example, repeatedly triggering a 2MB writeback might result in only writing back pages from the first few child memcgs every time. In the worst-case scenario (where the writeback amount is less than WB_BATCH), it might only ever write back from the first child memcg.

Similar to how memory reclaim uses mem_cgroup_iter() (via struct mem_cgroup_reclaim_iter) and the old shrink_worker() used zswap_next_shrink, we need a shared cursor here.

Naturally, group the cursor and its protecting spinlock into a
zswap_wb_iter struct, and make it a member of struct mem_cgroup to
realize per-memcg cursor management. Accordingly, shrink_worker() now
uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.

If we really need to have per-memcg cursors (I am not a big fan), I
think we can minimize the overhead by making the cursor updates use
atomic cmpxchg instead of having a per-memcg lock.

Because mem_cgroup_iter() always calls css_put(&prev->css), we cannot simply update zswap_wb_iter.pos via cmpxchg() after calling it. Doing so could lead to a double css_put() issue on prev->css.

Therefore, if we switch to the cmpxchg() approach, we wouldn't be able to reuse the existing mem_cgroup_iter() logic. We would have to write a new function similar to cgroup_iter(), and its implementation might end up looking a bit obscure/complex.

Currently, this lock is only used in shrink_memcg(), proactive writeback, and mem_cgroup_css_offline(). Note that shrink_memcg() only acquires the lock of the root cgroup, and mem_cgroup_css_offline() is unlikely to be a hot path.

So, should we keep the spin_lock or go with the cmpxchg() approach?
Yosry and Nhat, what are your thoughts on this?

Because the cursor is now per-memcg, the offline cleanup must visit
every ancestor that could be holding a reference to the dying memcg.
Factor out __zswap_memcg_offline_cleanup() and walk from dead_memcg up
to the root.

Another reason why I don't like per-memcg cursors. There is too much
complexity and I wonder if it's warranted. If we stick with per-memcg
cursors please do the refactoring in separate patches to make the
patches easier to review.

Sorry about that. I will try to keep each patch as simple as possible in the next version.

Thanks,
Hao