Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback

From: Nhat Pham

Date: Wed Jun 03 2026 - 14:23:53 EST

On Wed, Jun 3, 2026 at 4:27 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
>
>
>
> On 2026/5/30 09:37, Yosry Ahmed wrote:
> > On Tue, May 26, 2026 at 07:45:59PM +0800, Hao Jia wrote:
> >> From: Hao Jia <jiahao1@xxxxxxxxxxx>
> >>
> >> Zswap currently writes back pages to backing swap reactively, triggered
> >> either by the shrinker or when the pool reaches its size limit. There is
> >> no mechanism to control the amount of writeback for a specific memory
> >> cgroup. However, users may want to proactively write back zswap pages,
> >> e.g., to free up memory for other applications or to prepare for
> >> memory-intensive workloads.
> >>
> >> Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> >> interface. When specified, this key bypasses standard memory reclaim
> >> and exclusively performs proactive zswap writeback up to the requested
> >> budget. If omitted, the default reclaim behavior remains unchanged.
> >>
> >> Example usage:
> >> # Write back 100MB of pages from zswap to the backing swap
> >> echo "100M zswap_writeback_only" > memory.reclaim
> >>
> >> Note that the actual amount written back may be less than requested due
> >> to the zswap second-chance algorithm: referenced entries are rotated on
> >> the LRU on the first encounter and only written back on a second pass.
> >> If fewer bytes are written back than requested, -EAGAIN is returned,
> >> matching the existing memory.reclaim semantics.
> >>
> >> Internally, extend user_proactive_reclaim() to parse the new
> >> "zswap_writeback_only" token and invoke the dedicated handler. Add
> >> zswap_proactive_writeback() to walk the target memcg subtree via the
> >> per-memcg writeback cursor, draining per-node zswap LRUs through
> >> list_lru_walk_one() with the shrink_memcg_cb() callback.
> >>
> >> Suggested-by: Yosry Ahmed <yosry@xxxxxxxxxx>
> >> Suggested-by: Nhat Pham <nphamcs@xxxxxxxxx>
> >> Signed-off-by: Hao Jia <jiahao1@xxxxxxxxxxx>
> >> ---
> >> Documentation/admin-guide/cgroup-v2.rst | 18 +++-
> >> Documentation/admin-guide/mm/zswap.rst | 11 +-
> >> include/linux/zswap.h | 7 ++
> >> mm/vmscan.c | 14 +++
> >> mm/zswap.c | 138 ++++++++++++++++++++++++
> >> 5 files changed, 185 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> >> index 6efd0095ed99..6564abf0dec5 100644
> >> --- a/Documentation/admin-guide/cgroup-v2.rst
> >> +++ b/Documentation/admin-guide/cgroup-v2.rst
> >> @@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
> >>
> >> The following nested keys are defined.
> >>
> >> - ========== ================================
> >> + ==================== ==================================================
> >> swappiness Swappiness value to reclaim with
> >> - ========== ================================
> >> + zswap_writeback_only Only perform proactive zswap writeback
> >> + ==================== ==================================================
> >>
> >> Specifying a swappiness value instructs the kernel to perform
> >> the reclaim with that swappiness value. Note that this has the
> >> @@ -1437,6 +1438,19 @@ The following nested keys are defined.
> >> The valid range for swappiness is [0-200, max], setting
> >> swappiness=max exclusively reclaims anonymous memory.
> >>
> >> + The zswap_writeback_only key skips ordinary memory reclaim and
> >> + writes back pages from zswap to the backing swap device until
> >> + the requested amount has been written or no further candidates
> >> + are found. This is useful to proactively offload cold pages from
> >> + the zswap pool to the swap device. It is only available if
> >> + zswap writeback is enabled. zswap_writeback_only cannot be combined
> >> + with swappiness; specifying both returns -EINVAL.
> >> +
> >> + Example::
> >> +
> >> + # Write back up to 100MB of pages from zswap to the backing swap
> >> + echo "100M zswap_writeback_only" > memory.reclaim
> >
> >
> > memcg folks need to chime in about the interface here. An alternative
> > would be a separate interface (e.g. memory.zswap.do_writeback or
> > memory.zswap.writeback.reclaim or sth).
> >
> >> diff --git a/mm/zswap.c b/mm/zswap.c
> >> index 73e64a635690..7bcbf788f634 100644
> >> --- a/mm/zswap.c
> >> +++ b/mm/zswap.c
> >> @@ -1679,6 +1679,144 @@ int zswap_load(struct folio *folio)
> >> return 0;
> >> }
> >>
> >> +/*
> >> + * Maximum LRU scan limit:
> >> + * number of entries to scan per page of remaining budget.
> >> + */
> >> +#define ZSWAP_PROACTIVE_WB_SCAN_RATIO 16UL
> >> +/*
> >> + * Batch size for proactive writeback:
> >> + * - As the per-memcg writeback target in the outer memcg loop.
> >> + * - As the per-walk budget passed to list_lru_walk_one().
> >> + */
> >> +#define ZSWAP_PROACTIVE_WB_BATCH 128UL
> >> +
> >> +/*
> >> + * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
> >> + * Returns the number of pages written back, or -ENOENT if @memcg is a
> >> + * zombie or has writeback disabled.
> >> + */
> >> +static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
> >> + unsigned long nr_to_write)
> >> +{
> >> + unsigned long nr_written = 0;
> >> + int nid;
> >> +
> >> + if (!mem_cgroup_zswap_writeback_enabled(memcg))
> >> + return -ENOENT;
> >> +
> >> + if (!mem_cgroup_online(memcg))
> >> + return -ENOENT;
> >> +
> >> + for_each_node_state(nid, N_NORMAL_MEMORY) {
> >> + bool encountered_page_in_swapcache = false;
> >> + unsigned long nr_to_scan, nr_scanned = 0;
> >> +
> >> + /*
> >> + * Cap by LRU length: bounds rewalks when referenced
> >> + * entries keep rotating to the tail.
> >> + */
> >> + nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
> >> + if (!nr_to_scan)
> >> + continue;
> >> +
> >> + /*
> >> + * Cap by SCAN_RATIO * remaining budget: bounds scan cost
> >> + * to the remaining writeback budget.
> >> + */
> >> + nr_to_scan = min(nr_to_scan,
> >> + (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
> >> +
> >> + while (nr_scanned < nr_to_scan) {
> >> + unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
> >> + nr_to_scan - nr_scanned);
> >> +
> >> + if (signal_pending(current))
> >> + return nr_written;
> >> +
> >> + /*
> >> + * Account for the committed budget rather than the walker's
> >> + * actual delta. If the list is emptied concurrently, the
> >> + * walker visits nothing and nr_scanned would never advance.
> >> + */
> >> + nr_scanned += nr_to_walk;
> >> +
> >> + nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> >> + &shrink_memcg_cb,
> >> + &encountered_page_in_swapcache,
> >> + &nr_to_walk);
> >> +
> >> + if (nr_written >= nr_to_write)
> >> + return nr_written;
> >> + if (encountered_page_in_swapcache)
> >> + break;
> >> +
> >> + cond_resched();
> >> + }
> >> + }
> >> +
> >> + return nr_written;
> >> +}
> >> +
> >> +int zswap_proactive_writeback(struct mem_cgroup *memcg,
> >> + unsigned long nr_to_writeback)
> >> +{
> >> + struct mem_cgroup *iter_memcg;
> >> + unsigned long nr_written = 0;
> >> + int failures = 0, attempts = 0;
> >> +
> >> + if (!memcg)
> >> + return -EINVAL;
> >> + if (!nr_to_writeback)
> >> + return 0;
> >> +
> >> + /*
> >> + * Writeback will be aborted with -EAGAIN if we encounter
> >> + * the following MAX_RECLAIM_RETRIES times:
> >> + * - No writeback-candidate memcgs found in a subtree walk.
> >> + * - A writeback-candidate memcg wrote back zero pages.
> >> + */
> >> + while (nr_written < nr_to_writeback) {
> >> + unsigned long batch_size;
> >> + long shrunk;
> >> +
> >> + if (signal_pending(current))
> >> + return -EINTR;
> >> +
> >> + iter_memcg = zswap_mem_cgroup_iter(memcg);
> >> +
> >> + if (!iter_memcg) {
> >> + /*
> >> + * Continue without incrementing failures if we found
> >> + * candidate memcgs in the last subtree walk.
> >> + */
> >> + if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
> >> + return -EAGAIN;
> >> + attempts = 0;
> >> + continue;
> >> + }
> >> +
> >> + batch_size = min(nr_to_writeback - nr_written,
> >> + ZSWAP_PROACTIVE_WB_BATCH);
> >> + shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
> >> + mem_cgroup_put(iter_memcg);
> >> +
> >> + /* Writeback-disabled or offline: skip without counting. */
> >> + if (shrunk == -ENOENT)
> >> + continue;
> >> +
> >> + ++attempts;
> >> + if (shrunk > 0)
> >> + nr_written += shrunk;
> >> + else if (++failures == MAX_RECLAIM_RETRIES)
> >> + return -EAGAIN;
> >> +
> >> + cond_resched();
> >> + }
> >> +
> >> + return 0;
> >> +}
> >> +
> >
> > There is a lot of copy+paste from shrink_worker() and shrink_memcg()
> > here. We really should be able to reuse shrink_memcg().
> >
>
> I will do some consolidation and code reuse in the next version.
>
> > Is the main difference that we are scanning in batches here? I think we
> > can have shrink_memcg() do that too. If anything, it might make the
> > shrinker more efficient. Over-reclaim is ofc a concern, and especially
> > in the zswap_store() path as the overhead can be noticeable. Maybe we
> > can parameterize the batch size based on the code path.
> >
> > Nhat, what do you think?
>
> Nhat, since we now have the referenced-based second chance algorithm,
> should we consider doing batch writeback for shrink_memcg() as well?

I just took a look at shrink_memcg() and realized it's reclaiming one
page at a time. Thanks for the reminder - I hate it.

Please batchify it if it makes your life easier :) We don't reclaim
"just one page/object" anywhere else in the reclaim path. Sure, it
adds a bit of latency to zswap_store() if we reach the cgroup limit, but
IMHO if we hit the zswap.max limit at zswap_store() time, that is already
the slowest of slow paths, one you should have avoided with proactive
reclaim or the zswap shrinker in the first place. And setting zswap.max
does not make sense to me to begin with (I can write a whole essay
about it).
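
To make the batching idea concrete, here is a rough userspace model, not
kernel code. It only illustrates two points from this thread: picking a
batch size per code path, and how the second-chance bit makes a single walk
write back fewer entries than it visits. Every name below (wb_path,
model_entry, wb_batch_for_path, model_shrink_batch) is hypothetical, and the
batch values for the store and shrinker paths are made up; only 128 mirrors
ZSWAP_PROACTIVE_WB_BATCH from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical caller types; the real code paths are zswap_store(),
 * the zswap shrinker, and the proposed proactive writeback. */
enum wb_path { WB_PATH_STORE, WB_PATH_SHRINKER, WB_PATH_PROACTIVE };

/* Toy stand-in for an LRU entry (not struct zswap_entry). */
struct model_entry {
	bool referenced;	/* second-chance bit */
	bool written;		/* already written back */
};

/* Assumed, illustrative batch sizes: small where latency matters most. */
static size_t wb_batch_for_path(enum wb_path path)
{
	switch (path) {
	case WB_PATH_STORE:
		return 8;	/* made-up value */
	case WB_PATH_SHRINKER:
		return 32;	/* made-up value */
	case WB_PATH_PROACTIVE:
		return 128;	/* mirrors ZSWAP_PROACTIVE_WB_BATCH */
	}
	return 1;
}

/*
 * Visit up to @batch entries of a circular array starting at *@cursor.
 * A referenced entry has its bit cleared and is skipped (its second
 * chance); an unreferenced, not-yet-written entry is "written back".
 * Returns the number of entries written back in this walk.
 */
static size_t model_shrink_batch(struct model_entry *lru, size_t len,
				 size_t *cursor, size_t batch)
{
	size_t written = 0;

	for (size_t walked = 0; walked < batch && len > 0; walked++) {
		struct model_entry *e = &lru[*cursor];

		*cursor = (*cursor + 1) % len;
		if (e->written)
			continue;
		if (e->referenced) {
			e->referenced = false;
			continue;
		}
		e->written = true;
		written++;
	}
	return written;
}
```

A caller would do something like
model_shrink_batch(lru, len, &cursor, wb_batch_for_path(WB_PATH_STORE)) and
compare the return value against its remaining budget. In this model a
referenced entry only goes out on a later walk, which is the same reason the
patch may return -EAGAIN after writing back less than requested.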