Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback

From: Nhat Pham

Date: Wed Jun 03 2026 - 14:54:06 EST

On Wed, Jun 3, 2026 at 11:43 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
>
> On Wed, Jun 3, 2026 at 11:34 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > On Wed, Jun 3, 2026 at 11:26 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> > >
> > > > > > Is the main difference that we are scanning in batches here? I think we
> > > > > > can have shrink_memcg() do that too. If anything, it might make the
> > > > > > shrinker more efficient. Over-reclaim is ofc a concern, and especially
> > > > > > in the zswap_store() path as the overhead can be noticeable. Maybe we
> > > > > > can parameterize the batch size based on the code path.
> > > > > >
> > > > > > Nhat, what do you think?
> > > > >
> > > > > Nhat, since we now have the referenced-based second chance algorithm,
> > > > > should we consider doing batch writeback for shrink_memcg() as well?
> > > >
> > > > I just take a look at shrink_memcg() and realized it's reclaiming one
> > > > page at a time. Thanks for the reminder - I hated it.
> > > >
> > > > Please batchify it if it makes your life easier :) We don't reclaim
> > > > "just one page/object" anywhere else in the reclaim path, Sure, it
> > > > adds a bit of latency to zswap_store() if we reached cgroup limit, but
> > > > IMHO if we hit zswap.max limit at zswap_store() time, that is already
> > > > the slowest of slow path that you should have avoided with proactive
> > > > reclaim/zswap shrinker in the first place. And, setting zswap.max does
> > > > not make sense to me in the first place (I can write a whole essay
> > > > about it).
> > >
> > > Should we batchify shrink_memcg() from the shrinker and background
> > > writeback, but leave the synchronous zswap_store() path to reclaim one
> > > page for this series at least to avoid potential regressions?
> > >
> > > I think this change specifically needs more intensive testing (vs the
> > > other code paths).
> >
> > I'm fine with having shrink_memcg() takes a batch_size argument for now :)
> >
> > I suspect not a lot of people invokes the shrink_memcg() synchronous
> > path in zswap store though. Setting zswap.max is hard (as it involves
> > guessing compression ratio ahead of time) and induces quite a bit of
> > overhead (obj_cgroup_may_zswap() does a force flush for every store if
> > you set zswap.max to a value other than 0 and max).
>
> Yeah I agree, but I just wanna make sure we don't completely kill
> performance in case anyone is actually doing that.
>
> Regarding the force flush, it's unfortunate that we need to rely on a
> stat for internal limit checking. It might make sense to use page
> counters here, which already support hierarchical charging and
> whatnot. We'll need something similar to the objcg approach to charge
> sub-pages (or maybe just reuse that somehow?).

Conceptually, I've moved towards not capping zswap.max, and using it
only as a binary knob for enablement/disablement :)

It's consuming the same resources it saves, so if we cap memory.max
(and swap.max until we have vswap), then we already have isolation.
The split between uncompressed/compressed should then be dynamically
determined by workload characteristics (working set size, access
pattern, compressibility) and service's SLO, and reclaimers
(preferably proactive reclaimers) will drive us towards the optimal
split/equilibrium.