Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Hao Jia

Date: Wed Jun 03 2026 - 21:58:50 EST

On 2026/6/4 01:53, Yosry Ahmed wrote:

On Wed, Jun 03, 2026 at 11:02:54AM +0800, Hao Jia wrote:

On 2026/6/3 07:19, Yosry Ahmed wrote:

Proactive writeback also wants a similar per-memcg cursor that is
scoped to the specified memcg, so that repeated invocations against
the same memcg make forward progress across its descendant memcgs
instead of restarting from the first child memcg each time.

Is this a problem in practice?

Is the concern the overhead of scanning memcgs repeatedly, or lack of
fairness? I wonder if we should just do writeback in batches from all
memcgs, similar to how reclaim does it, then evaluate at the end if we
need to start over?

Not using a per-cgroup cursor will cause issues for "repeated small-budget
calls" cases. For example, repeatedly triggering a 2MB writeback might
result in only writing back pages from the first few child memcgs every
time. In the worst-case scenario (where the writeback amount is less than
WB_BATCH), it might only ever write back from the first child memcg.

Right, so a fairness concern?

I wonder if we should just reclaim a batch from each memcg, then check
if we reached the goal, otherwise start over. If the batch size is small
enough that should work?

Even with a small batch size, for small writeback requests triggered by
user-space (e.g., 2MB, which is batch size * N), it might still repeatedly
write back from only the first N child memcgs.

Yes, I understand, I am asking if this is a problem in practice. For
this to be a problem we'd need to trigger small writeback requests and
have many memcgs.

This could cause the user-space agent to prematurely give up on zswap
writeback.

Why? The kernel should not return before trying to write back from all
memcgs. If we scan the first N child memcgs and did not write back
enough, we should keep going, right?

Yes, this issue is not caused by the kernel, but rather by our user-space
agent itself.

For instance, suppose a parent memcg has two children, memcg1 and memcg2,
each with 200MB of zswap (100MB inactive). Triggering proactive writeback on
the parent memcg will exhaust memcg1's inactive zswap pages. After that,
even though memcg2 still has plenty of inactive zswap pages, it will
continue to write back memcg1's active zswap pages. Writing back active
zswap pages causes the user-space agent to prematurely abort the writeback
because it detects that certain memcg metrics have exceeded predefined
thresholds.
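
The two-child example above can be replayed with a toy user-space model.
This is only a sketch, not kernel code, and it rests on assumptions that
are not stated in the thread: BATCH_MB is a hypothetical batch size, each
request is no larger than one batch, every invocation restarts from the
first child (no cursor), and a memcg's inactive pages are written back
before its active ones.

```python
BATCH_MB = 2  # hypothetical per-visit batch size, not the real WB_BATCH


def proactive_writeback(children, request_mb):
    """One invocation without a cursor; returns MB of active pages written.

    Single pass only: with request_mb <= BATCH_MB the first child always
    satisfies the request, which is the case being illustrated.
    """
    remaining = request_mb
    active_written = 0
    for child in children:  # iteration always starts at children[0]
        if remaining <= 0:
            break
        available = child["inactive"] + child["active"]
        take = min(remaining, BATCH_MB, available)
        # Assumption: inactive pages are written back before active ones.
        from_inactive = min(take, child["inactive"])
        from_active = take - from_inactive
        child["inactive"] -= from_inactive
        child["active"] -= from_active
        active_written += from_active
        remaining -= take
    return active_written


def calls_until_active_writeback(request_mb=2):
    """Repeat small requests until active pages start being written back."""
    children = [
        {"name": "memcg1", "inactive": 100, "active": 100},
        {"name": "memcg2", "inactive": 100, "active": 100},
    ]
    calls = 0
    while True:
        calls += 1
        if proactive_writeback(children, request_mb) > 0:
            return calls, children


calls, children = calls_until_active_writeback()
print(calls, children)
```

Under these assumptions, the 51st 2MB request is the first to touch
memcg1's active pages, while memcg2 still has all 100MB of its inactive
pages.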

This will only happen if the reclaim size is smaller than the batch
size, right? Otherwise the kernel should reclaim more or less equally
from both memcgs?

I gave it some thought. Not using a cursor could lead to unfairness issues
with certain writeback sizes:

- If the writeback size is an odd multiple of WB_BATCH (e.g., triggering a
  writeback of 3 * WB_BATCH), with 2 child cgroups, the writeback ratio
  might end up being 2:1.
- If a memcg has 5 child cgroups and a writeback of 2 * WB_BATCH is
  triggered, it might repeatedly write back from only the first 2 child
  cgroups.

Although setting a smaller WB_BATCH might mitigate this unfairness, it could
hurt writeback efficiency. Let's just use per-memcg cursors to completely
fix these corner cases.
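
The two cases above can be checked with a small user-space simulation.
This is a rough sketch rather than the patch's logic: it assumes each
visit to a child writes back exactly one batch, that children never run
out of pages, and that a call visits children round-robin until the
requested number of batches is reached. The names writeback() and
simulate() are made up for illustration.

```python
from collections import Counter


def writeback(children, nr_batches, start=0):
    """Write back nr_batches batches round-robin, starting at index start.

    Returns (per-child batch counts, index where the next call would resume).
    """
    counts = Counter()
    pos = start
    for _ in range(nr_batches):
        counts[children[pos]] += 1  # one batch per visit (assumption)
        pos = (pos + 1) % len(children)
    return counts, pos


def simulate(children, nr_batches, calls, use_cursor):
    """Total batches written per child over repeated identical requests."""
    total = Counter({c: 0 for c in children})
    cursor = 0
    for _ in range(calls):
        start = cursor if use_cursor else 0  # no cursor: restart at child 0
        counts, cursor = writeback(children, nr_batches, start)
        total.update(counts)
    return dict(total)


two = ["c0", "c1"]
five = ["c0", "c1", "c2", "c3", "c4"]

# Case 1: 3 batches per call, 2 children.
print(simulate(two, 3, 10, use_cursor=False))
print(simulate(two, 3, 10, use_cursor=True))

# Case 2: 2 batches per call, 5 children.
print(simulate(five, 2, 10, use_cursor=False))
print(simulate(five, 2, 10, use_cursor=True))
```

In this model, ten calls without a cursor give a 20:10 split in the first
case and leave the last three children untouched in the second. With a
cursor that persists across calls, both cases even out (15:15, and 4
batches per child).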

Thanks,
Hao

Of course, real-world scenarios are much more complex, and this kind of case
is extremely rare in our environment.

That being said, your suggestion of using the global lock for the per-memcg
cursors makes the writeback fairer and would resolve these corner cases.

Right, but I'd rather not do per-memcg cursors at all if we can avoid
it. Will using batches help make reclaim fair over all memcgs without a
cursor?

We can always add the cursor later if needed.