Re: [REGRESSION] [PATCH] ceph: fix num_ops OBOE when crypto allocation fails
From: Sam Edwards
Date: Mon Mar 16 2026 - 16:15:27 EST
On Mon, Mar 16, 2026 at 10:44 AM Viacheslav Dubeyko
<Slava.Dubeyko@xxxxxxx> wrote:
>
> On Sun, 2026-03-15 at 16:25 -0700, Sam Edwards wrote:
> > move_dirty_folio_in_page_array() may fail if the file is encrypted, the
> > dirty folio is not the first in the batch, and it fails to allocate a
> > bounce buffer to hold the ciphertext. When that happens,
> > ceph_process_folio_batch() simply redirties the folio and flushes the
> > current batch -- it can retry that folio in a future batch.
> >
>
> How can this issue be reproduced? Do you have a reproduction script or anything
> like that?
Good day Slava,
Is this question about the preceding paragraph? If so: that paragraph
is just describing current (and intended) behavior, not an issue.
If this is just a general question about the patch, then I don't know
of a way to trigger the issue in a short timeframe, but something like
this ought to work:
1. Create a reasonably-sized (e.g. 4GiB) fscrypt-protected file in CephFS
2. Put the CephFS client system under heavy memory pressure, so that
bounce page allocation is more likely to fail
3. Repeatedly write to the file in a 4KiB-written/4KiB-skipped
pattern, starting over upon getting to the end of the file
4. Wait for the system to panic, gradually ramping up the memory
pressure until it does
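Step 3's access pattern can be sketched in userspace C; the helper name and constants below are illustrative, not from the patch or my actual workload:

```c
/* Hypothetical sketch of step 3's I/O pattern: write 4 KiB, skip
 * 4 KiB, wrap around at the end of the file. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

#define CHUNK 4096

/* Write `rounds` 4 KiB chunks at every other 4 KiB slot, wrapping at
 * `file_size`. Returns 0 on success, -1 on a short write. */
static int stride_write(int fd, off_t file_size, int rounds)
{
	char buf[CHUNK];
	off_t off = 0;
	int i;

	memset(buf, 0xA5, sizeof(buf));
	for (i = 0; i < rounds; i++) {
		if (pwrite(fd, buf, CHUNK, off) != CHUNK)
			return -1;
		off += 2 * CHUNK;	/* 4 KiB written, 4 KiB skipped */
		if (off >= file_size)
			off = 0;	/* start over at EOF */
	}
	return 0;
}
```

The skipped slots keep the dirty folios discontiguous, which is what forces ceph_process_folio_batch() to keep opening new extents.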
I run a workload that performs fairly random I/O atop CephFS+fscrypt.
Before this patch, I'd get a panic after about a day. After this
patch, I've been running for 4+ days without this particular issue
reappearing.
> > However, if this failed folio is not contiguous with the last folio that
> > did make it into the batch, then ceph_process_folio_batch() has already
> > incremented `ceph_wbc->num_ops`; because it doesn't follow through and
> > add the discontiguous folio to the array, ceph_submit_write() -- which
> > expects that `ceph_wbc->num_ops` accurately reflects the number of
> > contiguous ranges (and therefore the required number of "write extent"
> > ops) in the writeback -- will panic the kernel:
> >
> > BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
>
> I don't quite follow. We decrement ceph_wbc->num_ops, but BUG_ON() operates on
> req->r_num_ops. How does req->r_num_ops receive the value of ceph_wbc->num_ops?
ceph_submit_write() passes ceph_wbc->num_ops to ceph_osdc_new_request(),
which records it as req->r_num_ops.
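To make the mismatch concrete, here is a toy userspace model (not the kernel code; all names are illustrative) of the accounting: num_ops is bumped for a discontiguous folio *before* the folio is moved into the page array, so a failed move can leave a promised-but-empty extent behind:

```c
/* Toy model of the accounting bug: num_ops counts "write extent" ops
 * promised to the OSD request, incremented before the folio actually
 * lands in the array. */
#include <stdbool.h>
#include <stddef.h>

struct wbc_model {
	size_t num_ops;		/* extents promised (becomes r_num_ops) */
	size_t len;		/* bytes accumulated in the current extent */
	long   last_index;	/* index of the last folio actually added */
};

/* Process one folio; move_ok models whether the fscrypt bounce-page
 * allocation succeeds. Returns true if the folio entered the array;
 * *real_extents counts extents that received at least one folio. */
static bool process_folio(struct wbc_model *w, long index, bool move_ok,
			  size_t *real_extents)
{
	if (w->len && index != w->last_index + 1) {
		w->num_ops++;	/* promise a new extent for the gap */
		w->len = 0;
	}
	if (!move_ok)
		return false;	/* folio redirtied; extent may stay empty */
	if (w->len == 0)
		(*real_extents)++;
	w->len += 4096;
	w->last_index = index;
	return true;
}
```

Feed it one folio that lands and then a discontiguous one whose move fails: num_ops says two extents, but only one materialized. That is exactly the disagreement the BUG_ON() trips over.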
> We change ceph_wbc->num_ops, ceph_wbc->offset, and ceph_wbc->len here:
>
> } else if (!is_folio_index_contiguous(ceph_wbc, folio)) {
> if (is_num_ops_too_big(ceph_wbc)) {
> folio_redirty_for_writepage(wbc, folio);
> folio_unlock(folio);
> break;
> }
>
> ceph_wbc->num_ops++;
> ceph_wbc->offset = (u64)folio_pos(folio);
> ceph_wbc->len = 0;
> }
>
> First of all, technically speaking, move_dirty_folio_in_page_array() can fail
> even if is_folio_index_contiguous() returns true. Do you mean that we don't need
> to decrement ceph_wbc->num_ops in such a case?
Yes, exactly: as stated in the commit message, we only need to correct
the value "when move_dirty_folio_in_page_array() fails, but the folio
already started counting a new (i.e. still-empty) extent." The `len ==
0` test is checking for that new/still-empty condition.
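In the same toy-model terms as before (illustrative names, not the kernel's), the correction amounts to this: roll num_ops back only when the failed folio had just opened a new, still-empty extent, i.e. the `len == 0` case. A failure on a contiguous folio leaves a partially filled extent behind, so num_ops must stay put:

```c
/* Toy model of the patch's correction: undo the promised extent only
 * if it never received a folio (len == 0 at the point of failure). */
#include <stdbool.h>
#include <stddef.h>

struct wbc_model {
	size_t num_ops;	/* extents promised to the OSD request */
	size_t len;	/* bytes in the current extent so far */
};

static void undo_empty_extent(struct wbc_model *w, bool moved)
{
	if (!moved && w->len == 0)
		w->num_ops--;	/* the promised extent never got a folio */
}
```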
> Secondly, do we need to correct ceph_wbc->offset?
No, we do not; the valid lifetime of offset/len ends when
ceph_process_folio_batch() returns. I'd even argue they don't belong
in ceph_wbc at all and should be local variables instead, but that's a
matter for a different patch.
Cheers,
Sam