Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path

From: Ojaswin Mujoo

Date: Tue Jun 02 2026 - 01:39:39 EST

On Tue, Jun 02, 2026 at 11:22:12AM +0800, Zhang Yi wrote:
> On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:
> > On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
> > > On 5/30/2026 3:22 PM, Zhang Yi wrote:
> > > > Hi, Ojaswin!
> > > >
> > > > On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> > > > > On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> > > > > > From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > > > > >
> > > > > > For append writes, wait for ordered I/O to complete before updating
> > > > > > i_disksize. This ensures that zeroed data is flushed to disk before the
> > > > > > metadata update, preventing stale data from being exposed during
> > > > > > unaligned post-EOF append writes.
> > > > > >
> > > > > > Suggested-by: Jan Kara <jack@xxxxxxx>
> > > > > > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > > > > > ---
> > > > > > fs/ext4/ext4.h | 11 +++++++
> > > > > > fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> > > > > > fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> > > > > > fs/ext4/super.c | 23 ++++++++++----
> > > > > > 4 files changed, 161 insertions(+), 13 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > > index 078feda47e36..9ce2128eea3e 100644
> > > > > > --- a/fs/ext4/ext4.h
> > > > > > +++ b/fs/ext4/ext4.h
> > > > > > @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> > > > > > #ifdef CONFIG_FS_ENCRYPTION
> > > > > > struct fscrypt_inode_info *i_crypt_info;
> > > > > > #endif
> > > > > > +
> > > > > > + /*
> > > > > > + * Track ordered zeroed data during post-EOF append writes, fallocate,
> > > > > > + * and truncate-up operations. These parameters are used only in the
> > > > > > + * iomap buffered I/O path.
> > > > > > + */
> > > > > > + ext4_lblk_t i_ordered_lblk;
> > > > > > + ext4_lblk_t i_ordered_len;
> > > > > > + wait_queue_head_t i_ordered_wq;
> > > > > > };
> > > > > > /*
> > > > > > @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> > > > > > __u64 len, __u64 *moved_len);
> > > > > > /* page-io.c */
> > > > > > +#define EXT4_IOMAP_IOEND_ORDER_IO 1UL /* This I/O is an ordered one */
> > > > > > +
> > > > > > extern int __init ext4_init_pageio(void);
> > > > > > extern void ext4_exit_pageio(void);
> > > > > > extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > > index e013aeb03d7b..11fb369efeb1 100644
> > > > > > --- a/fs/ext4/inode.c
> > > > > > +++ b/fs/ext4/inode.c
> > > > > > @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> > > > > > {
> > > > > > struct iomap_ioend *ioend = wpc->wb_ctx;
> > > > > > struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> > > > > > + ext4_lblk_t start, end, order_lblk, order_len;
> > > > > > /*
> > > > > > * After I/O completion, a worker needs to be scheduled when:
> > > > > > @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> > > > > > test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> > > > > > ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > > > > > + /*
> > > > > > + * Mark the I/O as ordered. Ordered I/O requires separate endio
> > > > > > + * handling and must not be merged with regular I/O operations.
> > > > > > + */
> > > > > > + order_len = READ_ONCE(ei->i_ordered_len);
> > > > > > + if (order_len) {
> > > > > > + /*
> > > > > > + * Pair with smp_store_release() in ext4_block_zero_eof().
> > > > > > + * Ensure we see the updated i_ordered_lblk that was written
> > > > > > + * before the release store to i_ordered_len.
> > > > > > + */
> > > > > > + smp_rmb();
> > > > > > + order_lblk = READ_ONCE(ei->i_ordered_lblk);
> > > > > > + start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> > > > > > + end = EXT4_B_TO_LBLK(ioend->io_inode,
> > > > > > + ioend->io_offset + ioend->io_size);
> > > > > > +
> > > > > > + if (start <= order_lblk && end >= order_lblk + order_len) {
> > > > >
> > > > > Hi Zhang,
> > > > >
> > > > > I guess this check is enough cause ordered_lblk and ordered_len will
> > > > > always be contained in a single block.
> > > >
> > > > Yeah.
> > > >
> > > > >
> > > > > > + ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > > > > > + ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> > > > > > + ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> > > > >
> > > > > FWIU, we are wanting the ordered IO to not be merged and submitted asap
> > > > > since we want to wake up the waiters. Is there any other reason?
> > > >
> > > > My original intention was to prevent the loss of the
> > > > EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> > > > completion, which could be caused by merging an ordered ioend with a
> > > > normal ioend. In patch 19, we need to determine the flag to update
> > > > i_disksize to the correct position.
> >
> > Ahh okay, we don't want the flag to be lost.
> >
> > > >
> > > > >
> > > > > Adding the boundary in ->writeback_submit() only affects
> > > > > iomap_ioend_can_merge() which happens after we have woken up the waiters
> > > > > and deferred the IO to the wq. We ideally want it affect
> > > > > iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> > > > > ->writeback_range().
> > > >
> > > > IIUC, merging into the same ioend during the submission stage doesn't
> > > > seem to cause any problems.
> >
> > Got it since the flag is set later. I was thinking we want to quickly
> > issue the ordered IO to wake up the waiters and not waste time trying to
> > merge it and hence we wanted to use that flag.
> >
> > > >
> > > > >
> > > > > Secondly, I don't think boundary is the right flag here. It ensures
> > > > > that everything before the ordered iomap gets submitted and the ordered
> > > > > iomap starts a new ioend. This can still keep getting merged with the
> > > > > newer ioends untils we decide to submit the IO, which can delay waking
> > > > > up the waiters. If we really want the "no merge" behavior, we'll have to
> > > > > do something like [1] (Check the 2 NOMERGE flag patches).
> > > >
> > > > Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> > > > still cannot prevent merging. I missed this, thank you for pointing this
> > > > out. However, I think perhaps we should change iomap_ioend_can_merge()
> > > > to check the iomap_ioend->io_private. Something like:
> > > >
> > > > if (ioend->io_private || next->io_private)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > ioend->io_private != next->io_private
> >
> > I guess if the purpose is to just not lose the flag, then boundary seems
> > to work for because we only lose the flag if ordered ioend backward
> > merges to a prev one. Flag is retained if we forward merge. Which
> > boundary seems to take care of.
> >
> Yes, IOMAP_IOEND_BOUNDARY is indeed worked currently as it prevents flag
> loss. However, from the perspective of the iomap infrastructure, I
> believe it is still necessary to add the ioend->io_private !=
> next->io_private check. Because ioends with different io_private values
> should not be merged, as this carries the risk of potentially losing
> io_private or even memory leaks. With this check in iomap, we would no
> longer need IOMAP_IOEND_BOUNDARY.

I agree that even outside this patchset it seems like a sane thing to
do.
>
> > However, if we want to avoid merges so we can quickly issue IO and wake
> > up the waiters then the above change looks good. Also, if this is the
> > reason we'd also want to have this during submission stage so the flag
> > setting will probs have to move to ->wirteback_range()
>
> Yes. Issuing ordered I/O as soon as possible is beneficial as it reduces
> the latency of sync file range. Suppose when we are syncing data beyond
> the ordered range, the background writeback process has already started
> committing and bundled the ordered range into a large ioend (up to
> IOEND_BATCH_SIZE folios), then this sync operation will indeed
> experience significant latency. However, for other non-sync scenarios,
> there should be little benefit.

Yes that's true.

>
> But I'm not sure if this is strictly necessary, because in the existing
> implementation, issuing ordered I/O via data=ordered mode works the same
> way — it also doesn't issue ordered I/O as soon as possible, and still
> has to wait when encountering concurrent background writeback. So I
> think we can keep the current implementation for now and see user
> feedback to decide whether further optimization is needed.

I agree!

Thanks,
ojaswin
>
> Cheers,
> Yi.