Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path

From: Zhang Yi

Date: Sat May 30 2026 - 05:33:30 EST


On 5/28/2026 9:34 PM, Ojaswin Mujoo wrote:
> On Wed, May 27, 2026 at 09:28:28PM +0530, Ojaswin Mujoo wrote:
>> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>
>>> For append writes, wait for ordered I/O to complete before updating
>>> i_disksize. This ensures that zeroed data is flushed to disk before the
>>> metadata update, preventing stale data from being exposed during
>>> unaligned post-EOF append writes.
>>>
>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>> ---
>>> fs/ext4/ext4.h | 11 +++++++
>>> fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>> fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>> fs/ext4/super.c | 23 ++++++++++----
>>> 4 files changed, 161 insertions(+), 13 deletions(-)
>>>
[...]
>>> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>>> loff_t from, loff_t end)
>>> {
>>> struct address_space *mapping = inode->i_mapping;
>>> + struct ext4_inode_info *ei = EXT4_I(inode);
>>> struct folio *folio;
>>> bool do_submit = false;
>>> + int ret;
>>>
>>> folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>> if (IS_ERR(folio))
>>> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>>> folio_wait_writeback(folio);
>>> WARN_ON_ONCE(folio_test_writeback(folio));
>>>
>>> - if (likely(folio_test_dirty(folio)))
>>> + /*
>>> + * Mark the ordered range. It will be cleared upon I/O completion
>>> + * in ext4_iomap_end_bio(). Any operation that extends i_disksize
>>> + * (including append write end io past the zeroed boundary,
>>> + * truncate up and append fallocate) must wait for this I/O to
>>> + * complete before updating i_disksize.
>>> + *
>>> + * When multiple overlapping unaligned EOF writes are in flight, we
>>> + * only need to track and wait for the first one. Subsequent writes
>>> + * will zero the gap in memory and ensure that the zeroed data is
>>> + * written out along with the valid data in the same block before
>>> + * i_disksize is updated.
>>> + */
>>> + if (likely(folio_test_dirty(folio) &&
>>> + READ_ONCE(ei->i_ordered_len) == 0)) {
>>> + WRITE_ONCE(ei->i_ordered_lblk,
>>> + from >> inode->i_blkbits);
>>> + /*
>>> + * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
>>> + * and ext4_iomap_wb_ordered_wait(). Ensure the updated
>>> + * i_ordered_lblk is visible when i_ordered_len becomes
>>> + * non-zero.
>>> + */
>>> + smp_store_release(&ei->i_ordered_len, 1);
>>> do_submit = true;
>>> + }
>>> folio_unlock(folio);
>>> folio_put(folio);
>>>
>>> /* Submit zeroed block. */
>>> - if (do_submit)
>>> - return filemap_fdatawrite_range(mapping, from, end - 1);
>>> + if (do_submit) {
>>> + ret = filemap_fdatawrite_range(mapping, from, end - 1);
>>> + if (ret) {
>>> + /*
>>> + * Pairs with wait_event() in
>>> + * ext4_iomap_wb_ordered_wait(). Ensure
>>> + * i_ordered_len = 0 is visible before waking up
>>> + * waiters.
>>> + */
>>> + smp_store_release(&ei->i_ordered_len, 0);
>>> + wake_up_all(&ei->i_ordered_wq);
>>> + return ret;
>
> Okay so even if the ordered IO fails we still let the i_disksize updates
> go ahead?

Yes when data_err=ignore, no when data_err=abort.

> I think this is a deviation from the current behavior where we
> abort the journal. If this is acceptable we should atleast add a comment
> on why its okay.
>

I think this behavior is consistent with the current data=ordered mode.
In the data_err=ignore mode, if an I/O write fails, ext4_end_io_end()
does not abort the journal, so i_disksize is still updated normally.
Conversely, in the data_err=abort mode, the journal is aborted, and
since i_disksize is not updated, it cannot be updated afterwards. Am I
missing something?

>>> + }
>>> + }
>>> return 0;
>>> }
>>>
>>> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>> * data=ordered mode. We submit zeroed range directly here.
>>> * Do not wait for I/O completion for performance.
>>> *
>>> - * TODO: Any operation that extends i_disksize (including
>>> - * append write end io past the zeroed boundary, truncate up,
>>> - * and append fallocate) must wait for the relevant I/O to
>>> - * complete before updating i_disksize.
>>> + * The end_io handler ext4_iomap_wb_ordered_wait() will wait
>>> + * for I/O completion before updating i_disksize if the write
>>> + * extends beyond the zeroed boundary.
>>> + *
>>> + * TODO: Any other operation that extends i_disksize
>>> + * (including truncate up and append fallocate) must wait for
>>> + * the relevant I/O to complete before updating i_disksize.
>>> */
>>> } else if (ext4_inode_buffered_iomap(inode)) {
>>> err = ext4_iomap_submit_zero_block(inode, from, end);
>>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
>>> index 3050c887329f..ad05ebb49bf6 100644
>>> --- a/fs/ext4/page-io.c
>>> +++ b/fs/ext4/page-io.c
>>> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
>>> return 0;
>>> }
>>>
>>> +/*
>>> + * If the old disk size is not block size aligned and the current
>>> + * writeback range is entirely beyond the old EOF block, we should
>>> + * wait for the zeroed data written in ext4_block_zero_eof() to be
>>> + * written out, otherwise, it may expose stale data in that block.
>>> + */
>>> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
>>> + loff_t pos, loff_t end)
>>> +{
>>> + struct ext4_inode_info *ei = EXT4_I(inode);
>>> + unsigned int blocksize = i_blocksize(inode);
>>> + loff_t disksize = READ_ONCE(ei->i_disksize);
>>> + ext4_lblk_t order_lblk, order_len;
>>> +
>>> + /*
>>> + * Waiting for ordered I/O is unnecessary when:
>>> + * - The on-disk size is block-aligned (no stale data exists).
>>> + * - The write start is within the block of the old EOF
>>> + * (overwriting, or appending to a block that already contains
>>> + * valid data).
>>> + */
>>> + if (!(disksize & (blocksize - 1)) ||
>>> + pos < round_up(disksize, blocksize))
>>> + return;
>
> Okay these checks are pretty confusing. I was intially thinking that
> i_disksize's block would always be equal to i_ordered_lblk but seems
> like that is not true because ext4_block_zero_eof() uses from=i_size.

Yeah, this is the key point that I was a bit confused about as
well.

>
> So we could have a sequence where
>
> 1. truncate 4k (i_disksize = i_size = 4k)
> 2. write 8k,10k (i_disksize = 4k i_size = 10k, i_ordered_len = 0 (old isisze is block aligned))
> 3. write 16k,18k (i_disksize = 4k i_size = 10k, i_ordered_len = 1, lblk=4)
18k lblk=2, (10k >> 12)
>
> Here we issue ordered IO even though it' probably not needed. Now if
> write 3 finishes first we see disksize as 4k so we don't wait for
> ordered write. Which seems okay since we don't risk any stale data
> exposure. However, this flow is pretty confuing.

Indeed!

>
> Can't we somehow avoid having to issue/set ordered len/lblk in case it
> is not really needed, like only issue it if i_disksize (and not i_size)
> is unaligned. That can simplify some of our check and avoid extra IO
> overhead.
>

I was also planning to explore optimizations on this point next.
However, since the original logic in buffer_head already works this way,
keeping the same logic in the iomap path will not introduce any
additional side effects. To avoid unnecessary waiting, I simply added
the disksize alignment check in ext4_iomap_wb_ordered_wait().

Therefore, I do not plan to implement this optimization in this series.
I can open a separate series later to address this optimization — perhaps
by checking i_disksize in ext4_block_zero_eof() before issuing or adding
ordered I/O, and the buffer_head path might also benefit from optimization.
Meanwhile, to avoid confusion, I can add a TODO comment in this patch.

What do you think?

Cheers,
Yi.