Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path

From: Ojaswin Mujoo

Date: Mon Jun 01 2026 - 14:35:40 EST

On Mon, Jun 01, 2026 at 08:22:09PM +0800, Zhang Yi wrote:
> On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
> > On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
> > > On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
> > > > On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
> > > > > From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > > > >
> > > > > In the generic buffered_head I/O path, we rely on the data=order mode to
> > > > > ensure that the zeroed EOF block data is written before updating
> > > > > i_disksize, thus preventing stale data from being exposed.
> > > > >
> > > > > However, the iomap buffered I/O path cannot use this mechanism. Instead,
> > > > > we issue the I/O immediately after performing the zero operation
> > > > > (without synchronous waiting for performance). This can reduce the risk
> > > > > of exposing stale data, but it does not guarantee that the zero data
> > > > > will be flushed to disk before the metadata of i_disksize is updated.
> > > > > The subsequent patches will wait for this I/O to complete before
> > > > > updating i_disksize.
> > > > >
> > > > > Suggested-by: Jan Kara <jack@xxxxxxx>
> > > > > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > > >
> > > > I think we discussed that we may not need to do this [1] but I guess
> > > > you've decided to make the tradeoff of issuing the IO to avoid having to
> > > > wait for bg flush to complete the tail page zeroing
> > > >
>
> I'm glad to hear that, thanks.
>
> > >
> > > Yes. For truncate up and append fallocate, originally i_disksize would
> > > be updated immediately, and the change would be persisted via the
> > > journal within default 5 seconds. But now, if the tail page is not
> > > committed immediately, the update to i_disksize will be delayed by about
> > > 30 seconds, and persistence will be postponed to around 35 seconds. I'm
> > > not sure what impact this change might have — I just don't really want
> > > to introduce it.
> >
> > > For normal append writes, the impact is minimal, unless we call
> > > sync_range to sync the portion of data that extends beyond EOF.
> >
> > Hmm while trying to retain the behavior for falloc and truncate up,
> > we end up changing it for append writes :) But anyways, I understand
> > your reasoning and don't have any strong opinions against it. I'll let
> > Jan pitch in since he had some comments around this.
> >
> > >
> > > In addition, if the zeroed page is not issued here immediately, the
> > > logic will become more complex because we need to more careful about the
> > > order of write-back IOs to prevent deadlock issues caused by mutual
> > > waiting.
> >
> > You mean an endio completion waiting for ordered IO to complete but
> > ordered IO writeback is somehow waiting for this endio completion? Is that
> > actually possible?
>
> Well, after thinking it over more carefully, it seems this is
> impossible, I cannot think of a scenario that could trigger this kind of
> issue. The generic writeback process always executes writeback in folio
> index order, so there would be no situation where a later data I/O
> depends on an earlier ordered I/O. Even if both kinds of IOs are placed
> in the same ioend, there should be no problem. I was confused and
> overthinking it.
>
> From this perspective, if we can accept that truncate up and fallocate
> will have longer persistence time by default(I guess this is
> acceptable), we can avoid writing back zeroed data immediately. To
> achieve this, we only need to consider the case of sync file range. :-)

Yeah, I think during writeback we will have to submit the ordered data
if we are writing back data beyond the i_disksize.

If this is straightforward enough to implement, I think this approach
would be a safer choice cause we will avoid overheads due to small,
random ordered IOs overworking the writeback layer.

>
> Regards,
> Yi.
> > >
> > > > However, I think one side effect might be many threads calling the
> > > > writeback mechanism to issue zero IOs which might not scale well. I
> > > > don't know if it'll be a huge problem though, I guess it's a sort of
> > > > thing we will have to deal with in case we see it in real world
> > > > workloads.
> > > >
> > >
> > > I agree with you. However, I suspect that unless we run some specific
> > > benchmark tests, it should be difficult to encounter a large number of
> > > post-EOF append writes and truncate up operations in real-world usage
> > > scenarios — and I haven't come across such scenarios yet. For
> > > simplicity, I'd like to proceed with this implementation for now. If we
> > > do run into actual problems later, we can consider not issuing I/O
> > > directly here, but instead: 1) find the ordered block in
> > > ext4_sync_file() and perform writeback; 2) ensure writeback ordering
> > > for normal background writeback as well — otherwise, there is a risk of
> > > deadlock (mutual waiting). What do you think?
> >
> > Yes sounds good Yi, we can deal with performance tuning later.
> >
> > Regards,
> > Ojaswin
> >
> > >
> > > Cheers,
> > > Yi.
> > >
> > > > [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
> > > >
> > > > > ---
> > > > > fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
> > > > > 1 file changed, 55 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index 239d387ffaf2..e013aeb03d7b 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
> > > > > zero_written);
> > > > > }
> > > > > +static int ext4_iomap_submit_zero_block(struct inode *inode,
> > > > > + loff_t from, loff_t end)
> > > > > +{
> > > > > + struct address_space *mapping = inode->i_mapping;
> > > > > + struct folio *folio;
> > > > > + bool do_submit = false;
> > > > > +
> > > > > + folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> > > > > + if (IS_ERR(folio))
> > > > > + /* Already writeback and clear? */
> > > > > + return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> > > > > +
> > > > > + folio_wait_writeback(folio);
> > > > > + WARN_ON_ONCE(folio_test_writeback(folio));
> > > > > +
> > > > > + if (likely(folio_test_dirty(folio)))
> > > > > + do_submit = true;
> > > > > + folio_unlock(folio);
> > > > > + folio_put(folio);
> > > > > +
> > > > > + /* Submit zeroed block. */
> > > > > + if (do_submit)
> > > > > + return filemap_fdatawrite_range(mapping, from, end - 1);
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Zero out a mapping from file offset 'from' up to the end of the block
> > > > > * which corresponds to 'from' or to the given 'end' inside this block.
> > > > > @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> > > > > if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
> > > > > return 0;
> > > > > - if (length > blocksize - offset)
> > > > > + if (length > blocksize - offset) {
> > > > > length = blocksize - offset;
> > > > > + end = from + length;
> > > > > + }
> > > > > err = ext4_block_zero_range(inode, from, length,
> > > > > &did_zero, &zero_written);
> > > > > @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> > > > > * TODO: In the iomap path, handle this by updating i_disksize to
> > > > > * i_size after the zeroed data has been written back.
> > > > > */
> > > > > - if (ext4_should_order_data(inode) &&
> > > > > - did_zero && zero_written && !IS_DAX(inode)) {
> > > > > - handle_t *handle;
> > > > > + if (did_zero && zero_written && !IS_DAX(inode)) {
> > > > > + if (ext4_should_order_data(inode)) {
> > > > > + handle_t *handle;
> > > > > - handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> > > > > - if (IS_ERR(handle))
> > > > > - return PTR_ERR(handle);
> > > > > + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> > > > > + if (IS_ERR(handle))
> > > > > + return PTR_ERR(handle);
> > > > > - err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> > > > > - ext4_journal_stop(handle);
> > > > > - if (err)
> > > > > - return err;
> > > > > + err = ext4_jbd2_inode_add_write(handle, inode, from,
> > > > > + length);
> > > > > + ext4_journal_stop(handle);
> > > > > + if (err)
> > > > > + return err;
> > > > > + /*
> > > > > + * inodes using the iomap buffered I/O path do not use the
> > > > > + * data=ordered mode. We submit zeroed range directly here.
> > > > > + * Do not wait for I/O completion for performance.
> > > > > + *
> > > > > + * TODO: Any operation that extends i_disksize (including
> > > > > + * append write end io past the zeroed boundary, truncate up,
> > > > > + * and append fallocate) must wait for the relevant I/O to
> > > > > + * complete before updating i_disksize.
> > > > > + */
> > > > > + } else if (ext4_inode_buffered_iomap(inode)) {
> > > > > + err = ext4_iomap_submit_zero_block(inode, from, end);
> > > > > + if (err)
> > > > > + return err;
> > > > > + }
> > > > > }
> > > > > return 0;
> > > > > --
> > > > > 2.52.0
> > > > >
> > >
>