Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap

From: Ojaswin Mujoo

Date: Wed Jun 03 2026 - 07:30:45 EST

On Wed, Jun 03, 2026 at 10:56:34AM +0800, Zhang Yi wrote:
> On 6/2/2026 6:26 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>
> >> Introduce two new iomap_ops instances for ext4 buffered writes:
> >>
> >> - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> >> ext4_da_map_blocks() to map delalloc extents.
> >> - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> >> ext4_iomap_get_blocks() to directly allocate blocks.
> >>
> >> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> >> validity.
> >>
> >> Key changes and considerations:
> >>
> >> - Unwritten extents for new blocks (dioread_nolock always on)
> >> Since data=ordered mode is not used to prevent stale data exposure in
> >> the non-delayed allocation path, new blocks are always allocated as
> >> unwritten extents.
> >>
> >> - Short write and write failure handling
> >> a. Delalloc path: On short write or failure, the stale delalloc range
> >> must be dropped and its space reservation released. Otherwise, a
> >> clean folio may cover leftover delalloc extents, causing
> >> inaccurate space reservation accounting.
> >> b. Non-delalloc path: No cleanup of allocated blocks is needed on
> >> short write.
> >>
> >> - Lock ordering reversal
> >> The folio lock and transaction start ordering is reversed compared to
> >> the buffer_head buffered write path. To handle this, the journal
> >> handle must be stopped in iomap_begin() callbacks. The lock ordering
> >> documentation in super.c has been updated accordingly.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >
> > I went through this again and after our discussion the changes looks
> > okay. Just a small quesiton below but otherwise feel free to add:
> >
> > Reviewed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
>
> Thank you a lot for your careful review!
>
> >
> >> ---
> >> fs/ext4/ext4.h | 4 ++
> >> fs/ext4/file.c | 20 +++++-
> >> fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> >> fs/ext4/super.c | 10 ++-
> >> 4 files changed, 192 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 1e27d73d7427..4832e7f7db82 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> >> int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> >> struct buffer_head *bh);
> >> void ext4_set_inode_mapping_order(struct inode *inode);
> >> +int ext4_nonda_switch(struct super_block *sb);
> >> #define FALL_BACK_TO_NONDELALLOC 1
> >> #define CONVERT_INLINE_DATA 2
> >
> > <snip>
> >
> >> + ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> + return 0;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> + iomap, srcmap, false);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> + iomap, srcmap, true);
> >> +}
> >> +
> >> +/*
> >> + * On write failure, drop the stale delayed allocation range and release
> >> + * its reserved space for both start and end blocks. Otherwise, we may
> >> + * leave a range of delayed extents covered by a clean folio, which can
> >> + * result in inaccurate space reservation accounting.
> >> + */
> >> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> >> + loff_t length, struct iomap *iomap)
> >> +{
> >> + down_write(&EXT4_I(inode)->i_data_sem);
> >> + ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> >> + DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> >> + up_write(&EXT4_I(inode)->i_data_sem);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >> + loff_t length, ssize_t written,
> >> + unsigned int flags,
> >> + struct iomap *iomap)
> >> +{
> >> + loff_t start_byte, end_byte;
> >> +
> >> + /* If we didn't reserve the blocks, we're not allowed to punch them. */
> >> + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> >
> > Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
> > either creates a new IOMAP_DELALLOC extent or finds older ones which
> > won't have EXT4_MAP_NEW set
> >
>
> Oops. This is a bug！ In ext4_da_map_blocks(), when allocating a new
> delalloc extent, the EXT4_MAP_NEW flag should be set. If this flag is
> not set, then when a short write occurs, we cannot distinguish whether
> an extent is a pre-existing delalloc extent or a newly allocated one.
> This prevents the subsequent truncate operation from being executed,
> leaving the newly allocated delalloc extent behind. I will fix this in
> next iteration.

Yes thats true I misread the condition and missed that we will always
exit early here :/

Regards,
ojaswin

>
> Thanks,
> Yi.
>