Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Ojaswin Mujoo
Date: Tue Jun 02 2026 - 06:41:26 EST
On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> Introduce two new iomap_ops instances for ext4 buffered writes:
>
> - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> ext4_da_map_blocks() to map delalloc extents.
> - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> ext4_iomap_get_blocks() to directly allocate blocks.
>
> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> validity.
>
> Key changes and considerations:
>
> - Unwritten extents for new blocks (dioread_nolock always on)
> Since data=ordered mode is not used to prevent stale data exposure in
> the non-delayed allocation path, new blocks are always allocated as
> unwritten extents.
>
> - Short write and write failure handling
> a. Delalloc path: On short write or failure, the stale delalloc range
> must be dropped and its space reservation released. Otherwise, a
> clean folio may cover leftover delalloc extents, causing
> inaccurate space reservation accounting.
> b. Non-delalloc path: No cleanup of allocated blocks is needed on
> short write.
>
> - Lock ordering reversal
> The folio lock and transaction start ordering is reversed compared to
> the buffer_head buffered write path. To handle this, the journal
> handle must be stopped in iomap_begin() callbacks. The lock ordering
> documentation in super.c has been updated accordingly.
>
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
I went through this again and after our discussion the changes looks
okay. Just a small quesiton below but otherwise feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> ---
> fs/ext4/ext4.h | 4 ++
> fs/ext4/file.c | 20 +++++-
> fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/super.c | 10 ++-
> 4 files changed, 192 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1e27d73d7427..4832e7f7db82 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> struct buffer_head *bh);
> void ext4_set_inode_mapping_order(struct inode *inode);
> +int ext4_nonda_switch(struct super_block *sb);
> #define FALL_BACK_TO_NONDELALLOC 1
> #define CONVERT_INLINE_DATA 2
<snip>
> + ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> + return 0;
> +}
> +
> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> + loff_t offset, loff_t length, unsigned int flags,
> + struct iomap *iomap, struct iomap *srcmap)
> +{
> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> + iomap, srcmap, false);
> +}
> +
> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> + loff_t offset, loff_t length, unsigned int flags,
> + struct iomap *iomap, struct iomap *srcmap)
> +{
> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> + iomap, srcmap, true);
> +}
> +
> +/*
> + * On write failure, drop the stale delayed allocation range and release
> + * its reserved space for both start and end blocks. Otherwise, we may
> + * leave a range of delayed extents covered by a clean folio, which can
> + * result in inaccurate space reservation accounting.
> + */
> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> + loff_t length, struct iomap *iomap)
> +{
> + down_write(&EXT4_I(inode)->i_data_sem);
> + ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> + DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> + up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> + loff_t length, ssize_t written,
> + unsigned int flags,
> + struct iomap *iomap)
> +{
> + loff_t start_byte, end_byte;
> +
> + /* If we didn't reserve the blocks, we're not allowed to punch them. */
> + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
either creates a new IOMAP_DELALLOC extent or finds older ones which
won't have EXT4_MAP_NEW set
> + return 0;
> +
> + /* Nothing to do if we've written the entire delalloc extent */
> + start_byte = iomap_last_written_block(inode, offset, written);
> + end_byte = round_up(offset + length, i_blocksize(inode));
> + if (start_byte >= end_byte)
> + return 0;
> +
> + filemap_invalidate_lock(inode->i_mapping);
> + iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> + iomap, ext4_iomap_punch_delalloc);
> + filemap_invalidate_unlock(inode->i_mapping);
> + return 0;
> +}