Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap

From: Ojaswin Mujoo

Date: Tue Jun 02 2026 - 06:07:58 EST

On Fri, May 29, 2026 at 05:13:55PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
>
> On 5/27/2026 1:10 AM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>
> >> Introduce two new iomap_ops instances for ext4 buffered writes:
> >>
> >> - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> >> ext4_da_map_blocks() to map delalloc extents.
> >> - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> >> ext4_iomap_get_blocks() to directly allocate blocks.
> >>
> >> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> >> validity.
> >>
> >> Key changes and considerations:
> >>
> >> - Unwritten extents for new blocks (dioread_nolock always on)
> >> Since data=ordered mode is not used to prevent stale data exposure in
> >> the non-delayed allocation path, new blocks are always allocated as
> >> unwritten extents.
> >
> > Okay makes sense.
> >
> >>
> >> - Short write and write failure handling
> >> a. Delalloc path: On short write or failure, the stale delalloc range
> >> must be dropped and its space reservation released. Otherwise, a
> >> clean folio may cover leftover delalloc extents, causing
> >> inaccurate space reservation accounting.
> >
> > Hmm, okay so in the usual buffer head path, seems like during a short
> > write we still zero the new buffers we couldn't write and keep it dirty
> > (folio_zero_new_buffers()). This way they are still written back and
> > the delalloc reservations are used up.
> >
>
> In fact, in the normal buffer head path, writeback does not consume
> delalloc reservations. Instead, the reservations are retained until the
> inode is released or the area is written again using delalloc. This is
> because i_size is not updated during short writes. Therefore, when a
> zeroed dirty folio is written back, no block mapping is created for it.
> For details, please see the lblk >= blocks judgment in
> mpage_process_page_bufs().

Oh okay I see, I'm not very clear on the code path but what about a case
where i_size is beyond the short write range.

>
> This will not lead to duplicate space statistics, because
> ext4_da_map_blocks() only reserves space when inserting a new delalloc
> extent. Therefore, this does not pose a serious issue. However, It may
> cause some temporary and minor space leaks. Nevertheless, I think it
> would be better if delalloc extents could be released for the buffer
> head path when short writes occur.

Yes true, ideally it would be more intuitive if we cancelled the
reservations in short write.

Regards,
ojaswin

>
> > However in iomap we don't mark the range that we couldnt write as dirty
> > so we need to make sure we clear up the stale delalloc mappings. Is this
> > correct?
> >
> Yeah.
>
> Thanks,
> Yi.
>
> > Regards,
> > Ojaswin
> >
> >> b. Non-delalloc path: No cleanup of allocated blocks is needed on
> >> short write.
> >>
> >> - Lock ordering reversal
> >> The folio lock and transaction start ordering is reversed compared to
> >> the buffer_head buffered write path. To handle this, the journal
> >> handle must be stopped in iomap_begin() callbacks. The lock ordering
> >> documentation in super.c has been updated accordingly.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >> ---
> >> fs/ext4/ext4.h | 4 ++
> >> fs/ext4/file.c | 20 +++++-
> >> fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> >> fs/ext4/super.c | 10 ++-
> >> 4 files changed, 192 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 1e27d73d7427..4832e7f7db82 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> >> int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> >> struct buffer_head *bh);
> >> void ext4_set_inode_mapping_order(struct inode *inode);
> >> +int ext4_nonda_switch(struct super_block *sb);
> >> #define FALL_BACK_TO_NONDELALLOC 1
> >> #define CONVERT_INLINE_DATA 2
> >>
> >> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> >>
> >> extern const struct iomap_ops ext4_iomap_ops;
> >> extern const struct iomap_ops ext4_iomap_report_ops;
> >> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> >> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> >> +extern const struct iomap_write_ops ext4_iomap_write_ops;
> >>
> >> static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> >> {
> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> index eb1a323962b1..7f9bfbbc4a4e 100644
> >> --- a/fs/ext4/file.c
> >> +++ b/fs/ext4/file.c
> >> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >> return count;
> >> }
> >>
> >> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> >> + struct iov_iter *from)
> >> +{
> >> + struct inode *inode = file_inode(iocb->ki_filp);
> >> + const struct iomap_ops *iomap_ops;
> >> +
> >> + if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> >> + iomap_ops = &ext4_iomap_buffered_da_write_ops;
> >> + else
> >> + iomap_ops = &ext4_iomap_buffered_write_ops;
> >> +
> >> + return iomap_file_buffered_write(iocb, from, iomap_ops,
> >> + &ext4_iomap_write_ops, NULL);
> >> +}
> >> +
> >> static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >> struct iov_iter *from)
> >> {
> >> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >> if (ret <= 0)
> >> goto out;
> >>
> >> - ret = generic_perform_write(iocb, from);
> >> + if (ext4_inode_buffered_iomap(inode))
> >> + ret = ext4_iomap_buffered_write(iocb, from);
> >> + else
> >> + ret = generic_perform_write(iocb, from);
> >>
> >> out:
> >> inode_unlock(inode);
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 39577a6b65b9..1ae7d3f4a1c8 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
> >> return ret;
> >> }
> >>
> >> -static int ext4_nonda_switch(struct super_block *sb)
> >> +int ext4_nonda_switch(struct super_block *sb)
> >> {
> >> s64 free_clusters, dirty_clusters;
> >> struct ext4_sb_info *sbi = EXT4_SB(sb);
> >> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> >> return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
> >> }
> >>
> >> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> >> +{
> >> + return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> >> +}
> >> +
> >> +const struct iomap_write_ops ext4_iomap_write_ops = {
> >> + .iomap_valid = ext4_iomap_valid,
> >> +};
> >> +
> >> static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >> struct ext4_map_blocks *map, loff_t offset,
> >> loff_t length, unsigned int flags)
> >> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >> !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >> iomap->flags |= IOMAP_F_MERGED;
> >>
> >> + iomap->validity_cookie = map->m_seq;
> >> +
> >> /*
> >> * Flags passed to ext4_map_blocks() for direct I/O writes can result
> >> * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> >> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
> >> .iomap_begin = ext4_iomap_begin_report,
> >> };
> >>
> >> +/* Map blocks */
> >> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> >> +
> >> static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >> - loff_t length, struct ext4_map_blocks *map)
> >> + loff_t length, ext4_get_blocks_t get_blocks,
> >> + struct ext4_map_blocks *map)
> >> {
> >> u8 blkbits = inode->i_blkbits;
> >>
> >> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >> map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >> EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> >>
> >> + if (get_blocks)
> >> + return get_blocks(inode, map);
> >> +
> >> return ext4_map_blocks(NULL, inode, map, 0);
> >> }
> >>
> >> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >> if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >> return -ERANGE;
> >>
> >> - ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> >> + ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >> if (ret < 0)
> >> return ret;
> >>
> >> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >> return 0;
> >> }
> >>
> >> +static int ext4_iomap_get_blocks(struct inode *inode,
> >> + struct ext4_map_blocks *map)
> >> +{
> >> + loff_t i_size = i_size_read(inode);
> >> + handle_t *handle;
> >> + int ret;
> >> +
> >> + /*
> >> + * Check if the blocks have already been allocated, this could
> >> + * avoid initiating a new journal transaction and return the
> >> + * mapping information directly.
> >> + */
> >> + if ((map->m_lblk + map->m_len) <=
> >> + round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> >> + ret = ext4_map_blocks(NULL, inode, map, 0);
> >> + if (ret < 0)
> >> + return ret;
> >> + if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> >> + EXT4_MAP_DELAYED))
> >> + return 0;
> >> + }
> >> +
> >> + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> >> + ext4_chunk_trans_blocks(inode, map->m_len));
> >> + if (IS_ERR(handle))
> >> + return PTR_ERR(handle);
> >> +
> >> + ret = ext4_map_blocks(handle, inode, map,
> >> + EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> >> + /*
> >> + * Stop handle here following the lock ordering of the folio lock
> >> + * and the transaction start.
> >> + */
> >> + ext4_journal_stop(handle);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> >> +{
> >> + int ret, retries = 0;
> >> + struct ext4_map_blocks map;
> >> + ext4_get_blocks_t *get_blocks;
> >> +
> >> + ret = ext4_emergency_state(inode->i_sb);
> >> + if (unlikely(ret))
> >> + return ret;
> >> +
> >> + /* Inline data and non-extent are not supported. */
> >> + if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >> + return -ERANGE;
> >> + if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> >> + return -EINVAL;
> >> + if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> >> + return -EINVAL;
> >> +
> >> + if (delalloc)
> >> + get_blocks = ext4_da_map_blocks;
> >> + else
> >> + get_blocks = ext4_iomap_get_blocks;
> >> +retry:
> >> + ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> >> + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> >> + goto retry;
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> + ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> + return 0;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> + iomap, srcmap, false);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> + iomap, srcmap, true);
> >> +}
> >> +
> >> +/*
> >> + * On write failure, drop the stale delayed allocation range and release
> >> + * its reserved space for both start and end blocks. Otherwise, we may
> >> + * leave a range of delayed extents covered by a clean folio, which can
> >> + * result in inaccurate space reservation accounting.
> >> + */
> >> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> >> + loff_t length, struct iomap *iomap)
> >> +{
> >> + down_write(&EXT4_I(inode)->i_data_sem);
> >> + ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> >> + DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> >> + up_write(&EXT4_I(inode)->i_data_sem);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >> + loff_t length, ssize_t written,
> >> + unsigned int flags,
> >> + struct iomap *iomap)
> >> +{
> >> + loff_t start_byte, end_byte;
> >> +
> >> + /* If we didn't reserve the blocks, we're not allowed to punch them. */
> >> + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> >> + return 0;
> >> +
> >> + /* Nothing to do if we've written the entire delalloc extent */
> >> + start_byte = iomap_last_written_block(inode, offset, written);
> >> + end_byte = round_up(offset + length, i_blocksize(inode));
> >> + if (start_byte >= end_byte)
> >> + return 0;
> >> +
> >> + filemap_invalidate_lock(inode->i_mapping);
> >> + iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> >> + iomap, ext4_iomap_punch_delalloc);
> >> + filemap_invalidate_unlock(inode->i_mapping);
> >> + return 0;
> >> +}
> >> +
> >> +/*
> >> + * Since we always allocate unwritten extents, there is no need for
> >> + * iomap_end to clean up allocated blocks on a short write.
> >> + */
> >> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> >> + .iomap_begin = ext4_iomap_buffered_write_begin,
> >> +};
> >> +
> >> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> >> + .iomap_begin = ext4_iomap_buffered_da_write_begin,
> >> + .iomap_end = ext4_iomap_buffered_da_write_end,
> >> +};
> >> +
> >> const struct iomap_ops ext4_iomap_buffered_read_ops = {
> >> .iomap_begin = ext4_iomap_buffered_read_begin,
> >> };
> >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >> index 6a77db4d3124..9bc294b769db 100644
> >> --- a/fs/ext4/super.c
> >> +++ b/fs/ext4/super.c
> >> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
> >> * -> page lock -> i_data_sem (rw)
> >> *
> >> * buffered write path:
> >> - * sb_start_write -> i_mutex -> mmap_lock
> >> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> >> - * i_data_sem (rw)
> >> + * sb_start_write -> i_rwsem (w) -> mmap_lock
> >> + * - buffer_head path:
> >> + * sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> >> + * i_data_sem (rw)
> >> + * - iomap path:
> >> + * sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> >> + * sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
> >> *
> >> * truncate:
> >> * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> >> --
> >> 2.52.0
> >>
>