Re: [PATCH v2 1/2] ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ

From: Andreas Dilger

Date: Wed May 27 2026 - 15:47:03 EST

On May 27, 2026, at 03:03, Bohdan Trach <bohdan.trach@xxxxxxxxxxxxxxx> wrote:
>
> EXT4_MB_GRP_TEST_AND_SET_READ uses the test_and_set_bit() function, which
> issues an atomic write. This can cause high overhead due to cache
> contention when multiple threads iterate over groups in a tight loop,
> as is the case for ext4_mb_prefetch(). We have seen this to be a
> problem for Kunpeng 920b CPUs, which use a single ARM LSE instruction
> for this purpose.
>
> Avoid this unconditional atomic write by testing the bit first without
> changing its value. This is OK for this use case as this bit is never
> unset.
>
> This change significantly reduces costs of fallocate() operations which
> trigger linear group scans on large multicore machines where
> test_and_set_bit issues an atomic write operation unconditionally.
>
> Signed-off-by: Bohdan Trach <bohdan.trach@xxxxxxxxxxxxxxx>

Thanks for the patch. Definitely the benchmarks in the 0/2 email show
significant gains for the Kunpeng system, and reducing contention makes sense
as core counts increase and the likely case is that the bit is already set.

That said, I wonder if this should (also/instead) be put into test_and_set_bit()
itself, or add test_and_unlikely_set_bit() or test_and_rarely_set_bit()
(or similar) optimized for the case where the bit is likely to already be set.
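
To illustrate the idea, here is a rough userspace sketch using C11 atomics
rather than the kernel bitops. The name test_and_rarely_set_bit() is
hypothetical (no such helper exists today), and the relaxed load on the fast
path is an assumption about what callers could tolerate:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper: behaves like test_and_set_bit(), but is tuned for
 * the case where the bit is almost always already set. Returns the old
 * value of the bit. 'nr' must be smaller than the width of unsigned long.
 */
static inline bool test_and_rarely_set_bit(unsigned int nr,
					   _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << nr;

	/* Fast path: a plain load, so no atomic write when the bit is set. */
	if (atomic_load_explicit(addr, memory_order_relaxed) & mask)
		return true;

	/* Slow path: atomic read-modify-write only if the bit looked clear. */
	return atomic_fetch_or(addr, mask) & mask;
}
```

Note that the fast path gives up the full ordering that test_and_set_bit()
provides, so a helper like this would only suit callers that do not depend on
that ordering when the bit is already set.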

I see in your benchmarking that there are no "apples-to-apples" comparisons for
ARM (Kunpeng) vs. AMD on the same storage. The storage hardware and space usage
are different for each test run, and the AMD numbers show only marginal gains,
with more negative than positive results across the thread counts:

> Benchmark on an existing file system for AMD 9654 (15T FS, 6% space
> used), kernel 7.1-rc3. This shows the performance impact on a mostly
> free file system.
> | thr. | base perf | patched perf | improv. (%) |
> |------|-----------|--------------|-------------|
> | 1 | 30901 | 31191 | +0.9384810 |
> | 2 | 50874 | 50504 | -0.7272870 |
> | 4 | 66068 | 64108 | -2.9666404 |
> | 8 | 63963 | 61927 | -3.1830902 |
> | 16 | 47809 | 47044 | -1.6001171 |
> | 32 | 42441 | 42326 | -0.2709644 |
> | 64 | 39773 | 39929 | +0.3922259 |
> | 128 | 37065 | 36413 | -1.7590719 |

The performance reduction might be caused by the extra memory access (test_bit()
before test_and_set_bit()) only adding overhead on the AMD CPU implementation?
It would be useful to see the testing on Kunpeng vs. AMD/Intel on the same
storage device/usage.

That would tell us if it is more appropriate to optimize this in the aarch64
test_and_set_bit() rather than in ext4.

Cheers, Andreas

> ---
> fs/ext4/ext4.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 56b82d4a15d7..f8eacf1375f8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3551,7 +3551,13 @@ struct ext4_group_info {
> #define EXT4_MB_GRP_CLEAR_TRIMMED(grp) \
> (clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
> #define EXT4_MB_GRP_TEST_AND_SET_READ(grp) \
> - (test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
> + (ext4_mb_grp_test_and_set_read((grp)))
> +
> +static inline int ext4_mb_grp_test_and_set_read(struct ext4_group_info *grp)
> +{
> + return (test_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state) ||
> + test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state));
> +}
>
> #define EXT4_MAX_CONTENTION 8
> #define EXT4_CONTENTION_THRESHOLD 2
> --
> 2.43.0
>
