Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()

From: Ankur Arora

Date: Mon Mar 16 2026 - 18:10:04 EST



Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sun, 15 Mar 2026 18:36:39 -0700 Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
>> Hi,
>>
>> This series adds waited variants of the smp_cond_load() primitives:
>> smp_cond_load_relaxed_timeout(), and smp_cond_load_acquire_timeout().
>>
>> ...
>>
>
> How are we to determine that this change is successful, useful, etc?

Good point. So this series was split off from this one here:
https://lore.kernel.org/lkml/20250218213337.377987-1-ankur.a.arora@xxxxxxxxxx/

That parent series enables ARCH_HAS_CPU_RELAX on arm64, which should allow
relatively cheap polling in idle. However, getting there needs a few more
patches from it beyond what is included here.

> Reduced CPU consumption? Reduced energy usage? Improved latencies?

With the additional patches this should improve wakeup latency:

I ran the sched-pipe test with processes on VCPUs 4 and 5 with
kvm-arm.wfi_trap_policy=notrap.

# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
perf bench sched pipe -l 1000000 -c 4

# No haltpoll (and, no TIF_POLLING_NRFLAG):

Performance counter stats for 'CPU(s) 4,5' (5 runs):

25,229.57 msec task-clock # 2.000 CPUs utilized ( +- 7.75% )
45,821,250,284 cycles # 1.816 GHz ( +- 10.07% )
26,557,496,665 instructions # 0.58 insn per cycle ( +- 0.21% )
0 sched:sched_wake_idle_without_ipi # 0.000 /sec

12.615 +- 0.977 seconds time elapsed ( +- 7.75% )


# Haltpoll:

Performance counter stats for 'CPU(s) 4,5' (5 runs):

15,131.58 msec task-clock # 2.000 CPUs utilized ( +- 10.00% )
34,158,188,839 cycles # 2.257 GHz ( +- 6.91% )
20,824,950,916 instructions # 0.61 insn per cycle ( +- 0.09% )
1,983,822 sched:sched_wake_idle_without_ipi # 131.105 K/sec ( +- 0.78% )

7.566 +- 0.756 seconds time elapsed ( +- 10.00% )

We get a decent boost largely because we execute ~20% fewer
instructions. I'm not sure how CPU frequency scaling works in a VM, but
we also run at a higher frequency.

(That result is specific to guests, but that series also enables this
with acpi-idle for bare metal.)

(From: https://lore.kernel.org/lkml/877c9zhk68.fsf@xxxxxxxxxx/)

>> Finally update poll_idle() and resilient queued spinlocks to use them.
>
> Have you identified other suitable sites for conversion?

Haven't found other places in the core kernel where this could be used.
I think one reason is that the typical kernel wait is unbounded.

There are some sites in drivers/ with this pattern. For instance, I think
__arm_smmu_cmdq_poll_until_msi() in drivers/iommu/arm/arm-smmu-v3 could
be converted.

However, as David Laight pointed out in this thread
(https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/),
this would be fine so long as the polling is on memory, but would
need some work to handle MMIO.
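For concreteness, a pseudocode sketch of what such a conversion might look
like (the condition fn(), the pointer, and the deadline plumbing are all
illustrative assumptions here, not the actual smmu-v3 code):

```
/* Before: open-coded bounded poll (schematic) */
while (!fn(READ_ONCE(*ptr))) {
	if (local_clock() >= time_limit_ns)
		return -ETIMEDOUT;
	cpu_relax();
}

/* After: condition, ordering, and timeout expressed in one
 * primitive; VAL names the loaded value, and the caller must
 * recheck the condition to distinguish success from timeout. */
val = smp_cond_load_acquire_timeout(ptr, fn(VAL), time_limit_ns);
if (!fn(val))
	return -ETIMEDOUT;
```

As noted above, this only works when *ptr is normal memory; an MMIO poll
would need a readl()-style variant.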

>> Documentation/atomic_t.txt | 14 +++--
>> arch/arm64/Kconfig | 3 +
>> arch/arm64/include/asm/barrier.h | 23 +++++++
>> arch/arm64/include/asm/cmpxchg.h | 62 +++++++++++++++----
>> arch/arm64/include/asm/delay-const.h | 27 +++++++++
>> arch/arm64/include/asm/rqspinlock.h | 85 --------------------------
>> arch/arm64/lib/delay.c | 15 ++---
>> drivers/cpuidle/poll_state.c | 21 +------
>> drivers/soc/qcom/rpmh-rsc.c | 8 +--
>> include/asm-generic/barrier.h | 90 ++++++++++++++++++++++++++++
>> include/linux/atomic.h | 10 ++++
>> include/linux/atomic/atomic-long.h | 18 +++---
>> include/linux/sched/idle.h | 29 +++++++++
>> kernel/bpf/rqspinlock.c | 77 +++++++++++++++---------
>> scripts/atomic/gen-atomic-long.sh | 16 +++--
>> 15 files changed, 320 insertions(+), 178 deletions(-)
>> create mode 100644 arch/arm64/include/asm/delay-const.h
>
> Some sort of testing in lib/tests/ would be appropriate and useful.

Makes sense. Will add.

Thanks
--
ankur