Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock

From: Dmitry Ilvokhin

Date: Wed Jun 03 2026 - 10:23:45 EST

On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2026 at 12:34:55PM +0000, Dmitry Ilvokhin wrote:
>
> > Baseline, in best case scenario of least number of executed
> > instructions.
> >
> > 3e0: endbr64 ; 4 bytes (always executed)
> > 3e4: movb $0x0,(%rdi) ; 3 bytes (unlock,
> > ; always executed)
> > 3e7: decl %gs:__preempt_count ; 7 bytes (always executed)
> > 3ee: je 3f5 ; 2 bytes (always executed)
> > 3f0: jmp __x86_return_thunk ; 5 bytes (executed if above
> > ; je is not taken)
> > ; rest is not executed
> > 3f5: call __SCT__preempt_schedule ; 5 bytes
> > 3fa: jmp __x86_return_thunk ; 5 bytes
> >
> > Tracepoint (again same case of least number of executed instructions).
> >
> > bc0: endbr64 ; 4 bytes (always executed)
> > bc4: xchg %ax,%ax ; 2 bytes (always executed, this is an
> > ; only addition on the execution path).
> > bc6: movb $0x0,(%rdi) ; 3 bytes (unlock, always executed)
> > bc9: decl %gs:__preempt_count ; 7 bytes (always executed)
> > bd0: je bde ; 2 bytes (always executed)
> > bd2: jmp __x86_return_thunk ; 5 bytes (executed if above
> > ; je is not taken)
> > ; rest is not executed
> > bd7: call queued_spin_release_traced ; 5 bytes
> > bdc: jmp bc9 ; 2 bytes
> > bde: call __SCT__preempt_schedule ; 5 bytes
> > be3: jmp __x86_return_thunk ; 5 bytes
> >
>
> So I've been playing with this a bit, and it is all really sad.
>
> Now, since pretty much everybody+dog will have PARAVIRT_SPINLOCK=y, the
> 'best' solution would be changing that paravirt call with a
> static_call(), that actually shrinks the code by 1 byte.
>
> And then this tracepoint nonsense can simply use a different unlock
> function, just like paravirt.
>
> 0000 00000000000001d0 <_raw_spin_unlock>:
> 0000 1d0: f3 0f 1e fa endbr64
> 0004 1d4: ff 15 00 00 00 00 call *0x0(%rip) # 1da <_raw_spin_unlock+0xa> 1d6: R_X86_64_PC32 pv_ops_lock+0x4
> 000a 1da: 65 ff 0d 00 00 00 00 decl %gs:0x0(%rip) # 1e1 <_raw_spin_unlock+0x11> 1dd: R_X86_64_PC32 __preempt_count-0x4
> 0011 1e1: 74 06 je 1e9 <_raw_spin_unlock+0x19>
> 0013 1e3: 2e e9 00 00 00 00 cs jmp 1e9 <_raw_spin_unlock+0x19> 1e5: R_X86_64_PLT32 __x86_return_thunk-0x4
> 0019 1e9: e8 00 00 00 00 call 1ee <_raw_spin_unlock+0x1e> 1ea: R_X86_64_PLT32 __SCT__preempt_schedule-0x4
> 001e 1ee: 2e e9 00 00 00 00 cs jmp 1f4 <_raw_spin_unlock+0x24> 1f0: R_X86_64_PLT32 __x86_return_thunk-0x4
>
>
> 0000 00000000000001d0 <_raw_spin_unlock>:
> 0000 1d0: f3 0f 1e fa endbr64
> 0004 1d4: e8 00 00 00 00 call 1d9 <_raw_spin_unlock+0x9> 1d5: R_X86_64_PLT32 __SCT__queued_spin_unlock-0x4
> 0009 1d9: 65 ff 0d 00 00 00 00 decl %gs:0x0(%rip) # 1e0 <_raw_spin_unlock+0x10> 1dc: R_X86_64_PC32 __preempt_count-0x4
> 0010 1e0: 74 06 je 1e8 <_raw_spin_unlock+0x18>
> 0012 1e2: 2e e9 00 00 00 00 cs jmp 1e8 <_raw_spin_unlock+0x18> 1e4: R_X86_64_PLT32 __x86_return_thunk-0x4
> 0018 1e8: e8 00 00 00 00 call 1ed <_raw_spin_unlock+0x1d> 1e9: R_X86_64_PLT32 __SCT__preempt_schedule-0x4
> 001d 1ed: 2e e9 00 00 00 00 cs jmp 1f3 <_raw_spin_unlock+0x23> 1ef: R_X86_64_PLT32 __x86_return_thunk-0x4
>
>
> Something a little like so, which is completely untested, except to
> build kernel/locking/spinlock.o (with clang-23).

Thanks a lot for taking a look, Peter.

I like the static_call idea. It's truly zero cost on x86 (and, as you
note, even a byte smaller). The one caveat is that it relies on
HAVE_STATIC_CALL_INLINE to stay free.

So my plan would be: static_call where HAVE_STATIC_CALL_INLINE is
available (x86), and a static branch fallback elsewhere, gated behind a
default-off config so it imposes nothing on arches/kernels that don't
opt in. I'm mostly interested in x86, but would like arm64 to work too,
which would use the fallback.

Concretely:

1. Split the sleepable-lock patches out and send them separately.
They're independent of the static call work and look far less
controversial.

2. Convert the paravirt spinlock unlock to a static_call, as the
foundation for the unlock tracepoint. I'm happy to take a stab at it.
Let me know if you'd rather do it yourself.

3. Build the unlock tracepoint on top: static_call where it's cheap,
config-gated static_branch fallback where it isn't.

Does this plan sound reasonable to you?

>
> Also, I think someone should go do some performance runs with
> ARCH_INLINE_SPIN_* set for x86 just like for s390.

That's a good point, I'll run benchmarks and report back with the
results.