Re: [patch v2 00/11] futex: Address the robust futex unlock race for real

From: Thomas Gleixner

Date: Fri Mar 27 2026 - 06:13:27 EST


On Fri, Mar 27 2026 at 00:42, André Almeida wrote:
> On 26/03/2026 19:08, Rich Felker wrote:
> Em 26/03/2026 19:08, Rich Felker escreveu:
>> On Thu, Mar 26, 2026 at 10:59:20PM +0100, Thomas Gleixner wrote:
>>> On Fri, Mar 20 2026 at 00:24, Thomas Gleixner wrote:
>>>> If the functionality itself is agreed on we only need to agree on the names
>>>> and signatures of the functions exposed through the VDSO before we set them
>>>> in stone. That will hopefully not take another 15 years :)
>>>
>>> Have the libc folks any further opinion on the syscall and the vDSO part
>>> before I prepare v3?
>>
>> This whole conversation has been way too much for me to keep up with,
>> so I'm not sure where it's at right now.
>>
>> From musl's perspective, the way we make robust mutex unlocking safe
>> right now is by inhibiting munmap/mremap/MAP_FIXED and
>> pthread_mutex_destroy while there are any in-flight robust unlocks. It
>> will be nice to be able to conditionally stop doing that if vdso is
>> available, but I can't see using a fallback that requires a syscall,
>> as that would just be a lot more expensive than what we're doing right
>> now and still not work on older kernels. So I think the only part
>> we're interested in is the fully-userspace approach in vdso.
>>
>
> You just need the syscall for the contended case (where you would need a
> syscall anyway for a FUTEX_WAKE).
>
> As Thomas wrote in patch 09/11:
>
> The resulting code sequence for user space is:
>
>     if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
>             err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);
>
> Both the VDSO unlock and the kernel side unlock ensure that the
> pending_op pointer is always cleared when the lock becomes unlocked.
>
>
> So you call the vDSO first. If it fails, it means that the lock is
> contended and you need to call futex(). It will release the lock,
> clear list_op_pending and wake a waiter.
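To make that calling pattern concrete, here is a small self-contained model
in plain C. The emulate_* helpers are stand-ins invented for illustration:
the real fast path would be the proposed __vdso_futex_robust_list$SZ_try_unlock()
and the real fallback the proposed FUTEX_ROBUST_UNLOCK syscall variant, neither
of which is released ABI yet.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the proposed VDSO helper: try to unlock via cmpxchg
 * and clear the pending pointer only on success. Returns the old
 * lock value; != tid means the kernel has to take over. */
static uint32_t emulate_vdso_try_unlock(_Atomic uint32_t *lock, uint32_t tid,
					void **pending_op)
{
	uint32_t old = tid;

	if (atomic_compare_exchange_strong(lock, &old, 0))
		*pending_op = NULL;	/* cleared only after *lock == 0 */
	return old;
}

/* Stand-in for the proposed FUTEX_ROBUST_UNLOCK syscall path: the
 * kernel unlocks, clears the pointer and would wake a waiter
 * (wakeup not modeled here). */
static void emulate_kernel_unlock(_Atomic uint32_t *lock, void **pending_op)
{
	atomic_store(lock, 0);
	*pending_op = NULL;
}

/* The two-step calling pattern from the quoted code sequence. */
static void robust_unlock(_Atomic uint32_t *lock, uint32_t tid,
			  void **pending_op)
{
	if (emulate_vdso_try_unlock(lock, tid, pending_op) != tid)
		emulate_kernel_unlock(lock, pending_op);
}
```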

See also the V1 cover letter which has a full deep dive:

https://lore.kernel.org/20260316162316.356674433@xxxxxxxxxx

TLDR:

The problem can be split into two issues:

1) Contended unlock

2) Uncontended unlock

#1 is solved by moving the unlock into the kernel instead of unlocking
first and then invoking the syscall to wake waiters. The syscall
takes the list_op_pending pointer as an argument and, after unlocking,
i.e. *lock = 0, it clears the list_op_pending pointer.

For this to work, it needs to use try_cmpxchg() like PI unlock does.
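A minimal model of that kernel-side sequence, written with C11 atomics in
place of the kernel's try_cmpxchg(). The function name, the return
convention and the simplified waiter handling are assumptions for
illustration, not the actual kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FUTEX_WAITERS	0x80000000u	/* real futex waiter bit */

/* Sketch of the kernel-side unlock for the contended case (#1).
 * Like PI unlock, retry the cmpxchg until the owner word (TID plus
 * a possible waiter bit) has been replaced by 0. */
static int kernel_robust_unlock(_Atomic uint32_t *lock, uint32_t tid,
				void **list_op_pending)
{
	uint32_t cur = atomic_load(lock);

	while ((cur & ~FUTEX_WAITERS) == tid) {
		if (atomic_compare_exchange_strong(lock, &cur, 0)) {
			/* *lock == 0 now; only then clear the pointer,
			 * matching the ordering described above. */
			*list_op_pending = NULL;
			/* A waiter wakeup would happen here. */
			return 0;
		}
		/* cur was refreshed by the failed cmpxchg; retry. */
	}
	return -1;	/* caller is not the owner */
}
```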

#2 The race is between the successful try_cmpxchg() and the clearing of
the list_op_pending pointer.

That's where the VDSO comes into play. Instead of having the
try_cmpxchg() in the library code, the library invokes the
VDSO-provided variant. That allows the kernel to check in the signal
delivery path whether a successful unlock requires a helping hand to
clear the list_op_pending pointer. If the interrupted IP is in the
critical section _and_ the try_cmpxchg() succeeded, then the kernel
clears the pointer.

In x86 ASM:

0000000000001590 <__vdso_futex_robust_list64_try_unlock@@LINUX_2.6>:
1590: mov %esi,%eax
1592: xor %ecx,%ecx
1594: lock cmpxchg %ecx,(%rdi) // Result goes into ZF
1598: jne 159d <- CS start
159a: mov %rcx,(%rdx) // Clear list_pending_op
159d: ret <- CS end
159e: xchg %ax,%ax
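Read back into C, the helper above is roughly the following (a sketch
mapping the instructions, not the actual VDSO source; the function name
is invented here):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* C rendering of the ASM above. Register mapping per the x86-64
 * calling convention: %rdi = lock, %esi = tid,
 * %rdx = address of the list_op_pending slot. */
static uint32_t vdso_try_unlock_model(_Atomic uint32_t *lock, uint32_t tid,
				      void **list_op_pending)
{
	uint32_t old = tid;	/* mov %esi,%eax */

	/* lock cmpxchg %ecx,(%rdi): on success ZF is set and the
	 * store of NULL below is the critical section the kernel
	 * has to police on signal delivery. */
	if (atomic_compare_exchange_strong(lock, &old, 0))
		*list_op_pending = NULL;	/* mov %rcx,(%rdx) */

	return old;	/* ret; caller compares against tid */
}
```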

So if the kernel observes

IP >= CS start && IP < CS end

then it checks the ZF bit in the flags saved in pt_regs and, if it is
set, clears the list_op_pending pointer.
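That check can be modeled as follows. The struct is a stand-in for the
kernel's pt_regs, and cs_start/cs_end stand for the VDSO critical-section
bounds marked in the disassembly above; the function name and signature
are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's pt_regs; only the fields the check
 * needs are modeled. */
struct fake_pt_regs {
	unsigned long ip;	/* interrupted instruction pointer */
	unsigned long flags;	/* saved EFLAGS */
};

#define X86_EFLAGS_ZF	(1UL << 6)	/* real ZF bit position */

/* Model of the signal-delivery helping hand for #2. */
static void signal_fixup_pending_op(const struct fake_pt_regs *regs,
				    unsigned long cs_start,
				    unsigned long cs_end,
				    void **list_op_pending)
{
	/* IP >= CS start && IP < CS end */
	if (regs->ip < cs_start || regs->ip >= cs_end)
		return;

	/* ZF set => the cmpxchg succeeded but the clear of
	 * list_op_pending may not have executed yet: do it here. */
	if (regs->flags & X86_EFLAGS_ZF)
		*list_op_pending = NULL;
}
```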

Obviously #1 depends on #2 to close all holes.

Thanks,

tglx