Re: Possible PREEMPT_RT live-lock / priority-inversion between FUTEX_CMP_REQUEUE_PI and FUTEX_WAIT_REQUEUE_PI

From: Sebastian Andrzej Siewior

Date: Mon Mar 23 2026 - 09:30:51 EST


On 2026-03-20 19:23:01 [+0000], Moritz KLAMMLER (FERCHAU) wrote:
> Hello,
Hi,

> we're running Linux 6.6 with PREEMPT_RT on a single-core armv7l machine
v6.6.109+?

> and observed our devices getting locked-up every few days. We're using
> RT/PI condition variables from librtpi [1] and determined that the RT
> (SCHED_FIFO) thread making the FUTEX_CMP_REQUEUE_PI syscall from within
> pi_cond_broadcast seems to occasionally live-lock inside the kernel.
>
> Thanks to a possibly less than ideal design decision in our system, the
> "producer" thread calling pi_cond_broadcast (i.e. doing the
> FUTEX_CMP_REQUEUE_PI) has a higher priority than the "consumer" threads
> that are waiting on the condition variable (calling pi_cond_timedwait
> which eventually makes a FUTEX_WAIT_REQUEUE_PI call). While this might
> not be ideal, I suppose that it still ought to be allowed; please
> correct me if I should be mistaken on that point.

Not sure why not. Worst case would be that the producer snatches all the
locks and sees no waiter because the consumer never managed to enqueue.

> What seems to happen next is that when the waiter exceeds its finite
> timeout [2] and half an eye-blink later, the producer thread decides to
The alternative to timeout is signal.

> call FUTEX_CMP_REQUEUE_PI after all, the lower-priority consumer might
> make it to the point where it sets the requeue state to
> Q_REQUEUE_PI_IGNORE in futex_requeue_pi_wakeup_sync but then gets
> preempted before it has a chance to remove itself from the waiters list.
> Now, the higher-priority producer thread calls futex_requeue_pi_prepare
> which will return false because it sees the Q_REQUEUE_PI_IGNORE.

> Subsequently, futex_proxy_trylock_atomic will fail with -EAGAIN and

So the syscall that saw Q_REQUEUE_PI_IGNORE returned, and now a second
requeue-PI is attempted?

> futex_requeue "goto retry". Which effectively results in the
> higher-priority RT thread busy-waiting on the lower-priority thread
> forever. It will call cond_resched before the "goto retry" but since it
> is considered the most important task in the system, it doesn't seem to
> be scheduled away anymore.

Yup. Kind of obvious if you put it like this.
What about

diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 7e43839ca7b05..ce02cc715c98d 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -307,8 +307,11 @@ futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1,
 		return -EINVAL;
 
 	/* Ensure that this does not race against an early wakeup */
-	if (!futex_requeue_pi_prepare(top_waiter, NULL))
+	if (!futex_requeue_pi_prepare(top_waiter, NULL)) {
+		plist_del(&top_waiter->list, &hb1->chain);
+		futex_hb_waiters_dec(hb1);
 		return -EAGAIN;
+	}
 
 	/*
 	 * Try to take the lock for top_waiter and set the FUTEX_WAITERS bit
@@ -709,8 +712,10 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 	 * We were woken prior to requeue by a timeout or a signal.
 	 * Unqueue the futex_q and determine which it was.
 	 */
-	plist_del(&q->list, &hb->chain);
-	futex_hb_waiters_dec(hb);
+	if (!plist_node_empty(&q->list)) {
+		plist_del(&q->list, &hb->chain);
+		futex_hb_waiters_dec(hb);
+	}
 
 	/* Handle spurious wakeups gracefully */
 	ret = -EWOULDBLOCK;


? It compiles and might work.

Sebastian