Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
From: Peter Zijlstra
Date: Tue Apr 28 2026 - 07:45:19 EST
On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
>
> > kernel/sched/core.c | 11 +++++++++++
> > 1 file changed, 11 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25ae..5f684caefd8b2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
> > try_to_block_task(rq, prev, &prev_state,
> > !task_is_blocked(prev));
> > switch_count = &prev->nvcsw;
> > + } else if (preempt && prev->blocked_on) {
> > + /*
> > + * If we are SM_PREEMPT, we may have interrupted
> > + * after blocked_on was set, before schedule()
> > + * was run, preventing workques from running. So
>
> workqueues
>
> > + * clear blocked_on and mark task RUNNING so it
> > + * can be reselected to run and complete its
> > + * logic
> > + */
> > + WRITE_ONCE(prev->__state, TASK_RUNNING);
> > + clear_task_blocked_on(prev, NULL);
> > }
> >
> > pick_again:
>
> *groan*, this feels wrong. Preemption should never touch state. Let me
> try and wake up and make sense of this.
So all non-special block states *SHOULD* be in a loop and handle
spurious wakeups -- I fixed a pile of offenders some many years ago, but
there really isn't anything in the kernel that validates this.
[ I suppose someone could try and do a cocci test for this? ]
Any wait for non-special states that is not a loop is fundamentally
broken, since many of the lock wake-up paths are explicitly racy in that
they can cause spurious wakeups (which is the safe side of the race,
since insufficient wakeups is bad etc.).
OTOH special states, are special, esp. because they cannot handle
spurious wakeups.
Eg, consider something like:
set_current_state(TASK_FROZEN)
<PREEMPT>
current->__state = TASK_RUNNING
</PREEMPT/
schedule();
is all sorts of broken. Now, obiously special states must never have
blocked_on set, so this can be fudged about. But still, touching __state
from schedule seems wrong.
Anyway, the historical distinction between a blocked task and a
preempted task is that the blocked task is not on the runqueue, while
the preempted task is kept on the runqueue.
Obviously PE wrecks this, and hence the problem. And yeah, amazing we
never hit this before.
Something like so perhaps?
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..0bd5da8360f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,7 +846,11 @@ struct task_struct {
struct alloc_tag *alloc_tag;
#endif
- int on_cpu;
+ u8 on_cpu;
+ u8 on_rq;
+ u8 is_blocked;
+ u8 __pad;
+
struct __call_single_node wake_entry;
unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
@@ -861,7 +865,6 @@ struct task_struct {
*/
int recent_used_cpu;
int wake_cpu;
- int on_rq;
int prio;
int static_prio;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25a..06817ae0cbd9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
* [ The astute reader will observe that it is possible for two tasks on one
* CPU to have ->on_cpu = 1 at the same time. ]
*
+* p->is_blocked <- { 0, 1 }:
+*
+* is set by block_task() and cleared by ttwu_do_activate() and indicates
+* this task is blocked, as opposed to runnable. Used to distinguish between
+* preempted and blocked tasks for proxy exec, which keeps everything on the
+* runqueue.
+ *
* task_cpu(p): is changed by set_task_cpu(), the rules are:
*
* - Don't call set_task_cpu() on a blocked task:
@@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
static void block_task(struct rq *rq, struct task_struct *p, int flags)
{
+ p->is_blocked = 1;
if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
__block_task(rq, p);
}
@@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
atomic_dec(&task_rq(p)->nr_iowait);
}
+ p->is_blocked = 0;
activate_task(rq, p, en_flags);
wakeup_preempt(rq, p, wake_flags);
@@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
struct task_struct *prev_donor = rq->donor;
rq_set_donor(rq, next);
- if (unlikely(next->blocked_on)) {
+ if (unlikely(next->is_blocked && next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
if (!next) {
zap_balance_callbacks(rq);