Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
From: Thomas Gleixner
Date: Fri Mar 20 2026 - 10:12:33 EST
On Fri, Mar 20 2026 at 14:04, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 11:30:25AM +0000, Mark Rutland wrote:
>> Thomas, Peter, I have a couple of things I'd like to check:
>>
>> (1) The generic irq entry code will preempt from any exception (e.g. a
>> synchronous fault) where interrupts were unmasked in the original
>> context. Is that intentional/necessary, or was that just the way the
>> x86 code happened to be implemented?
>>
>> I assume that it'd be fine if arm64 only preempted from true
>> interrupts, but if that was intentional/necessary I can go rework
>> this.
>
> So NMI-from-kernel must not trigger resched IIRC. There is some code
> that relies on this somewhere. And on x86 many of those synchronous
> exceptions are marked as NMI, since they can happen with IRQs disabled
> inside locks etc.
>
> But for the rest I don't think we care particularly. Notably page-fault
> will already schedule itself when possible (faults leading to IO and
> blocking).
Right. In general we allow preemption on any interrupt, trap and exception
when:
1) the interrupted context had interrupts enabled
2) RCU was watching in the original context
This _is_ intentional as there is no reason to defer preemption in such
a case. The RT people might get upset if you do so.
NMI-like exceptions, which are not allowed to schedule, must therefore
never go through irqentry_irq_enter() and irqentry_irq_exit().
irqentry_nmi_enter() and irqentry_nmi_exit() exist for a technical
reason and are not just decorative. :)
>> (2) The generic irq entry code only preempts when RCU was watching in
>> the original context. IIUC that's just to avoid preempting from the
>> idle thread. Is it functionally necessary to avoid that, or is that
>> just an optimization?
>>
>> I'm asking because historically arm64 didn't check that, and I
>> haven't bothered checking here. I don't know whether we have a
>> latent functional bug.
>
> Like I told you on IRC, I *think* this is just an optimization, since if
> you hit idle, the idle loop will take care of scheduling. But I can't
> quite remember the details here, and wish we'd have written a sensible
> comment at that spot.
There is one, but it's obviously not detailed enough.
> Other places where RCU isn't watching are userspace and KVM. The first
> isn't relevant because this is return-to-kernel, and the second I'm not
> sure about.
>
> Thomas, can you remember?
Yes. It's not an optimization. It's a correctness issue.
If the interrupted context is RCU idle, then you have to carefully go
back to that context so that it can tell RCU that it is done with the
idle state and that RCU has to pay attention again. Otherwise all of
this becomes imbalanced.
This is about context-level nesting:

           ...
  L1.A     ct_cpuidle_enter();
               -> interrupt
  L2.A         ct_irq_enter();
               ...                // Set NEED_RESCHED
  L2.B         ct_irq_exit();
           ...
  L1.B     ct_cpuidle_exit();
Scheduling between #L2.B and #L1.B makes RCU rightfully upset. Think
about it this way:
  L1.A     preempt_disable();
  L2.A         local_bh_disable();
               ..
  L2.B         local_bh_enable();
           if (need_resched())
               schedule();
  L1.B     preempt_enable();
RCU is not any different. For context-level nesting of any kind the only
valid order is:
L1.A -> L2.A -> L2.B -> L1.B
Pretty obvious if you actually think about it, no?
Thanks,
tglx