Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking
From: Mark Rutland
Date: Fri Mar 20 2026 - 11:37:30 EST
On Fri, Mar 20, 2026 at 03:59:40PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 11:30, Mark Rutland wrote:
> > 4) When 'pseudo-NMI' is used, Linux masks interrupts via a combination
> > of DAIF and the 'PMR' priority mask register. At entry and exit,
> > interrupts must be masked via DAIF, but most kernel code will
> > mask/unmask regular interrupts using PMR (e.g. in local_irq_save()
> > and local_irq_restore()).
> >
> > This requires more complicated transitions at entry and exit. Early
> > during entry or late during return, interrupts are masked via DAIF,
> > and kernel code which manipulates PMR to mask/unmask interrupts will
> > not function correctly in this state.
> >
> > This also requires fairly complicated management of DAIF and PMR when
> > handling interrupts, and arm64 has special logic to avoid preempting
> > from pseudo-NMIs which currently lives in
> > arch_irqentry_exit_need_resched().
>
> Why are you routing NMI like exceptions through irqentry_enter() and
> irqentry_exit() in the first place? That's just wrong.
Sorry, the above was not clear, and some of this logic is gunk that has
been carried over unnecessarily from our old exception handling flow.
The issue with pseudo-NMI is that it uses the same exception as regular
interrupts, but we don't know whether we have a pseudo-NMI until we
acknowledge the event at the irqchip level. When a pseudo-NMI is taken,
there are two possibilities:
(1) The pseudo-NMI is taken from a context where interrupts were
*disabled*. The entry code immediately knows it must be a
pseudo-NMI, and we call irqentry_nmi_{enter,exit}(), NOT
irqentry_{enter,exit}(), treating it as an NMI.
(2) The pseudo-NMI was taken from a context where interrupts were
*enabled*. The entry code doesn't know whether it's a pseudo-NMI or
a regular interrupt, so it calls irqentry_{enter,exit}(), and then
within that we'll call nmi_{enter,exit}() to transiently enter NMI
context.
I realise this is crazy. I would love to delete pseudo-NMI.
Unfortunately people are using it.
Putting aside the nesting here, I think it's fine to preempt upon return
from case (2), and we can delete the logic to avoid preempting.
> > 5) Most kernel code runs with all exceptions unmasked. When scheduling,
> > only interrupts should be masked (by PMR when pseudo-NMI is used, and by
> > DAIF otherwise).
> >
> > For most exceptions, arm64's entry code has a sequence similar to that
> > of el1_abort(), which is used for faults:
> >
> > | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> > | {
> > | unsigned long far = read_sysreg(far_el1);
> > | irqentry_state_t state;
> > |
> > | state = enter_from_kernel_mode(regs);
> > | local_daif_inherit(regs);
> > | do_mem_abort(far, esr, regs);
> > | local_daif_mask();
> > | exit_to_kernel_mode(regs, state);
> > | }
> >
> > ... where enter_from_kernel_mode() and exit_to_kernel_mode() are
> > wrappers around irqentry_enter() and irqentry_exit() which perform
> > additional arm64-specific entry/exit logic.
> >
> > Currently, the generic irq entry code will attempt to preempt from any
> > exception under irqentry_exit() where interrupts were unmasked in the
> > original context. As arm64's entry code will have already masked
> > exceptions via DAIF, this results in the problems described above.
>
> See below.
>
> > Fix this by opting out of preemption in irqentry_exit(), and restoring
> > arm64's old behaviour of explicitly preempting when returning from IRQ
> > or FIQ, before calling exit_to_kernel_mode() / irqentry_exit(). This
> > ensures that preemption occurs when only interrupts are masked, and
> > where that masking is compatible with most kernel code (e.g. using PMR
> > when pseudo-NMI is in use).
>
> My gut feeling tells me that there is a fundamental design flaw
> somewhere and the below is papering over it.
>
> > @@ -497,6 +497,8 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
> > do_interrupt_handler(regs, handler);
> > irq_exit_rcu();
> >
> > + irqentry_exit_cond_resched();
> > +
> > exit_to_kernel_mode(regs, state);
> > }
> > static void noinstr el1_interrupt(struct pt_regs *regs,
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 9ef63e4147913..af9cae1f225e3 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -235,8 +235,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> > }
> >
> > instrumentation_begin();
> > - if (IS_ENABLED(CONFIG_PREEMPTION))
> > + if (IS_ENABLED(CONFIG_PREEMPTION) &&
> > + !IS_ENABLED(CONFIG_ARCH_HAS_OWN_IRQ_PREEMPTION)) {
>
> These 'But my architecture is sooo special' switches cause immediate review
> nausea and just confirm that there is a fundamental flaw somewhere else.
>
> > irqentry_exit_cond_resched();
>
> Let's look at how this is supposed to work. I'm just looking at
> irqentry_enter()/exit() and not the NMI variant.
>
> Interrupt/exception is raised
>
> 1) low level architecture specific entry code does all the magic state
> saving, setup etc.
>
> 2) irqentry_enter() is invoked
>
> - checks for user mode or kernel mode entry
>
> - handles RCU on enter from user and if kernel entry hits the idle
> task
>
> - Sets up lockdep, tracing, KMSAN
>
> 3) the interrupt/exception handler is invoked
>
> 4) irqentry_exit() is invoked
>
> - handles exit to user and exit to kernel
>
> - exit to user handles the TIF and other pending work, which can
> schedule and then prepares for return
>
> - exit to kernel
>
> When interrupts were disabled on entry, it just handles RCU and
> returns.
>
> When enabled on entry, it checks whether RCU was watching on
> entry or not. If not it tells RCU that the interrupt nesting is
> done and returns. When RCU was watching it can schedule
>
> 5) Undoes #1 so that it can return to the originally interrupted
> context.
>
> That means at the point where irqentry_enter() is invoked, the
> architecture side should have made sure that everything is set up for
> the kernel to operate until irqentry_exit() returns.
Ok. I think you're saying I should try:
* At entry, *before* irqentry_enter():
- unmask everything EXCEPT regular interrupts.
- fix up all the necessary state.
* At exception exit, *after* irqentry_exit():
- mask everything.
- fix up all the necessary state.
... right?
> Looking at your example:
>
> > | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> > | {
> > | unsigned long far = read_sysreg(far_el1);
> > | irqentry_state_t state;
> > |
> > | state = enter_from_kernel_mode(regs);
> > | local_daif_inherit(regs);
> > | do_mem_abort(far, esr, regs);
> > | local_daif_mask();
> > | exit_to_kernel_mode(regs, state);
>
> and the paragraph right below that:
>
> > Currently, the generic irq entry code will attempt to preempt from any
> > exception under irqentry_exit() where interrupts were unmasked in the
> > original context. As arm64's entry code will have already masked
> > exceptions via DAIF, this results in the problems described above.
>
> To me this looks like your ordering is wrong. Why are you doing the DAIF
> inherit _after_ irqentry_enter() and the mask _before_ irqentry_exit()?
As above, I can go look at reworking this.
For context, we do it this way today for several reasons, including:
(1) Because some of the arch-specific bits (such as checking the TFSR
for MTE) in enter_from_kernel_mode() and exit_to_kernel_mode() need
to be done while RCU is watching, but also need other exceptions
masked. I can look at reworking that.
(2) To minimize the number of times we have to write to things like
DAIF, as that can be expensive.
(3) To simplify the management of things like DAIF, so that we don't
have several points in time at which we need to inherit different
pieces of state.
(4) Historical: that's the flow we had in assembly, prior to the move
to generic irq entry.
> I might be missing something, but this smells more than fishy.
>
> As no other architecture has that problem I'm pretty sure that the
> problem is not in the way how the generic code was designed. Why?
Hey, I'm not saying the generic entry code is wrong, just that there's a
mismatch between it and what would be optimal for arm64.
> Because your architecture is _not_ sooo special! :)
I think it's pretty special, but not necessarily in the same sense. ;)
Mark.