Re: [PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking

From: Mark Rutland

Date: Fri Mar 20 2026 - 13:38:48 EST


On Fri, Mar 20, 2026 at 05:26:47PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 15:37, Mark Rutland wrote:
> > On Fri, Mar 20, 2026 at 03:59:40PM +0100, Thomas Gleixner wrote:
> >> Why are you routing NMI like exceptions through irqentry_enter() and
> >> irqentry_exit() in the first place? That's just wrong.
> >
> > Sorry, the above was not clear, and some of this logic is gunk that has
> > been carried over unnecessarily from our old exception handling flow.
> >
> > The issue with pseudo-NMI is that it uses the same exception as regular
> > interrupts, but we don't know whether we have a pseudo-NMI until we
> > acknowledge the event at the irqchip level. When a pseudo-NMI is taken,
> > there are two possibilities:
> >
> > (1) The pseudo-NMI is taken from a context where interrupts were
> > *disabled*. The entry code immediately knows it must be a
> > pseudo-NMI, and we call irqentry_nmi_{enter,exit}(), NOT
> > irqentry_{enter,exit}(), treating it as an NMI.
> >
> > (2) The pseudo-NMI was taken from a context where interrupts were
> > *enabled*. The entry code doesn't know whether it's a pseudo-NMI or
> > a regular interrupt, so it calls irqentry_{enter,exit}(), and then
> > within that we'll call nmi_{enter,exit}() to transiently enter NMI
> > context.
> >
> > I realise this is crazy. I would love to delete pseudo-NMI.
> > Unfortunately people are using it.
>
> What is it used for?

It's used where people would want an NMI; specifically today that's:

* The PMU interrupt (so people can profile code that has IRQs off).
* IPI_CPU_BACKTRACE (so we can get a backtrace when code has IRQs off).
* IPI_CPU_STOP_NMI (so panic can stop secondaries more reliably).
* IPI_KGDB_ROUNDUP (so KGDB can stop secondaries more reliably).

I mostly care about the first two. People *really* want the PMU interrupt as an
NMI.

> > Putting aside the nesting here, I think it's fine to preempt upon return
> > from case (2), and we can delete the logic to avoid preempting.
>
> Correct.

Cool; thanks for confirming!

> >> That means at the point where irqentry_enter() is invoked, the
> >> architecture side should have made sure that everything is set up for
> >> the kernel to operate until irqentry_exit() returns.
> >
> > Ok. I think you're saying I should try:
> >
> > * At entry, *before* irqentry_enter():
> > - unmask everything EXCEPT regular interrupts.
> > - fix up all the necessary state.
> >
> > * At exception exit, *after* irqentry_exit():
> > - mask everything.
> > - fix up all the necessary state.
> >
> > ... right?
>
> Yes.

Ok; I'll go experiment...

My major concern is that this is liable to make the arm64 entry
sequences substantially more expensive and complicated (see notes
below), but I should go see how bad that is in practice.

My other concern is that I'd like to backport a fix for the issue I
mentioned in the commit message, and I'd like to have something that's
simpler than "rewrite the entire entry code" for that -- for backporting
it'd be easier to revert the move to generic irq entry.

> >> Looking at your example:
> >>
> >> > | static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
> >> > | {
> >> > | unsigned long far = read_sysreg(far_el1);
> >> > | irqentry_state_t state;
> >> > |
> >> > | state = enter_from_kernel_mode(regs);
> >> > | local_daif_inherit(regs);
> >> > | do_mem_abort(far, esr, regs);
> >> > | local_daif_mask();
> >> > | exit_to_kernel_mode(regs, state);
> >>
> >> and the paragraph right below that:
> >>
> >> > Currently, the generic irq entry code will attempt to preempt from any
> >> > exception under irqentry_exit() where interrupts were unmasked in the
> >> > original context. As arm64's entry code will have already masked
> >> > exceptions via DAIF, this results in the problems described above.
> >>
> >> To me this looks like your ordering is wrong. Why are you doing the DAIF
> >> inherit _after_ irqentry_enter() and the mask _before_ irqentry_exit()?
> >
> > As above, I can go look at reworking this.
> >
> > For context, we do it this way today for several reasons, including:
> >
> > (1) Because some of the arch-specific bits (such as checking the TFSR
> > for MTE) in enter_from_kernel_mode() and exit_to_kernel_mode() need
> > to be done while RCU is watching, etc, but needs other exceptions
> > masked. I can look at reworking that.
>
> Something like the below should do that for you. If you need more than
> regs, then you can either stick something on your stack frame or we go
> and extend irqentry_enter()/exit() with an additional argument which
> allows you to hand some exception/interrupt specific cookie in. The core
> code would just hand it through to arch_irqentry_enter/exit_rcu() along
> with @regs. That cookie might be data or even a function pointer. The
> core does not have to know about it.

I don't think that helps for exit, because the contradiction is "while
RCU is watching, etc, but needs other exceptions masked", and as above,
we can't have that, because (with the flow you've suggested) exceptions
aren't masked until after irqentry_exit().

Let me go think a bit harder about that. The exit path for TFSR is
already somewhat best-effort. Maybe the right thing to do is push that
entirely out of the way and re-enter when it indicates a problem.

> > (2) To minimize the number of times we have to write to things like
> > DAIF, as that can be expensive.
> >
> > (3) To simplify the management of things like DAIF, so that we don't
> > have several points in time at which we need to inherit different
> > pieces of state.
>
> The above should cover your #2/3 too, no?

Not really, but we might be talking past one another.

I *think* you're saying that because the arch code would manage DAIF
early during entry and late during exit, that would all be in one place.

However, that doubles the number of times we have to write to DAIF: at
entry we'd have to poke it once to unmask everything except IRQs, then
again to unmask IRQs, and exit would need the inverse. We'd also have to
split the inheritance logic into inherit-everything-but-interrupt and
inherit-only-interrupt, which is doable but not ideal. With pseudo-NMI
it's even worse, but that's largely because pseudo-NMI is
over-complicated today.

As above, I'll go experiment and see how bad this is in practice.

> > (4) Historical, as that's the flow we had in assembly, and prior to the
> > move to generic irq entry.
>
> No comment :)

:)

Mark.

> >> I might be missing something, but this smells more than fishy.
> >>
> >> As no other architecture has that problem I'm pretty sure that the
> >> problem is not in the way how the generic code was designed. Why?
> >
> > Hey, I'm not saying the generic entry code is wrong, just that there's a
> > mismatch between it and what would be optimal for arm64.
> >
> >> Because your architecture is _not_ sooo special! :)
> >
> > I think it's pretty special, but not necessarily in the same sense. ;)
>
> Hehehe.
>
> Thanks,
>
> tglx
> ---
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -149,6 +149,7 @@ noinstr irqentry_state_t irqentry_enter(
>  		instrumentation_begin();
>  		kmsan_unpoison_entry_regs(regs);
>  		trace_hardirqs_off_finish();
> +		arch_irqentry_enter_rcu(regs);
>  		instrumentation_end();
>
>  		ret.exit_rcu = true;
> @@ -166,6 +167,7 @@ noinstr irqentry_state_t irqentry_enter(
>  	kmsan_unpoison_entry_regs(regs);
>  	rcu_irq_enter_check_tick();
>  	trace_hardirqs_off_finish();
> +	arch_irqentry_enter_rcu(regs);
>  	instrumentation_end();
>
>  	return ret;
> @@ -225,6 +227,7 @@ noinstr void irqentry_exit(struct pt_reg
>  	 */
>  	if (state.exit_rcu) {
>  		instrumentation_begin();
> +		arch_irqentry_exit_rcu(regs);
>  		hrtimer_rearm_deferred();
>  		/* Tell the tracer that IRET will enable interrupts */
>  		trace_hardirqs_on_prepare();
> @@ -239,11 +242,13 @@ noinstr void irqentry_exit(struct pt_reg
>  		if (IS_ENABLED(CONFIG_PREEMPTION))
>  			irqentry_exit_cond_resched();
>
> +		arch_irqentry_exit_rcu(regs);
>  		hrtimer_rearm_deferred();
>  		/* Covers both tracing and lockdep */
>  		trace_hardirqs_on();
>  		instrumentation_end();
>  	} else {
> +		arch_irqentry_exit_rcu(regs);
>  		/*
>  		 * IRQ flags state is correct already. Just tell RCU if it
>  		 * was not watching on entry.