Re: [PATCH] x86/mce: Restore MCA polling interval halving
From: Luck, Tony
Date: Tue Apr 14 2026 - 18:22:36 EST
On Tue, Apr 14, 2026 at 11:18:03PM +0200, Borislav Petkov wrote:
> On Tue, Apr 07, 2026 at 03:04:04PM +0000, Zhuo, Qiuxu wrote:
> > I injected a correctable error with the CMCI interrupt enabled on an Intel testing machine,
> > and this mce_early_notifier() was invoked. But the following code in mce_notify_irq() is now
> > never executed, and I didn't see the error log message "Machine check events logged".
>
> You did disable the CEC, right?
>
> In any case, let's have a look:
>
> When we log an MCE, we do:
>
> mce_log                        # add it to the genpool and run the works
>  -> mce_irq_work
>   -> mce_schedule_work
>    -> ..
>     -> mce_gen_pool_process    # this'll send it down the notifier chain
>      -> x86_mce_decoder_chain
>       -> mce_early_notifier    # that guy sees it here and issues the trace record
>
> Now, mce_notify_irq() would do mce_work_trigger() and issue the printk
> - dunno, I guess we still want our printk and probably should add it back
> - but the first one - the work triggering - that's mcelog. It is using that
> usermode helper gunk, dunno if you guys still need it.
>
> Because mcelog does register to the decoder chain so it'll get to see the MCE
> eventually. So that part is fine.
>
> The only question is the usermode helper gunk...
>
> Tony?

Ran my own test. RAS_CEC disabled. Booted with mce=no_cmci and injected a
corrected error every twenty seconds. Added pr_info() to mce_timer_fn()
to say which CPUs were doubling or halving their interval.
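
For reference, the halving/doubling rule can be sketched in user space
roughly as follows. This is only an illustration, not the code in
mce_timer_fn(): the function name, the millisecond units and the two
bounds are assumptions made up for the example.

```c
#include <stdbool.h>

/* Hypothetical bounds in milliseconds, chosen for illustration only. */
#define POLL_MIN_MS 10UL
#define POLL_MAX_MS 300000UL

/*
 * Return the next polling interval for one CPU.
 * If this poll found an error, poll twice as often (down to POLL_MIN_MS);
 * otherwise back off by doubling (up to POLL_MAX_MS).
 */
static unsigned long next_poll_interval(unsigned long cur_ms, bool found_error)
{
	if (found_error) {
		unsigned long half = cur_ms / 2;

		return half < POLL_MIN_MS ? POLL_MIN_MS : half;
	}

	/* Clamp before multiplying so the result never exceeds the maximum. */
	if (cur_ms >= POLL_MAX_MS / 2)
		return POLL_MAX_MS;

	return cur_ms * 2;
}
```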

Results:

I did see some "Machine check events logged" console messages.

The debug messages are "interesting". Polling timers on CPUs aren't
synchronized, so I got random bursts of debug messages where some
CPUs found an error and halved their interval, while others didn't
see an error and doubled their interval. The machine check banks for
memory corrected errors are socket scoped, so when an error is logged,
whichever CPU on the socket polls next will find it.
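
The effect of the shared scope can be sketched with a toy model.
Assumptions: a single flag stands in for a socket-scoped bank, and
struct shared_bank and poll_bank() are hypothetical names, not kernel
interfaces.

```c
#include <stdbool.h>

/* Hypothetical model: one corrected-error record shared by a whole socket. */
struct shared_bank {
	bool error_pending;
};

/*
 * One CPU polls the shared bank.
 * Returns true if this CPU found the error. Finding it also clears it,
 * so the other CPUs on the socket see an empty bank on their next poll.
 */
static bool poll_bank(struct shared_bank *bank)
{
	bool found = bank->error_pending;

	bank->error_pending = false;
	return found;
}
```

With an error pending, the first CPU to call poll_bank() gets true and
would halve its interval; every later caller gets false and would double.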

Both mcelog and EDAC were invoked on the MCE decoder chain and logged
errors OK.

When I stopped injecting, all the CPUs doubled back up to the maximum
polling interval.

Summary: This is working as well as can be expected given the shared
scope of the machine check banks. If Linux were to understand the
scope of machine check banks it might designate a single CPU in
that scope to do the polling. But Intel doesn't make it easy to derive
the scope. In any case, the common case is CMCI enabled.

-Tony