Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events

From: Calvin Owens

Date: Wed Apr 29 2026 - 18:10:15 EST

On Wednesday 04/29 at 20:38 +0000, sashiko-bot@xxxxxxxxxx wrote:
> > @@ -636,6 +639,19 @@ void perf_sample_event_took(u64 sample_len_ns)
> > if (avg_len <= max_len)
> > return;
> >
> > + /*
> > + * Very infrequent events like the perf counter hard watchdog
> > + * can trigger spurious throttling: skip throttling if the prior
> > + * NMI got here more than one second before this NMI began. But
> > + * never skip throttling if NMIs are nesting, or if any NMI runs
> > + * for longer than one second.
> > + */
> > + now = local_clock();
> > + last = __this_cpu_read(last_throttle_clock);
> > + if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
>
> On 32-bit architectures, is it safe to use __this_cpu_cmpxchg() on a 64-bit
> value in NMI context?
>
> For architectures like 32-bit x86, the generic fallback implementation
> achieves atomicity by disabling interrupts, which does not mask NMIs. If a
> nested NMI interrupts the sequence halfway through reading or writing the
> 32-bit halves of the u64, could this result in a torn read or write?
>
> A corrupted timestamp could cause a massive wrap-around in the time gap
> calculation, perpetually satisfying the > NSEC_PER_SEC bypass condition and
> silently disabling PMU throttling for all events on that CPU.
>
> > + now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)

If this is a problem, isn't it also a problem for the 64-bit store after
updating the EWMA just above this?

I guess last_throttle_clock could be a u32 and use the low clock bits,
that's sufficient with the one second limit... but I would appreciate a
real human opinion :)

> The comment indicates the intent is to skip throttling if the prior NMI
> occurred more than one second before this NMI began.

My comment is not very clear, I suppose. It should say:

"...skip throttling if the prior attempt to throttle occurred more
than one second ago, and the current NMI runtime was less than one
second. But never skip throttling if NMIs are nesting."

> However, local_clock() is sampled at the end of the current NMI, so now
> includes the current NMI's duration.
>
> If an NMI takes a very long time (e.g., 900ms) but started only 200ms after
> the previous NMI ended, the true gap before it began is 200ms. Because now
> includes the 900ms duration, now - last evaluates to 1100ms, which bypasses
> the throttling.
>
> Could this artificially inflate the time gap for long-running NMIs,
> directly defeating the throttling mechanism?

I think it's just saying "the code doesn't do what the comment says",
which is true. But what it is describing is sort of addressed below too.

> Additionally, is it possible for user-defined PMU events with expensive
> overhead (like heavy eBPF programs attached to slow tracepoints) to exploit
> this hardcoded frequency bypass?

No, I don't think ebpf programs run in NMI context?

> If an event triggers slower than 1Hz but takes a very long time to execute
> (e.g., 999ms), both now - last > NSEC_PER_SEC and sample_len_ns <
> NSEC_PER_SEC will be true.
>
> Such an event would bypass the moving average calculation and never be
> throttled, allowing it to consume nearly 100% of the CPU while it runs and
> violating the default 25% CPU time limit set by
> sysctl_perf_cpu_time_max_percent.

No matter where the limit is defined, NMIs firing at just under that
limit will skip throttling. But I guess it could be something like:

now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_MSEC

...to sort of costrain the worst case.

But this all feels really arbitrary. I thought about using the watchdog
interval sysctl minus some padding as the upper bound, but that seemed
unnecessarily complicated.

Maybe this second patch is just more trouble than its worth...
especially if, as Andi noted earlier, the PMU watchdog is not long for
this world...

Thanks,
Calvin

> --
> Sashiko AI review · https://sashiko.dev/#/patchset/cover.1777483573.git.calvin@xxxxxxxxxx?part=2