Re: [PATCH 2/2] perf: Don't throttle based on NMI watchdog events

From: Calvin Owens

Date: Tue Mar 31 2026 - 17:13:46 EST


On Tuesday 03/31 at 11:10 -0700, Calvin Owens wrote:
> On Tuesday 03/31 at 10:22 -0700, Calvin Owens wrote:
> > On Tuesday 03/31 at 08:25 -0700, Calvin Owens wrote:
> > > @@ -663,6 +666,17 @@ void perf_sample_event_took(u64 sample_len_ns)
> > > if (avg_len <= max_len)
> > > return;
> > >
> > > + /*
> > > + * Very infrequent events like the perf counter hard watchdog
> > > + * can trigger spurious throttling: skip throttling if the prior
> > > + * NMI got here more than one second before this NMI began.
> > > + */
> > > + now = local_clock();
> > > + delta = now - __this_cpu_read(last_throttle_clock);
> > > + __this_cpu_write(last_throttle_clock, now);
> > > + if (delta - sample_len_ns > NSEC_PER_SEC)
> > > + return;
> >
> > Bah, Sashiko caught something obvious I missed:
> >
> > https://sashiko.dev/#/patchset/cover.1774969692.git.calvin%40wbinvd.org
> >
> > >> When the outer handler completes, its sample_len_ns (total execution
> > >> time) will be strictly greater than delta (time since the inner
> > >> handler finished). This guarantees delta < sample_len_ns, causing the
> > >> subtraction to underflow to a massive positive value.
> > >>
> > >> The condition > NSEC_PER_SEC will then evaluate to true, and the outer
> > >> handler will erroneously skip the perf throttling logic. Should this
> > >> check be rewritten to avoid subtraction, perhaps by using if (delta >
> > >> sample_len_ns + NSEC_PER_SEC)?
> >
> > The solution it proposed makes sense to me.
>
> I replied too quickly: I think Sashiko is actually wrong.

Last time, I swear to god. I worked this out by hand, and Sashiko is
indeed correct: nested NMIs can make the check spuriously skip throttling.

The relevant RMW is:

now = local_clock();
delta = now - last_throttle_clock;
last_throttle_clock = now;

Assume last_throttle_clock starts at zero.

Normal case:

NMI >>> sample_len_ns=1000ns
now = 1010
delta = 1010 - 0 = 1010
last_throttle_clock = 1010
(1010 - 1000 > NSEC_PER_SEC) == false
// does not skip throttle

Nesting case 1:

NMI >>> sample_len_ns=1000ns
now = 1010
  NMI >>> sample_len_ns=1000ns
  now = 2020
  delta = 2020 - 0 = 2020
  last_throttle_clock = 2020
  (2020 - 1000 > NSEC_PER_SEC) == false
  // does not skip throttle
delta = 1010 - 2020 = *underflow*
last_throttle_clock = 1010
(*underflow* - 1000 > NSEC_PER_SEC) == true
// erroneously skips throttle

Nesting case 2:

NMI >>> sample_len_ns=1000ns
now = 1010
delta = 1010 - 0 = 1010
  NMI >>> sample_len_ns=1000ns
  now = 2020
  delta = 2020 - 0 = 2020
  last_throttle_clock = 2020
  (2020 - 1000 > NSEC_PER_SEC) == false
  // does not skip throttle
last_throttle_clock = 1010
(1010 - 1000 > NSEC_PER_SEC) == false
// does not skip throttle, but stomps last_throttle_clock with the stale
// 1010, so the next NMI computes its delta from the wrong base

I think the patch below deals with both cases: if an inner NMI has updated
last_throttle_clock in the meantime, the outer handler's cmpxchg fails and
it never skips throttling. But I will wait to hear back before sending a v2.

Thanks,
Calvin

---
kernel/events/core.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89b40e439717..c51d61fbb03b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
*/
#define NR_ACCUMULATED_SAMPLES 128
static DEFINE_PER_CPU(u64, running_sample_length);
+static DEFINE_PER_CPU(u64, last_throttle_clock);

static u64 __report_avg;
static u64 __report_allowed;
@@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
u64 max_len = READ_ONCE(perf_sample_allowed_ns);
u64 running_len;
u64 avg_len;
+ u64 last;
+ u64 now;
u32 max;

if (max_len == 0)
@@ -663,6 +666,18 @@ void perf_sample_event_took(u64 sample_len_ns)
if (avg_len <= max_len)
return;

+ /*
+ * Very infrequent events like the perf counter hard watchdog
+ * can trigger spurious throttling: skip throttling if the prior
+ * NMI got here more than one second before this NMI began. But
+ * if NMIs are nesting, never skip throttling.
+ */
+ now = local_clock();
+ last = __this_cpu_read(last_throttle_clock);
+ if (this_cpu_try_cmpxchg(last_throttle_clock, &last, now) &&
+ now - last > NSEC_PER_SEC)
+ return;
+
__report_avg = avg_len;
__report_allowed = max_len;

--
2.47.3