Re: [patch 5/5] clocksource: Rewrite watchdog code completely
From: Daniel J Blueman
Date: Wed Mar 18 2026 - 10:26:30 EST
On Tue, 17 Mar 2026 at 17:01, Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
>
> On Sun, Mar 15 2026 at 22:59, Daniel J. Blueman wrote:
> > On Mon, 23 Feb 2026 at 21:53, Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
> > With that said, on the 16 socket (1920 thread) setup, we see most
> > remote calls end up timing out with WATCHDOG_READOUT_MAX_US at 50,
> > leading to excessive logging. pr_info_once() would be a good approach
> > to avoid the spam, however I still feel we should use a higher
> > (250-500us?) timeout to keep the mechanism effective.
> >
> > I also feel that if a remote hardware thread is seen to time out,
> > retrying has a high likelihood of timing out as well, so it may be
> > cheaper in the bigger picture not to retry. Sensitivity could be
> > increased by walking
> > threads in socket order (S0T0 ... S15T0 S0T1 ... S15T1 ...). These two
> > items are my only concerns.
>
> Right. So I ditched the immediate retry and replaced the hard coded
> timeout with a runtime calculated one when NUMA is enabled. It's a
> reasonable assumption that insanely big systems have properly
> initialized SLIT/SRAT tables. So we can just [ab]use the node distance
> to determine the timeout for a remote CPU.
The node distance matches well: after changing "if (wd_seq >
WATCHDOG_READOUT_MAX_NS) {" to use WATCHDOG_NUMA_MAX_TIMEOUT_NS, I saw
no regressions on the 16-socket/1920-thread system over 2h, both idle
and under an adverse all-thread "stress-ng --msyncmany 0" workload.
> ---
> From: Thomas Gleixner <tglx@xxxxxxxxxx>
> Subject: clocksource: Rewrite watchdog code completely
> Date: Sat, 24 Jan 2026 00:18:01 +0100
>
> From: Thomas Gleixner <tglx@xxxxxxxxxx>
>
> The clocksource watchdog code has over time reached the state of an
> impenetrable maze of duct tape and staples. The original design, which was
> made in the context of systems far smaller than today, is based on the
> assumption that the to-be-monitored clocksource (TSC) can be trivially
> compared against a known-to-be-stable clocksource (HPET/ACPI-PM timer).
>
> Over the years it turned out that this approach has major flaws:
>
> - Long delays between watchdog invocations can result in wraparounds
> of the reference clocksource
>
> - Scalability of the reference clocksource readout can degrade on large
> multi-socket systems due to interconnect congestion
>
> This was addressed with various heuristics which degraded the accuracy of
> the watchdog to the point that it fails to detect actual TSC problems on
> older hardware which exposes slow inter-CPU drifts due to firmware
> manipulating the TSC to hide SMI time.
>
> To address this and bring back sanity to the watchdog, rewrite the code
> completely with a different approach:
>
> 1) Restrict the validation against a reference clocksource to the boot
> CPU, which is usually the CPU/Socket closest to the legacy block which
> contains the reference source (HPET/ACPI-PM timer). Validate that the
> reference readout is within a bound latency so that the actual
> comparison against the TSC stays within 500ppm as long as the clocks
> are stable.
>
> 2) Compare the TSCs of the other CPUs in a round robin fashion against
> the boot CPU in the same way the TSC synchronization on CPU hotplug
> works. This still can suffer from delayed reaction of the remote CPU
> to the SMP function call and the latency of the control variable cache
> line. But this latency is not affecting correctness. It only affects
> the accuracy. With low contention the readout latency is in the low
> nanoseconds range, which detects even slight skews between CPUs. Under
> high contention this becomes obviously less accurate, but still
> detects slow skews reliably as it solely relies on subsequent readouts
> being monotonically increasing. It just can take slightly longer to
> detect the issue.
>
> 3) Rewrite the watchdog test so it exercises the various mechanisms one
> by one and validates the result against the expectation.
>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxx>
> Tested-by: Borislav Petkov (AMD) <bp@xxxxxxxxx>
> Reviewed-by: Jiri Wiesner <jwiesner@xxxxxxx>
> Link: https://patch.msgid.link/20260123231521.926490888@xxxxxxxxxx
> ---
> V2: Make it more cache line friendly and tweak it further for insanely
> big machines - Daniel
Signed-off-by: Daniel J Blueman <daniel@xxxxxxxxx>
Thanks and great work, Thomas!
Dan
--
Daniel J Blueman