Re: [RFC PATCH 1/4] timekeeping: Remove xtime_remainder from ntp_error accumulation

From: David Woodhouse

Date: Tue May 19 2026 - 05:22:45 EST

On Mon, 2026-05-18 at 18:37 -0700, John Stultz wrote:
> On Sat, May 16, 2026 at 1:25 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> > Thanks. This has been making my brain hurt for most of the last week,
> > but I think I finally have a handle on it.
> >
> > It looks like we track three different times (ignoring units):
> >
> > • A: The xtime that we actually output to the vDSO/etc.
> >
> > • B: xtime+ntp_error is the time we *want* to be outputting right now,
> >       but the mult dithering and monotonicity clawback keep us from it.
>
> I think of B, or specifically ntp_error, as the delta between our time
> and what *NTP* (well, the in-kernel ntp machinery) wants the system to
> be right now.

Right. I think we're entirely on the same page there. The "right now"
is the key.

> As an aside, apologies if I'm asking obvious questions here, some of
> your terminology is unfamiliar. While it's not used around the code or
> patches, I can understand the dithering metaphor for the long-term
> error adjustments to effectively allow for sub-integer mult
> adjustments over time (similar to b&w dithering to approximate levels
> of grey), but there is also the common use of indecisively dithering
> time away or the astrophotography sense of intentionally adding
> noise, which I worry might cause some confusion to others as to what
> you mean.

Indeed. The two adjacent values of 'mult' will effectively make xtime
proceed slightly faster than, or slightly slower than, the actual
desired rate. It is exactly 'dithering' in the sense of approximating
levels of grey, tracking ntp_error precisely in order to choose between
mult vs. mult+1 for the next tick.
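
To make the grey-level analogy concrete, here is a toy model in Python.
It is only a sketch of the idea, not the kernel code: simulate_dither,
cycles_per_tick and all the numbers are made up for illustration, and
everything is in arbitrary "shifted nanosecond" units.

```python
def simulate_dither(mult, cycles_per_tick, tick_length, ticks):
    """Toy model: choose mult or mult+1 each tick from the sign of ntp_error.

    tick_length is how far the desired time advances per tick;
    cycles_per_tick * mult is how far xtime advances per tick.
    """
    ntp_error = 0        # desired time minus xtime, tracked as a delta
    max_abs_error = 0
    plus_one_ticks = 0
    for _ in range(ticks):
        # If xtime is behind the desired time, run slightly fast this tick.
        cur_mult = mult + 1 if ntp_error > 0 else mult
        plus_one_ticks += cur_mult - mult
        xtime_interval = cycles_per_tick * cur_mult
        ntp_error += tick_length - xtime_interval
        max_abs_error = max(max_abs_error, abs(ntp_error))
    return ntp_error, max_abs_error, plus_one_ticks
```

With mult=1000, cycles_per_tick=100 and tick_length=100030 the ideal
multiplier would be 1000.3. The model picks mult+1 on 3 ticks out of
every 10, and ntp_error never exceeds 60 units in magnitude.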

> Also I'm not sure it's very clear what you mean by "monotonicity clawback".

This is the offset applied by timekeeping_apply_adjustment() in order
to ensure that the observed xtime remains monotonic when the dithering
switches back from 'mult+1' to 'mult' and a consumer may have seen a
'later' time than it's about to set in {cycle_last,mult}.

(In fact it also prevents a forward jump, but IIUC that's less
critical.)

That offset was not being accounted for in ntp_error, and was
accumulating over time in my tests.

> > • C: xtime+ntp_error+time_offset is the time we *eventually* want to
> >       output, once we've finished skewing towards it.
>
> And yeah, I'd say C is where the userland NTP wants it to slew towards.
>
>
> > On each tick:
> >
> > • xtime (A) advances by xtime_interval (and the clawback; there's
> >    another patch in my tree to account for that in ntp_error now).
> >
> > • (B) advances by whatever tick_length happens to be at the moment
> >    (adjusted by second_overflow to effect a skew).
> >
> > • (C) advances by the original tick_length_base actually set according
> >    to the adjtimex frequency.
> >
> > So ntp_error, being the delta between (B) and (A), needs to advance by
> > tick_length - xtime_interval. Before this patch, xtime_remainder was
> > *also* being subtracted from the 'what xtime advanced' side, but it
> > isn't actually added to xtime; it *is* roughly the amount that needed
> > to be accumulated in ntp_error here (except for the fact that
> > xtime_remainder was calculated once at boot time and never updated).
>
> Again, I'm sure it could be miscalculated, or be misapplied, but as I
> mentioned previously, the xtime_remainder is trying to address a
> granularity error that is effectively baked into the delta between
> xtime_interval and the initial ntp interval (essentially the initial
> ntp_tick), which doesn't seem to be addressed here.

My understanding is that xtime_remainder *is* the delta between the
initial xtime_interval and the initial ntp_interval. It is calculated
once at boot and is then permanently out of sync once mult, and thus
xtime_interval, actually changes.

By simply adding ntp_interval and then subtracting the correct current
value of xtime_interval, logarithmic_accumulation doesn't *need* a
separate variable which tracks that delta. I don't quite understand why
it ever added it in the first place.
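
A toy illustration of that argument, in Python rather than kernel C.
The accumulate helper and its boot_remainder parameter are hypothetical
stand-ins, and the numbers are invented. The model keeps (A) and (B) as
absolute values and compares their difference with an incrementally
tracked ntp_error, with and without a fixed boot-time term.

```python
def accumulate(tick_length, cycles_per_tick, mults, boot_remainder=0):
    """Compare delta-based ntp_error against absolute bookkeeping.

    mults is the mult value in effect on each successive tick.
    boot_remainder is a fixed extra term subtracted alongside
    xtime_interval (hypothetical stand-in for a value computed once).
    Returns (ntp_error, true_error), where true_error is (B) - (A).
    """
    xtime = 0      # (A): what is actually output
    wanted = 0     # (B): what we want to be outputting right now
    ntp_error = 0  # the same difference, tracked incrementally
    for mult in mults:
        xtime_interval = cycles_per_tick * mult
        xtime += xtime_interval
        wanted += tick_length
        ntp_error += tick_length - (xtime_interval + boot_remainder)
    return ntp_error, wanted - xtime
```

With tick_length=100030, cycles_per_tick=100 and mult values
[1000, 1001, 1000, 1000, 1001], the plain accounting gives an ntp_error
of -50, matching (B) - (A). Subtracting a fixed remainder of 30 each
tick (100030 - 100 * 1000) gives -200 while the true delta is still -50.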

> For fine-grained clocksources like the TSC it's not likely a big issue,
> but for coarser grained clocksources it seems like just removing this
> would be a regression.

I think it should be fine with coarser grained clocksources. The
dithering sawtooth around the reference line will have a higher
amplitude, but adding ntp_interval and subtracting xtime_interval for
the mult value currently in effect is still the right thing to do.

> > I spent a while booting kernels in QEMU with a VMClock reference clock
> > precisely specifying the TSC to real-time relationship for (C), and
> > tracking the *nanosecond* delta of the output from what it was told.
> >
> > I made it redundantly track (B) and (C) above as absolute values, so
> > that I could spot per-tick when the accounting of the existing *deltas*
> > in ntp_error and time_offset went astray (and compare C with the actual
> > refclock, to check that too).
> >
> > In my test tree, xtime now correctly dithers precisely around the
> > desired time (B) and doesn't continually drift from it any more.
> >
> > And I can even inject, say, a 10μs delta via time_offset and it'll skew
> > by *exactly* 10,000ns and stay there, still with single-digit
> > nanosecond divergence from where it's meant to be.
>
> Sounds like very nice results!
>
> > So now I don't *need* the external oracle to drive the dithering per
> > tick for the reference clock (as I had added in patch 3 in this
> > series), because the kernel can actually stay on the y=mx+c line
> > configured by tick_length and ntp_error/time_offset all by itself, so
> > all I have to do is set the *existing* parameters in
> > timekeeping_set_reference().
> >
> > (Arithmetic precision would *eventually* catch up with it, of course,
> > but in reality it won't be following a hard-coded refclock and the
> > reference itself will be periodically adjusted as the real counter
> > varies, and reclamping will occur.)
>
> I'm excited your efforts seem to be doing well! The ntp/timekeeping
> core has been pretty static for a while now, outside of some of the ptp
> work, so it's nice to see new major contributions. That said, it being
> in place for so long, and apparently working for folks, means changes
> do need close consideration to make sure we don't introduce subtle
> regressions.
>

Absolutely. And I do definitely also need to do more testing with
coarser grained clock sources. I've obviously *tried* to ensure that
nothing I've done makes assumptions based around my test environment
which happens to be a 2400MHz TSC and CONFIG_HZ=1000, but we'll see how
it fares in actual testing. I'll also need to test a nohz setup.

As I see it, the timekeeping code already has a fairly clean split
between the core timekeeper and the ntp adjustment. The former is
*mostly* supposed to just keep the time progressing along a clean
y=mx+c relationship from counter to observed xtime, in the common case
where time_offset is zero. And setting time_offset should change the
value of 'c' in that formula, converging exponentially to *precisely*
that delta away from the line it was previously following.
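
As a rough picture of what "converge exponentially" means here, a toy
model in Python. The slew helper, the 1/2**shift fraction per step and
the 32-bit fixed-point scale are assumptions made for this sketch, not
values taken from the kernel or from this series.

```python
SCALE_SHIFT = 32  # assumed fixed-point scale: units are ns << 32


def slew(offset_scaled, shift, steps):
    """Move 1/2**shift of the remaining offset into 'c' on each step.

    Returns (applied, remaining); their sum is always offset_scaled.
    """
    applied = 0                # how far 'c' has moved so far
    remaining = offset_scaled  # what is still left to slew
    for _ in range(steps):
        chunk = remaining >> shift  # truncating division
        remaining -= chunk
        applied += chunk
    return applied, remaining
```

For a 10us offset (10000 << SCALE_SHIFT) with shift=2, nothing is lost
along the way: applied + remaining always equals the injected offset.
After 200 steps the part still outstanding is below 4 scaled units, far
less than a nanosecond.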

All I've done here, or so I believe, is make the timekeeping code
better at following that line it's given. The way that the NTP side
*gives* it that line to follow is pretty much untouched.
