Re: [RFC PATCH 1/4] timekeeping: Remove xtime_remainder from ntp_error accumulation

From: David Woodhouse

Date: Sat May 16 2026 - 04:25:59 EST


On Fri, 2026-05-15 at 15:49 -0700, John Stultz wrote:
> On Wed, May 13, 2026 at 2:02 PM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> >
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> >
> > The ntp_error accumulator tracks the difference between intended and
> > actual clock advance. Each tick it adds ntp_tick (the intended advance)
> > and subtracts what the clock actually advanced.
> >
> > The subtraction was (xtime_interval + xtime_remainder), but only
> > xtime_interval is actually added to xtime_nsec each tick.
> > xtime_remainder was a boot-time constant representing the rounding error
> > from converting the tick period to an integer number of counter cycles.
> > It was never added to xtime_nsec, so subtracting it from ntp_error
> > created a phantom credit that biased the dithering ratio.
> >
> > The effect is a systematic drift whose magnitude depends on the value of
> > xtime_remainder and the NTP frequency correction. NTP masks this by
> > continuously adjusting the frequency to compensate, but with a fixed
> > frequency (or an external reference clock like vmclock), the drift is
> > exposed.
> >
> > Also remove xtime_remainder from the mult computation in
> > timekeeping_adjust(), which used it to offset the division for the same
> > (incorrect) reason.
> >
> > Fixes: a386b5af8edd ("time: Compensate for rounding on odd-frequency clocksources")
>
> Hey, thanks for sending this out!
>
> Looks like it would be more accurate to say it is a revert then a fix?
>
> Hrm. So this is maybe too far back for my goldfish brain to really remember.
>
> My hazy sense of it was that for coarser-grained clocksources, there
> will be an inherent granularity error for what we want to use as an
> accumulation interval, and the hardware allows.
>
> So if we want a 1ms ntp interval, taking the hardware freq and finding
> cycles/ms, might be something like 10.4 (imaginary example here), so
> that rounds 10 cycles per interval which is really gives us a 962us
> interval.
>
> Now, the 10.4 cycles/ms is still accurate, and the resulting
> xtime_interval is still correct, but there is a resulting 38us
> difference between the xtime_interval and the ntp_tick due to this
> granularity error in the calculation., When we start accumulating
> ntp_error, it will grow at that 38us/ms rate, causing the internal
> logic to try to steer the clock faster to compensate.
>
> Thus, instead of assuming the ntp-interval is actually 1ms, we want it
> to match the actual accumulation interval, so we subtract that error
> off of the ntp-interval.  That "phantom credit" is effectively
> intended to address this inherent error (which is why it isn't applied
> to the other side - since it's already there as part of the
> calculation).
>
> Now, again, it's been far too long, and I may be misremembering
> details, and the correction may very well be unnecessary or otherwise
> erronious! But I think it deserves a bit more evidence/rationale in
> the commit message as to why.

Thanks. This has been making my brain hurt for most of the last week,
but I think I finally have a handle on it.

It looks like we track three different times (ignoring units):

• A: The xtime that we actually output to the vDSO/etc.

• B: xtime+ntp_error is the time we *want* to be outputting right now,
but the mult dithering and monotonicity clawback keep us from it.

• C: xtime+ntp_error+time_offset is the time we *eventually* want to
output, once we've finished skewing towards it.

On each tick:

• xtime (A) advances by xtime_interval (and the clawback; there's
another patch in my tree to account for that in ntp_error now).

• (B) advances by whatever tick_length happens to be at the moment
(adjusted by second_overflow to effect a skew).

• (C) advances by the original tick_length_base actually set according
to the adjtimex frequency.

So ntp_error, being the delta between (B) and (A), needs to advance by
tick_length - xtime_interval. Before this patch, xtime_remainder was
*also* being subtracted from the 'what xtime advanced' side, but it
isn't actually added to xtime; it *is* roughly the amount that needed
to be accumulated in ntp_error here (except for the fact that
xtime_remainder was calculated once at boot time and never updated).

I spent a while booting kernels in QEMU with a VMClock reference clock
precisely specifying the TSC to real-time relationship for (C), and
tracking the *nanosecond* delta of the output from what it was told.

I made it redundantly track (B) and (C) above as absolute values, so
that I could spot per-tick when the accounting of the existing *deltas*
in ntp_error and time_offset went astray (and compare C with the actual
refclock, to check that too).

In my test tree, xtime now correctly dithers precisely around the
desired time (B) and doesn't continually drift from it any more.

And I can even inject, say, a 10μs delta via time_offset and it'll skew
by *exactly* 10,000ns and stay there, still with with single digit
nanosecond divergence from where it's meant to be.

So now I don't *need* the external oracle to drive the dithering per
tick for the reference clock (as I had added in patch 3 in this
series), because the kernel can actually stay on the y=mx+c line
configured by tick_length and ntp_error/time_offset all by itself, so
all I have to do is set the *existing* parameters in
timekeeping_set_reference().

(Arithmetic precision would *eventually* catch up with it, of course,
but in reality it won't be following a hard-coded refclock and the
reference itself will be periodically adjusted as the real counter
varies, and reclamping will occur.)

I don't much like the complexity of the way that time_offset skews
ntp_tick, and it's somewhat redundant with the mult dithering that came
later. I'm playing with folding both time_offset and ntp_error into a
single 'phase_error' and doing the larger skew purely by adjusting
mult, like the dithering does... but by larger values. That's a WIP at
the top of my current working tree:

David Woodhouse (6):
timekeeping: Remove xtime_remainder from ntp_error accumulation
timekeeping: Account for clawback adjustment in ntp_error
timekeeping: Clamp time_offset delta to prevent infinite tail
timekeeping: Add absolute reference for feed-forward clock discipline
ptp_vmclock: Feed reference to timekeeping for feed-forward discipline
timekeeping: Unify ntp_error and time_offset into phase_error

https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/ffclock

I believe everything in there is correctly parameterised and nothing
makes assumptions that are invalid for different CONFIG_HZ and for much
lower-rate timers. I'm almost at the point of being happy enough with
the overall state that I'm ready to start testing those cases too...

Attachment: smime.p7s
Description: S/MIME cryptographic signature