Re: [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

From: David Woodhouse

Date: Wed May 20 2026 - 08:31:52 EST

On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
> On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> > The design has two major purposes:
>
> > • Avoiding the redundant work of having *hundreds* of guests on the
> > same host *all* calibrating the same underlying oscillator, while
> > enjoying the added fun of steal time as they're trying to do so.
>
> But isn't that work still duplicated, only moved to the kernel?

Not the actual calibration of the TSC against real time, no. It is the
*host* which gets the 1PPS signal and does all the work of tracking and
smoothing the frequency drift over time. The guest basically gets the
same as a vDSO, *telling* it a relationship from TSC to real time.
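
To make "a relationship from TSC to real time" concrete, here is a minimal sketch of the linear mapping a guest could apply. The struct and field names are hypothetical and much simpler than the real vmclock page layout; the 32.32 fixed-point period is an assumption chosen for illustration, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <stdint.h>

/* Hypothetical, simplified snapshot of what the host publishes.
 * The real vmclock structure has more fields and a different layout. */
struct ref_snapshot {
	uint64_t counter_value;   /* counter reading at the reference point */
	uint64_t time_ns;         /* real time at that reading, in ns */
	uint64_t period_frac_ns;  /* ns per counter tick, 32.32 fixed point (assumed format) */
};

/* Convert a raw counter reading to nanoseconds of real time using the
 * host-provided reference point and period. */
static uint64_t counter_to_ns(const struct ref_snapshot *s, uint64_t counter)
{
	uint64_t delta = counter - s->counter_value;
	/* 128-bit intermediate so a large delta cannot overflow the multiply. */
	unsigned __int128 ns = (unsigned __int128)delta * s->period_frac_ns;

	return s->time_ns + (uint64_t)(ns >> 32);
}
```

The guest does no calibration of its own here; it only evaluates the line the host handed it.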

Many guests in trustworthy hosting environments will just use that and
want to feed it directly to the guest kernel timekeeping. Others might
want to take a more opinionated stance, as you describe below. Those
probably *would* duplicate some of the effort, in order to form their
opinion.

> The userspace part could be a simple loop waiting for vmclock
> notifications and following the changes of the host. The only
> difference would be a longer delay, but still insignificant for the
> intended purpose, right?
>
> I don't like the idea of adding more clock control loops to the kernel
> much.

I completely agree. I am absolutely not planning to add any more clock
control to the kernel than we already have. As you say, we probably
have too many already.

> It's a complexity that will likely grow as different
> requirements come and the code will be even more difficult to
> understand. IMHO the NTP PLL and hard PPS loops shouldn't have been
> included in the kernel. The kernel time control API should have been
> just setting/stepping the time and changing the frequency, both possibly
> at a specified time instead of the time of the call.

There is merit in that argument.

The kernel already has a separation between the core timekeeping code
in timekeeping.c and the rest of the NTP code in ntp.c which does the
higher level control.

The timekeeping_set_reference() added in my patch *only* uses the
existing basic timekeeping code, taking the vDSO-like information that
I mentioned above, and using it to set the frequency and offset for the
kernel's core timekeeping to follow.

There's a cleaner version in my tree now, because having fixed all the
errors in the core timekeeping which were introducing drift, the
implementation of timekeeping_set_reference() can be a *whole* lot
simpler than it was in my initial proof of concept — it now really can
just set the tick length and time_offset, and let it run:
https://git.infradead.org/?p=users/dwmw2/linux.git;a=commitdiff;h=c62bf50eca

> > > Have you considered a different approach that would address the
> > > problem with frequency step by adjusting the guest's clocksource
> > > frequency to match the original host? That would correct all system
> > > clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
> > > AUX clocks.
> >
> > You mean TSC scaling to change the frequency of the actual counter?
>
> Yes, in hardware if available, or in software if not. An additional
> 32-bit multiplier applied like this:
>
> cycles += (cycles * mult) >> shift
>
> Larger adjustments can be done in the normal multiplier for all clocks.
>
> > When stepping between non-identical hosts, that might be helpful. But
> > we still have to deal with the variance of the counter over time even
> > without migration in the picture.
>
> Whatever is synchronizing the guest clock to the host (using the PHC
> or vmclock page) will take care of that? The point is to avoid
> migrations causing a frequency step.
>
> I'm not sure what identical and non-identical hosts mean in this
> context, same nominal CPU frequency, or a CPU tied to the same crystal
> oscillator?

I meant same nominal frequency.

I'm not sure what scaling the guest TSC would buy us. Sure, it would
minimise the frequency step at the moment of migration, but a naïve
guest which isn't using vmclock's disruption signal is screwed on live
migration *anyway*, because there's *also* a step change in the actual
TSC value which is bounded by the real time synchronization of the
source and destination host.

Anything the guest has done for itself to calibrate the source host's
TSC must be entirely thrown away on migration. The vmclock allows the
destination host to immediately say "here, use this instead".

AFAICT scaling the TSC would just add complexity and wouldn't help
much.
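
For reference, the extra multiplier quoted above (cycles += (cycles * mult) >> shift) can be sketched as below. This is only an illustration of the arithmetic, not code from any patch; the function name is hypothetical, `unsigned __int128` is a GCC/Clang extension, and with an unsigned mult it can only speed the counter up.

```c
#include <stdint.h>

/* Apply a small additional frequency adjustment to a cycle count:
 *   cycles += (cycles * mult) >> shift
 * The product is computed in 128 bits so it cannot overflow. */
static uint64_t scale_cycles(uint64_t cycles, uint32_t mult, unsigned int shift)
{
	unsigned __int128 adj = (unsigned __int128)cycles * mult;

	return cycles + (uint64_t)(adj >> shift);
}
```

With shift = 32, mult = 1 << 22 scales the counter up by a factor of 1/1024.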

And TSC scaling is pretty much x86-specific; other architectures have a
*defined* counter frequency and don't need to support scaling.

I'm not a fan :)

> > > The guest would still be in control of its clock and follow its own
> > > preferences to stepping, maximum frequency errors, etc. It could still
> > > compare the stability and accuracy of the host's clock and use it for
> > > synchronization only when it's actually better than other available
> > > time sources (some VPS providers are known to have poorly synchronized
> > > host clocks).
> >
> > I think that mode is already available as a PTP clock, isn't it?
>
> Yes, but it's slow due to missing frequency transfer, not feed-forward
> as you call it. The host's frequency offset could be exposed in the
> PHC's timex.

Yes, that makes a lot of sense.

You can literally open /dev/vmclock and consume it *however* you like
from userspace. You can even poll() and get woken when there's an
update. I think that would be a great thing for chrony to learn to do
(and that's how you get the disruption signal too).
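
A minimal sketch of the "poll() and get woken" part, using only the standard poll(2) call. The helper name is hypothetical, and treating POLLIN as the update indication is an assumption about the driver's behaviour rather than something stated here.

```c
#include <poll.h>

/* Block until fd is readable or timeout_ms expires.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_for_update(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	/* Anything other than POLLIN (e.g. POLLERR) is reported as an error. */
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A consumer would open the vmclock device (the exact node name, e.g. /dev/vmclock0, is an assumption), call this in a loop, and re-read the clock parameters each time it returns 1.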

> > > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > > main clock. It would be nice if that could be corrected in migrations.
> >
> > Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> > is that it *isn't* skewed by NTP?
>
> It isn't adjusted, but it can be used as a stable reference avoiding
> the multiplier-induced jitter, interference from other processes, and
> synchronization loops, e.g. when an NTP client is synchronizing to an
> NTP server running on the same system (in different containers).

We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
we? Do all our clock discipline of the *TSC* against the external
sources, and then use the same timekeeping_set_reference() to ask the
kernel's core timekeeping to track the TSC-to-realtime relationship
that we desire?

That's exactly what I'm planning to do for a dedicated hosting
environment. I think the patches which allow PTP to return paired
timestamps with reference to TSC instead of CLOCK_MONOTONIC landed in
the net-next tree today?

(for TSC, read 'arch counter, timebase, etc.' — none of this is x86-
specific but 'TSC' is quicker to type...)
