Re: [RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

From: David Woodhouse

Date: Tue May 19 2026 - 12:07:48 EST

On Tue, 2026-05-19 at 15:16 +0200, Miroslav Lichvar wrote:
> On Sun, May 17, 2026 at 10:25:37PM +0100, David Woodhouse wrote:
> > The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
> > provides a shared memory page containing a linear time function:
> > time = base + (counter - counter_value) × period. The guest can read
> > this at any time to determine the hypervisor's view of the current time,
> > without a VM exit.
>
> That sounds nice.
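
As a rough illustration of the linear time function quoted above, here
is a minimal sketch in C. The struct, its field names and the 32.32
fixed-point nanosecond period are assumptions made for the example; the
real vmclock ABI uses its own field names, units and widths. The
unsigned __int128 type is a GCC/Clang extension.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified snapshot of the shared page. These names and
 * units are illustrative only and do not match the real vmclock ABI.
 */
struct clock_snapshot {
	uint64_t base_ns;        /* time at counter_value, in nanoseconds */
	uint64_t counter_value;  /* counter reading corresponding to base_ns */
	uint64_t period_frac_ns; /* nanoseconds per counter tick, 32.32 fixed point */
};

/* time = base + (counter - counter_value) * period */
static uint64_t snapshot_to_ns(const struct clock_snapshot *s, uint64_t counter)
{
	uint64_t delta = counter - s->counter_value;

	/* Widen to 128 bits so the product cannot overflow before the shift. */
	unsigned __int128 prod = (unsigned __int128)delta * s->period_frac_ns;

	return s->base_ns + (uint64_t)(prod >> 32);
}
```

With a 2 GHz counter (0.5 ns per tick, so period_frac_ns = 1 << 31), a
counter reading 2000 ticks past counter_value yields base_ns + 1000.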

The design has two major purposes:

• Atomically letting the guest know that live migration has perturbed
its clock. Without this, some distributed databases which rely on
precision timestamps on transactions for eventual coherency were
getting corrupted when guests were live migrated.

• Avoiding the redundant work of having *hundreds* of guests on the
same host *all* calibrating the same underlying oscillator, while
enjoying the added fun of steal time as they're trying to do so.

Right now, the implementations in both QEMU and the EC2 Nitro
Hypervisor only implement part 1, the disruption signal.
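
To make the disruption signal concrete, here is a minimal sketch of how
a guest application might gate its timestamps on it. It assumes the
hypervisor publishes a marker value that changes whenever the clock has
been perturbed (for example by live migration); the struct and function
names are hypothetical, and reading the marker from the shared page is
left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-application state: the last marker value we validated. */
struct txn_clock {
	uint64_t seen_marker;
};

/*
 * A timestamp is only trusted if the marker read alongside it still
 * matches the value the application last resynchronised against.
 */
static bool timestamp_is_trustworthy(const struct txn_clock *c,
				     uint64_t marker_now)
{
	return marker_now == c->seen_marker;
}

/* Call once the application has re-established confidence in its clock. */
static void acknowledge_disruption(struct txn_clock *c, uint64_t marker_now)
{
	c->seen_marker = marker_now;
}
```

A database would check timestamp_is_trustworthy() before stamping a
transaction and hold off (or widen its error bound) until it has
resynchronised and called acknowledge_disruption().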

I plan for QEMU to use the vmclock_host driver from this series, along
with the QEMU patch I linked, to expose the host's real time clock for
guests to follow.

For dedicated hosting environments like EC2, we don't care very much
about the host's timekeeping; that host kernel exists *only* to host
KVM guests. The host userspace can ignore the host's timekeeping
completely and manage the relationship of the counter to real time
directly — and in some cases will have hardware which will latch the
actual CPU's counter at the moment of a 1PPS signal. We'll feed that
counter-to-realtime information *directly* to guests.

(And will probably export timekeeping_set_reference() via a syscall of
some kind so that we *can* set the host's clock from it too, if I can't
find a way to precisely do so through adjtimex.)

> > The existing ptp_vmclock driver already exposes this as a PTP clock for
> > userspace consumers (phc2sys, chrony). This series adds kernel-internal
> > consumption: the tick mechanism can clamp directly to the vmclock
> > reference, eliminating the need for NTP to discipline the guest clock.
>
> I'm not very familiar with the VM timekeeping and other code. If I
> understand this idea correctly, by loading the ptp_vmclock module the
> guest kernel is giving the host control of its clock.

Right *now*, the ptp_vmclock module is only providing a PTP clock for
userspace to discipline the kernel against, as noted above. But yes,
the intent of what I'm doing here is to bypass all that complexity and
manage the explicit counter-to-time relationship *directly* within the
guest kernel.

I did briefly play with simulating 1PPS, and injecting PPS events at
the precise time that a PPS signal *would* have triggered, to the
cycle:
https://lore.kernel.org/all/87cb97d5a26d0f4909d2ba2545c4b43281109470.camel@xxxxxxxxxxxxx/

> Changes in the host's REALTIME/MONOTONIC clock frequency are mirrored
> to the guest's clock.

Strictly, "changes in the realtime clock frequency advertised by the
vmclock device", but basically yes.

> Differences larger than 100 milliseconds are corrected by step,
> whether the guest applications like it or not. Smaller steps and
> errors accumulated due to a delay in the frequency update (is there a
> limit to this delay?) are corrected by the kernel NTP PLL (with the
> default time constant?).

That behaviour isn't set in stone for vmclock; I'm still only
experimenting with the part where it *can* set the frequency, and an
offset that the kernel will converge to and *stay* on.

Right now it just calls my ntp_set_time_offset() which doesn't step at
all, and always just injects via ->time_offset (the NTP PLL). Much the
same as legacy adjtime() AIUI.

> When the guest is migrated to a different host, the frequency offset
> between the two hosts is injected to the NTP frequency (assuming
> REALTIME clocks of the hosts have zero frequency error at that
> moment?).

When the advertised frequency changes (either due to the ongoing clock
discipline on the host, or because of migration to a new host), the new
frequency is injected directly into tick_length.
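
For a sense of scale, the size of such a frequency change can be
expressed in parts per billion from the old and new advertised counter
periods. This is only arithmetic on assumed inputs, not the code from
the series: the function name is hypothetical, both periods are taken
to be in the same (arbitrary) fixed-point unit, and __int128 is a
GCC/Clang extension.

```c
#include <stdint.h>

/*
 * Relative change in clock rate, in parts per billion, when the
 * advertised period per counter tick goes from old_period to
 * new_period. A longer period means more time per tick, so the
 * result is positive. Truncates toward zero.
 */
static int64_t period_change_ppb(uint64_t old_period, uint64_t new_period)
{
	__int128 diff;

	if (old_period == 0)
		return 0; /* no meaningful ratio; treated as "no change" here */

	diff = (__int128)new_period - (__int128)old_period;

	return (int64_t)((diff * 1000000000) / old_period);
}
```

For example, a period moving from 1000000000 to 1000000050 units is a
change of +50 ppb.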

> Have you considered a different approach that would address the
> problem with frequency step by adjusting the guest's clocksource
> frequency to match the original host? That would correct all system
> clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
> AUX clocks.

You mean TSC scaling to change the frequency of the actual counter?

When stepping between non-identical hosts, that might be helpful. But
we still have to deal with the variance of the counter over time even
without migration in the picture.

> The guest would still be in control of its clock and follow its own
> preferences to stepping, maximum frequency errors, etc. It could still
> compare the stability and accuracy of the host's clock and use it for
> synchronization only when it's actually better than other available
> time sources (some VPS providers are known to have poorly synchronized
> host clocks).

I think that mode is already available as a PTP clock, isn't it?

While of course it should be optional for the guest, I'm deliberately
optimising for the case here where the hosting provider *does* get it
right and *can* be trusted.

> An AUX clock could be used to more accurately compare
> frequencies of the two hosts, ignoring phase corrections.
>
> There is a work in progress for chrony to support MONOTONIC_RAW as the
> main clock. It would be nice if that could be corrected in migrations.

Not sure I understand this. I thought the whole point of MONOTONIC_RAW
is that it *isn't* skewed by NTP?

> That seems to be a common cause of disruptions of public NTP servers.
> Polling for notifications about clock changes caused by migrations and
> system suspend+resume would be useful in any case.

That much you can do today with /dev/vmclock even when it isn't
exposing the actual time information.

Timekeeping in migration is fairly hosed in KVM. I don't think there
are many implementations that actually set the TSC correctly on the
destination host. But that's a different story...
