Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64

From: David Woodhouse

Date: Thu May 21 2026 - 16:00:02 EST


On Thu, 2026-05-21 at 20:06 +0200, Thomas Gleixner wrote:
> On Thu, May 21 2026 at 12:43, David Woodhouse wrote:
>
> > With PTM and the various virt enlightenments, there are an increasing
> > number of "PTP" clocks which literally *just* tell us a cycle count at
> > the moment of the corresponding reading, and we don't need the ABA
> > sandwich of local clock readings around it any more.
>
> That's what the PRECISE ioctl and ptp::info::getcrosstimestamp() is for,
> which utilizes get_device_system_crosststamp() and provides the
> translation from PTM, on x86 that's ART which is converted to TSC.
>
> It then correlates the converted system time stamp to CLOCK_REALTIME and
> CLOCK_MONOTONIC_RAW.
>
> That has been implemented long ago exactly for that purpose, but people
> need to reinvent the wheel over and over just because it makes the
> kernel so more maintainable.
>
> > I've got another one waiting in the wings which will literally just
> > latch the counter value when a PPS signal comes in, too.
>
> That can use a similar mechanism.

Yep. In both cases, some userspace actually wants the *counter value*.

> > It might be good to harmonise the way we convert to KVM clock if we're
> > going to support that (although I'll accept an argument that any guest
> > which chooses to use the KVM clock doesn't *deserve* to have accurate
> > timekeeping).
>
> :)

I'm not sure I'm joking. The KVM clock is.... *hosed*. It was invented
to work around the ancient broken variable-speed TSCs which stopped on
HLT, etc.

In *those* days, sure it made sense for the hypervisor to constantly
update a data structure which let you approximate a TSC→nanoseconds
conversion.

These days, a guest using the *actual* TSC is going to be a lot more
accurate, and guests basically shouldn't be using the kvmclock AT ALL.

When I made it DTRT for ptp_vmclock, I did look briefly at what it
would take to do it centrally in common code.

But that lumbers the common PTP code with x86 arch-specific horridness
about the KVM clock. And it literally has to call back into the driver
in the seqcount loop, so I guess drivers would need to have a flag or
something that says it's OK to do that (they're fast enough, or
something?).

I didn't spent *that* long trying, but I couldn't see a way of doing it
centrally, which didn't end up making me sadder than the initial
duplication.

Honestly, whether it was intent or omission, I suspect the approach
Harshitha took for GVE is the right one: Support TSC, and anyone who
uses KVM clock doesn't deserve accurate time anyway.

> > I'd also like consensus on exposing it to userspace. I think Thomas is
> > going to suggest that we should always convert to MONOTONIC_RAW, but
> > what I asked Arthur to do in
> > https://lore.kernel.org/all/20260515164033.6403-1-akiyano@xxxxxxxxxx/
> > exposes the raw counter values instead.
>
> I'm not against that per se, but that really wants to be a consistent
> snapshot and not some pulled out of thin air cycles value.

Definitely.

> ktime_get_snapshot() provides that already today, so that can be
> arguably used for pre/post timestamp mechanisms, except that it lacks
> support for AUX clocks.
>
> get_device_system_crosststamp() does not provide cycles and clock id,
> but that's not rocket science to extend.

Yep.

> > From the dedicating hosting point of view, we *only* care about the
> > counter, and absolutely DGAF about the host's timekeeping. Migrating
> > KVM guests requires getting the guest TSC right (in terms of scaling
> > factors and offset from the host TSC), and feeding accurate time to
> > guests is done in terms of the guest TSC to realtime relationship.
>
> The offset has nothing to do with the scaling. It's the delta to the TSC
> value which the migrated guest saw last when it was packed up for
> migration.

Sure. As long as that guest TSC value is a consistent snapshot and not
some pulled out of thin air cycles value :)

There is lots to be said about accurate KVM migration, and a series of
30-odd patches on the KVM list which attempts to improve it.

But either way, we still DGAF about the host's CLOCK_REALTIME when
we're doing it at scale.

KVM doesn't *have* a clean way to get/set the guest TSC based on
realtime; it only has the scaling and offsets from the host TSC. So to
do things precisely, we end up converting via the host TSC.

So we end up comparing a {host TSC, realtime} pair between source and
destination hosts, calculating the *guest* ticks that should have
occurred in the intervening time, and setting the "offset from host" on
the destination host to achieve the desired results. Almost all of
which is in terms of TSC cycles and offsets.

> > So our ideal state is that we discipline the TSC against real time in
> > userspace, and enabling those PHC drivers to *give* us the raw cycle
> > counts that they actually measured seemed best.
>
> As I said, I'm not against that per se, but then we want to provide
> interfaces which are generic enough so they can be used for your
> purposes, but also for the requirements of chrony et al. Looking at the
> patch you linked to above:
>
> > +static long ptp_sys_offset_precise_attrs(struct ptp_clock *ptp, void __user *arg)
> > +{
> > + struct ptp_sys_offset_precise_attrs precise_offset_attrs;
> > + struct system_device_crosststamp xtstamp;
> > + struct ptp_clock_attributes att = {};
> > + struct timespec64 ts;
> > + int err;
> > +
> > + if (!ptp->info->getcrosststampattrs)
> > + return -EOPNOTSUPP;
> > +
> > + err = ptp->info->getcrosststampattrs(ptp->info, &xtstamp, &att);
> > + if (err)
> > + return err;
> > +
> > + memset(&precise_offset_attrs, 0, sizeof(precise_offset_attrs));
> > + ts = ktime_to_timespec64(xtstamp.device);
> > + precise_offset_attrs.device.pct.sec = ts.tv_sec;
> > + precise_offset_attrs.device.pct.nsec = ts.tv_nsec;
> > + precise_offset_attrs.device.att.error_bound = att.error_bound;
> > + precise_offset_attrs.device.att.timescale = att.timescale;
> > + precise_offset_attrs.device.att.status = att.status;
> > + precise_offset_attrs.device.att.counter_id = att.counter_id;
> > + precise_offset_attrs.device.att.counter_value = att.counter_value;
>
> That's exactly the hackery which is counterproductive. Why?

That's the same pattern as in the existing ptp_sys_offset_precise(),
isn't it? Copying from the internal data structures it got from the
driver, into the specific userspace structure for the ioctl that was
called.

> Where does ptp::info::getcrosststampattrs() retrieve the system counter
> value from?
>
> Either it has to convert from the ART based timestamp to TSC magically
> by circumventing the core code or it has to use get_cycles() separately
> which defeats the whole purpose of a hardware latched combo timestamp.

The point in ptp_sys_offset_precise is that all the timestamps are the
*same* time, isn't it?

So surely a device can only support this if it *does* have a precise
hardware timestamp, e.g. from ART? And the corresponding sys_realtime
and sys_monoraw fields also have to be from the *same* value, surely?

Is that not how it works? If not.... WTF *is* it for?

We're allowed to set *fire* to anyone who just throws a random
get_cycles() call in there, aren't we? That's basically just self-
defence, m'lud.


> While this:
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index aee2c1a46e47..43a55f618235 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -296,19 +296,6 @@ struct system_time_snapshot {
>   u8 cs_was_changed_seq;
>  };
>  
> -/**
> - * struct system_device_crosststamp - system/device cross-timestamp
> - *       (synchronized capture)
> - * @device: Device time
> - * @sys_realtime: Realtime simultaneous with device time
> - * @sys_monoraw: Monotonic raw simultaneous with device time
> - */
> -struct system_device_crosststamp {
> - ktime_t device;
> - ktime_t sys_realtime;
> - ktime_t sys_monoraw;
> -};
> -
>  /**
>   * struct system_counterval_t - system counter value with the ID of the
>   * corresponding clocksource
> @@ -325,6 +312,21 @@ struct system_counterval_t {
>   bool use_nsecs;
>  };
>  
> +/**
> + * struct system_device_crosststamp - system/device cross-timestamp
> + *       (synchronized capture)
> + * @device: Device time
> + * @sys_counter: Clocksource counter value simultaneous with device time
> + * @sys_realtime: Realtime simultaneous with device time
> + * @sys_monoraw: Monotonic raw simultaneous with device time
> + */
> +struct system_device_crosststamp {
> + ktime_t device;
> + struct system_counterval_t sys_counter;
> + ktime_t sys_realtime;
> + ktime_t sys_monoraw;
> +};
> +
>  extern bool ktime_real_to_base_clock(ktime_t treal,
>        enum clocksource_ids base_id, u64 *cycles);
>  extern bool timekeeping_clocksource_has_base(enum clocksource_ids id);
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index c493a4010305..11df7e25bdca 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1506,6 +1506,7 @@ int get_device_system_crosststamp(int (*get_time_fn)
>   nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, cycles);
>   } while (read_seqcount_retry(&tk_core.seq, seq));
>  
> + xtstamp->sys_counter = system_counterval;
>   xtstamp->sys_realtime = ktime_add_ns(base_real, nsec_real);
>   xtstamp->sys_monoraw = ktime_add_ns(base_raw, nsec_raw);
>  
> Provides the information coherently and without any extra hacks. No?

Yeah, that looks eminently reasonable. It isn't actually that far off
what Arthur did in patch 1 of the series, is it? He was adding other
clock quality attributes at the same time, and ended up shoe-horning
the cs_id and cycle_count into those instead of using a separate
system_counterval_t as you have here, which *is* cleaner. So yes, let's
get him to do that, but that part is cosmetic?

> Extending get_device_system_crosststamp() for AUX clocks is on my todo
> list for a long time, but I did not come around to it yet. It's not
> really complicated to make that work.
>
> The actual ena_phc_gettimexattrs64() implementation does not provide the
> counter value at all despite the fact that it could trivially do so with
> a modified ktime_get_snapshot() except for AUX clock IDs, but that's a
> solvable problem. See uncompiled PoC below.

Huh? Unless it has PTM and is converting from ART (or is something
virtual like vmclock or the older non-LM-safe KVM hacks), surely the
driver has no business pretending to support 'precise' mode, and no
business providing a counter value?

In this case, we'd want the pre_ts and post_ts to be generated by the
common code using ktime_get_snapshot_id(cs_id) so that we have the
corresponding system_counterval_t in the 'A' parts of the ABA tuple...
but the device part of it can't pull one out of thin air unless it is
*part* of the device reading, surely?

> I really have to ask why we put a lot of effort into consolidated
> infrastructure, when every drivers/foo/ subsystem decides it needs it's
> own absymal hacks just because sitting down and extending the generic
> infrastructure is asked too much. TBH, that's frustrating as hell and
> no, none of these usecases is so special that it would justify a single
> one of these hacks.

Meh. I'm literally here trying to join the dots between timekeeping,
PTP and KVM, and come up with a holistic solution that makes sense for
all the use cases. And my brain hurts... :)


> -EXPORT_SYMBOL_GPL(ktime_get_snapshot);
> +EXPORT_SYMBOL_GPL(ktime_get_snapshot_id);

Ack. I think Arthur needs to use that in the pre_ts/post_ts helpers. Thanks.

Attachment: smime.p7s
Description: S/MIME cryptographic signature