Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64

From: Jordan Rhee

Date: Thu May 21 2026 - 00:35:29 EST

Hi Thomas,
Thank you for the review. We plan to rework this to use
https://lore.kernel.org/all/20260515164033.6403-1-akiyano@xxxxxxxxxx/,
which will let us use the midpoint of pre and post while still
propagating the uncertainty window up the stack. I am going to address
some of the concerns you raised simply for the record, not because I
am trying to convince you one way or another. I do not expect a
response. Thank you,
Jordan

On Wed, May 20, 2026 at 2:08 PM Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
>
> On Thu, May 14 2026 at 22:58, Harshitha Ramamurthy wrote:
> > From: Jordan Rhee <jordanrhee@xxxxxxxxxx>
> >
> > Enable chrony and phc2sys to synchronize system clock to NIC clock.
> >
> > Two paths are implemented: a precise path using system counter values
> > sampled by the device, and a fallback path using system counter values
> > sampled in the driver using ptp_read_system_prets()/postts().
> >
> > To use the precise path, the current system clocksource must match the
> > units returned by the device, which on x86 is X86_TSC and on ARM64 is
> > ARM_ARCH_COUNTER. The clockid requested for the cross-timestamp must
> > be either CLOCK_REALTIME or CLOCK_MONOTONIC_RAW. These conditions hold
> > by default on GCP VMs using Chrony, so we expect the precise path to be
> > used the vast majority of the time. If the system clocksource is changed
> > to kvm-clock, it activates the fallback path. Ethtool counters have been
> > added to count how many times each path is used.
> >
> > The uncertainty window in the precise path is typically around 1-2us,
> > while in the fallback path is around 60-80us.
> >
> > Stub implementions of adjfine and adjtime are added to avoid NULL
> > dereference when phc2sys tries to adjust the clock.
>
> Sorry that I did not react to this earlier, but I was burried in
> regressions and allowed myself to take a few days off.
>
> > +/*
> > + * Read NIC clock by issuing the AQ command. The command is subject to
> > + * rate limiting and may need to be retried. Requires nic_ts_read_lock
> > + * to be held.
> > + */
> > +static int gve_ptp_read_timestamp(struct gve_ptp *ptp, cycles_t *pre_cycles,
> > + cycles_t *post_cycles)
> > +{
> > + unsigned long deadline = jiffies + msecs_to_jiffies(100);
> > + unsigned long delay_us = 1000;
> > + int err;
> > +
> > + lockdep_assert_held(&ptp->nic_ts_read_lock);
> > +
> > + do {
> > + *pre_cycles = get_cycles();
>
> This is a completely non-sensical hack. Why do you need get_cycles()
> here, which is a horrible and ill-defined interface that others are
> trying to get rid of?

We use get_cycles() purely as a sanity check against the TSC values
returned by firmware. We do not use them in the time conversion.
This check found a serious correctness bug in the firmware.

>
> Just because it's still there is not a good answer and it's actually not
> required at all. See below.
>
> > + err = gve_adminq_report_nic_ts(ptp->priv,
> > + ptp->nic_ts_report_bus);
> > +
> > + /* Prevent get_cycles() from being speculatively executed
> > + * before the AdminQ command
> > + */
> > + rmb();
> > + *post_cycles = get_cycles();
> > + if (likely(err != -EAGAIN))
> > + return err;
> > +
> > + fsleep(delay_us);
> > +
> > + /* Exponential backoff */
> > + delay_us *= 2;
> > + } while (time_before(jiffies, deadline));
> > +
> > + return -ETIMEDOUT;
> > +}
> > +
> > /* Read the nic timestamp from hardware via the admin queue. */
> > -static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw)
> > +static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw,
> > + struct gve_sysclock_sample *sysclock)
> > {
> > + cycles_t host_pre_cycles, host_post_cycles;
> > + struct gve_nic_ts_report *ts_report;
> > int err;
> >
> > mutex_lock(&ptp->nic_ts_read_lock);
>
> Why is this lock taken _after_ the regular (fallback) pre/post
> timestamps are aquired?
>
> What's wrong with doing:
>
> do {
> guard(mutex)(&ptp->nic_ts_read_lock);
>
> ptp_read_system_prets(sts);
> err = read_nic_timestamp(...);
> ptp_read_system_postts(sts);
> } while (err && !timeout(...));
>
> return err;
>
> which moves the "fallback" timestamp right next to the actual hardware
> readout?

The lock is taken at the smallest scope possible, which does not
include the pre/post TS reading. This routine is called by multiple
callers, not all of which need the pre/post timestamp. Options are to:
- move the pre/post TS reading here and add branches to conditionally
sample them only when needed
- always capture the pre/post TS readings and discard them when not needed
- duplicate the locking logic, once for path that needs pre/post TS,
once for path that doesn't need pre/post TS

Since the fallback path is expected to be rarely used, lock contention
and retries are expected to be extremely rare, and chrony is resilient
to samples with high uncertainty window, I chose to prioritize
cleanliness of the code and keep the pre/post TS reading in the caller
where it is actually needed. That said, I do not feel strongly about this
and would be open to moving pre/post inside the mutex if I were sending
another revision.

>
> That would make the "fallback" case too precise, right?

Moving pre/post inside the mutex makes no measurable difference
because the lock is rarely contended. There is 4 orders of
magnitude difference in RMS offset between precise and fallback paths.

>
> After noticing the obvious, I just checked what our new AI review
> overlord "sashiko" has to say about this:
>
> "Does wrapping the timestamp read outside the locks and retry loops skew the
> measurements?
>
> In the fallback path, ptp_read_system_prets() is captured before calling
> gve_clock_nic_ts_read(), and postts() is captured after it returns. However,
> gve_clock_nic_ts_read() acquires ptp->nic_ts_read_lock and calls
> gve_ptp_read_timestamp(), which can sleep using fsleep() during retries. This
> means the prets/postts window could include mutex wait times and sleep
> durations."
>
> It _IS_ actually on spot. So why the heck did this garbage get merged
> without addressing this obvious fail?

- The fallback path is expected to be rarely used since it requires
changing the default configuration
- The lock is rarely contended as there are only 2 callers, one of
which only executes once every 250ms,
and retries would only occur if a VM exceeded their rate limits
which have been tuned not to occur
under normal chrony calling patterns.

>
> What "sashiko" actually saw aside of that, which I didn't see myself when
> looking at this, is:
>
> "Similarly, in gve_ptp_read_timestamp(), the get_cycles() bounds are taken
> outside the gve_adminq_report_nic_ts() call, which acquires the shared
> priv->adminq_lock. If there is lock contention, the cycle bounds could become
> extremely wide."
>
> Oh well.

get_cycles() is only used for sanity checking the values from
firmware, not in any time conversion.

>
> So what you really want is:
>
> do {
> guard(mutex)(&ptp->nic_ts_read_lock);
> guard(mutex)(&gpe_priv->adminq_lock);
>
> ptp_read_system_prets(sts);
> err = read_nic_timestamp(...) {
> ...
> return gve_adminq_execute_cmd_locked(....);
> }
> ptp_read_system_postts(sts);
> } while (err && !timeout(...));
>
> return err;
>
> Or just rely on gpe_priv->adminq_lock in the first place. No?

adminq_lock is private to the adminq layer and protects adminq shared data
structures. nic_ts_read_lock protects the global DMA buffer that receives
the timestamps. Since retries are a response to rate limiting, we must wait
between retries, so we do not want to hold the adminq lock across
these retries as it would block other non rate-limited commands.

>
> But what's worse and escaped the AI scrutiny is:
>
> Your attempt to convert these 'get_cycles()' time stamps after the
> fact and outside of the timekeeping core protection is fundamentally
> wrong.
>
> get_device_system_crosststamp() does
>
> do {
> seq = read_seqcount_begin(&tk_core.seq);
> take_timestamps(...);
> convert_timestamps(...)
> } while (read_seqcount_retry(&tk_core.seq, seq));
>
> which covers the whole sequence of taking the snapshots from the
> hardware and the conversion.
>
> Your abuse of this is completely out of spec. Just because it did not
> blow up in your face yet does not make it in any way correct. You do not
> even get bonus points for creative abuse.

We use ktime_get_snapshot() to capture a historical snapshot before sampling
the TSC, and we pass it to get_device_system_crosststamp() which
uses the snapshot to interpolate the sample if it occurs before the current
timekeeping interval. Perfectly within spec.

>
>
> Now let me look at the "firmware" provided timestamps.
>
> That's the poor mans emulation of actual hardware timestamps aka PTM.
>
> Why do you need all this pre_nic/post_nic muck there? Fix your firmware
> to actually honor PTM and not provide some made up version of it.

This has to run on platforms that don't support PTM.

>
> Also the whole debug mechanism of comparing the time stamps of
> get_cycles() with the timestamps provided by firmware tells me that your
> firmware is a hacked together piece of garbage.
>
> If you want to use "hardware" timestamps then populate the
> ptp.info.getcrosststamp() callback.
>
> If your firmware really can't do PTM and only provides pre and post
> values, then you simply can take the midpoint in your getcrosststamp()
> callback:
>
> do {
> guard(mutex)(&ptp->nic_ts_read_lock);
> err = read_nic_timestamp(...);
> system_counterval->cycles = (nic_pre + nic_post) / 2;
> } while (err && !timeout(...));

We considered returning the midpoint of pre/post up through
getcrosststamp(), but there was
no way to communicate the uncertainty interval which is important
for some of our use cases.
https://lore.kernel.org/all/20260515164033.6403-1-akiyano@xxxxxxxxxx/
fixes that issue, so we can use that when it lands.

>
> If you do not even trust your firmware then you still can validate it
> easily without creating horrible hacks:
>
> do {
> guard(mutex)(...);
>
> ptp_read_system_prets(&ctx->sts);
> err = read_nic_timestamp(...);
> system_counterval->cycles = (nic_pre + nic_post) / 2;
> ptp_read_system_postts(&ctx->sts);
> } while (err && !timeout(...));
>
> and then check the ctx->sts timestamps against the results provided by
> get_device_system_crosststamp() after conversion in your
> ptp.info.getcrosststamp() callback:
>
> err = get_device_system_crosststamp(..., &ctx);
> if (err)
> return err;
> if (ctx.sts.pre_ts > xtstamp.sys... || ctx.sts.post_ts < xtstamp.sys...)
> return -EBROKENFIRMWARE;
>
> no?

get_device_system_crosststamp() has nontrivial logic that operates on the system
timestamps, so I am not confident that a bad TSC value would be detected
if we ran it through get_device_system_crosststamp() first. The check
is simplest
and strongest when performed against the raw TSC values received from the
device, before passing them to any kernel API. But I hear your point
about mistrusting
the firmware enough to need this check.

>
> Jakub, please back out this hackery before it makes a precedence for others
> to copy from.
>
> Thanks,
>
> tglx