Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()

From: Dongli Zhang

Date: Tue May 12 2026 - 19:24:28 EST




On 5/11/26 10:21 PM, David Woodhouse wrote:
> On Mon, 2026-05-11 at 17:16 -0700, Dongli Zhang wrote:
>>
>>
>> On 5/9/26 5:22 AM, David Woodhouse wrote:
>>> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>>>> The pvclock_update_vm_gtod_copy() function always unconditionally updates
>>>> ka->master_kernel_ns and ka->master_cycle_now whenever a
>>>> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
>>>> increases the risk of kvm-clock drift.
>>>>
>>>> If pvclock_update_vm_gtod_copy() is not called from
>>>> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
>>>> workflow. The argument 'forced' is introduced to tell where it is from.
>>>>
>>>> Otherwise, we avoid updating the masterclock if it is already
>>>> active and will remain active. In such cases, updating the masterclock
>>>> data is not beneficial and can instead lead to kvm-clock drift.
>>>>
>>>> As a result, this patch minimizes the chance of unnecessary masterclock
>>>> data updates to avoid kvm-clock drift.
>>>>
>>>> Cc: David Woodhouse <dwmw@xxxxxxxxxxxx>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
>>>
>>> Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
>>> set the 'force' argument is the one in kvm_update_masterclock(), so we
>>> are asserting that kvm_update_masterclock() never needs to *change* the
>>> masterclock origin point, if it was already set?
>>>
>>> The gtod notifier callback in pvclock_gtod_update_fn() also ends up
>>> setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
>>> timekeeping update (which could potentially be from a clocksource
>>> change). It also hypothetically possible that the clocksource changes
>>> from TSC → HPET → TSC, switching back to TSC again before the
>>> masterclock update ever gets to run. Or maybe a suspend/resume?
>>>
>>> Are you *sure* that the optimisation is always valid...?
>>
>> Thank you very much!
>>
>> I didn't validate the scenario you mentioned. I missed that scenario because I
>> assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
>> the host clocksource, although I sometimes forgot to add "clocksource=tsc
>> tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).
>
> I'd love to assume that, but we do still have to cater for systems
> without it (or where it goes wrong). And where userspace sets up vCPUs
> with different frequencies... although I'd quite like to ban that too
> :)
>
>> I didn't follow up on this patch because I noticed another issue. I found that
>> the tsc_timestamp in the PVTI can become a very large number if we simply reboot
>> the guest VM. This happens because the patch stops updating the masterclock data
>> when the masterclock is already active and remains active.
>>
>> For example:
>>
>> current guest TSC: 122763682
>> PVTI->tsc_timestamp = 18446744073656540006
>> PVTI->system_time=196515164269
>>
>> Although I could not reproduce any bug, that still made me feel uncomfortable.
>
> That's just negative (-53011610). Theoretically it's OK; it's just a
> consequence of using a reference point in the past. Probably just
> *asking* for guest bugs though, so best avoided.
>
> I'd like to understand how it got like that though. Did your 'reboot'
> reset the guest TSC to zero but *not* its kvmclock?
>

Taking mainline QEMU as an example, I simply trigger a reboot from the guest VM.

A normal reboot from the guest VM resets only MSR_IA32_TSC, but not
KVM_SET_CLOCK. Before the reboot, the masterclock is enabled. During
kvm_synchronize_tsc() for each vCPU, the masterclock remains enabled as well.
Therefore, pvclock_update_vm_gtod_copy() does not update the masterclock values
because of this patchset.

As a result, tsc_timestamp eventually becomes negative.

For example, I apply the patch below and this patchset on top of the latest
Linux kernel (as KVM).

https://lore.kernel.org/kvm/20260115202256.119820-1-dongli.zhang@xxxxxxxxxx

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e88310f5979..d9f93cdb46ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3158,8 +3158,13 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm,
bool forced)
if (forced || !(use_master_clock && ka->use_master_clock)) {
ka->master_kernel_ns = master_kernel_ns;
ka->master_cycle_now = master_cycle_now;
+ pr_alert("debug: pvclock_update_vm_gtod_copy() update (%llu,
%llu)\n",
+ ka->master_kernel_ns, ka->master_cycle_now);
}

+ pr_alert("debug: pvclock_update_vm_gtod_copy() old=%d, new=%d\n",
+ ka->use_master_clock, use_master_clock);
+
ka->use_master_clock = use_master_clock;

if (ka->use_master_clock)
@@ -3416,6 +3421,9 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
vcpu->last_guest_tsc = tsc_timestamp;

+ pr_alert("debug: kvm_guest_time_update() vcpu=%u tsc=%llu time=%llu\n",
+ v->vcpu_id, hv_clock.tsc_timestamp, hv_clock.system_time);
+
/* If the host uses TSC clocksource, then it is stable */
hv_clock.flags = 0;
if (use_master_clock)
@@ -4108,6 +4116,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
msr_data *msr_info)
vcpu->arch.msr_ia32_power_ctl = data;
break;
case MSR_IA32_TSC:
+ pr_alert("debug: set MSR_IA32_TSC vcpu=%u, host=%d, data=%llu\n",
+ vcpu->vcpu_id, msr_info->host_initiated, data);
if (msr_info->host_initiated) {
kvm_synchronize_tsc(vcpu, &data);
} else if (!vcpu->arch.guest_tsc_protected) {


Below are tsc_timestamp and system_time before reboot.

[ 154.159733] kvm: debug: kvm_guest_time_update() vcpu=3 tsc=97471036 time=4759583
[ 154.160101] kvm: debug: kvm_guest_time_update() vcpu=2 tsc=97471036 time=4759583
[ 154.160658] kvm: debug: kvm_guest_time_update() vcpu=1 tsc=97471036 time=4759583
[ 154.161292] kvm: debug: kvm_guest_time_update() vcpu=0 tsc=97471036 time=4759583

After reboot, there will be TSC synchronization.

[ 154.217551] kvm: debug: set MSR_IA32_TSC vcpu=0, host=1, data=1
[ 154.217980] kvm: debug: set MSR_IA32_TSC vcpu=1, host=1, data=1
[ 154.218384] kvm: debug: set MSR_IA32_TSC vcpu=2, host=1, data=1
[ 154.218792] kvm: debug: set MSR_IA32_TSC vcpu=3, host=1, data=1

The masterclock remains enabled. Therefore, pvclock_update_vm_gtod_copy() does
not update the masterclock values because of this patchset.

tsc_timestamp becomes negative, as TSC is reset to start from data=1.

[ 154.219283] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[ 154.219671] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[ 154.504499] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[ 154.504898] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[ 154.504898] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[ 154.504898] kvm: debug: kvm_guest_time_update() vcpu=3
tsc=18446743945549702059 time=4759583
[ 154.505240] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[ 154.505240] kvm: debug: kvm_guest_time_update() vcpu=1
tsc=18446743945549702059 time=4759583
[ 154.505241] kvm: debug: kvm_guest_time_update() vcpu=2
tsc=18446743945549702059 time=4759583


Usually, QEMU users (i.e., libvirt) reboot the guest VM from outside, that is:

1. Send shutdown to guest VM.
2. (qemu) system_reset
3. (qemu) cont

The cont command triggers KVM_SET_CLOCK, so we do not get a negative tsc_timestamp.

That is why I did not follow up on this patch set further.

Thank you very much!

Dongli Zhang