Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX

From: Jim Mattson

Date: Fri Jun 05 2026 - 01:35:43 EST


On Thu, Jun 4, 2026 at 7:56 PM Chao Gao <chao.gao@xxxxxxxxx> wrote:
>
> On Thu, Jun 04, 2026 at 02:59:45PM -0700, Jim Mattson wrote:
> >?
> >
> >On Thu, Jun 4, 2026 at 12:58 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Jun 03, 2026, Jim Mattson wrote:
> >> > On Thu, May 14, 2026 at 11:35 PM Chao Gao <chao.gao@xxxxxxxxx> wrote:
> >> > >
> >> > > >> EMR158. VMX-Preemption Timer May Expire Earlier With Certain Large Timer Values
> >> > > >
> >> > > >I assume the same erratum applies to previous generations as well?
> >> > >
> >> > > Yes.
> >> >
> >> > This test still fails on our SKX, CLX, and ICX systems.
> >> >
> >> > Sean,
> >> >
> >> > Were you thinking of enforcing a cap on delta_tsc in vmx_set_hv_timer()?
> >>
> >> Heh, to be honest, I wasn't thinking of a whole lot of nothing. Falling back to
> >> hrtimers does seem like the easiest solution.
> >
> >I think vmx_set_hv_timer() should return -EINVAL for values impacted
> >by this erratum. However, the only documented issue is for EMR, and we
> >have not observed the problem on EMR. That's unsettling.
>
> Could you clarify what tests you ran?

Just tools/testing/selftests/kvm/x86/apic_bus_clock_test.

It fails on SKX, CLX, and ICX. It passes on SPR, EMR, and GNR.

> I am using the reproducer from Yuan:
> https://lore.kernel.org/kvm/20240708055559.rl4w5xfhj3uru6j2@yy-desk-7060/
>
> I write -1 to the VMX preemption timer, do VM-Enter, and have the guest
> execute VMCALL to force a VM-Exit. On VM-Exit, we read back the preemption
> timer. The delta should be very small; otherwise, the platform likely has the
> same issue.
>
> I tested several platforms, including EMR. The results are consistent with the
> erratum: I observed premature VMX preemption-timer VM-Exits with large timer
> values, and values below the documented limit did not trigger premature
> VM-Exits in my testing.
>
> >
> >Chao:
> >
> >1) Should we just assume that all Intel CPUs are affected?
>
> I think that is reasonable unless we have explicit evidence to exclude specific
> parts.
>
> >
> >2) Is there any compelling reason not to simplify the limit to 2^25?
>
> We can use 2^25 as a conservative bound, but it is much lower than necessary.
> The current bound comes from theoretical analysis and was validated on multiple
> platforms.

Yes, but how often do guests program their local APIC timer to fire
more than 2^(25 + IA32_VMX_MISC[4:0]) cycles in the future?

> >
> >3) Is it just coincidence that 25 + IA32_VMX_MISC[4:0] (on EMR) == 32,
> >or should the limit be calculated as 32 - IA32_VMX_MISC[4:0]?
>
> My understanding is that hardware scales the preemption-timer value and
> converts it to a 32-bit core crystal clock counter, rather than directly
> using a 32-bit TSC delta. IA32_VMX_MISC[4:0] likely participates in that
> calculation.

That doesn't definitively answer my question. Let me try to rephrase it.

With respect to EMR, you wrote previously, "A mitigation for this
erratum is for software to program the VMX preemption timer for values
below 2^25 * CPUID.15H:EBX[31:0] / CPUID.15H:EAX[31:0]."

My question is whether the exponent, 25, is a fixed value for all
CPUs, regardless of their IA32_VMX_MISC[4:0]. It sounds like you are
saying that the exponent may depend on IA32_VMX_MISC[4:0].