Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
From: Ackerley Tng
Date: Wed Jul 02 2025 - 14:43:38 EST
Yan Zhao <yan.y.zhao@xxxxxxxxx> writes:
> On Tue, Jul 01, 2025 at 03:09:01PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@xxxxxxxxx> writes:
>>
>> > On Mon, Jun 30, 2025 at 10:22:26PM -0700, Vishal Annapurve wrote:
>> >> On Mon, Jun 30, 2025 at 10:04 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
>> >> >
>> >> > On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
>> >> > > On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
>> >> > > > > So for this we can do something similar. Have the arch/x86 side of TDX grow a
>> >> > > > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of SEAM
>> >> > > > > mode, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
>> >> > > > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
>> >> > > > > system die. Zap/cleanup paths return success in the buggy shutdown case.
>> >> > > > >
>> >> > > >
>> >> > > > Do you mean that on unmap/split failure:
>> >> > >
>> >> > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
>> >> > My thinking is to set HWPoison on private pages whenever KVM_BUG_ON() is hit in
>> >> > TDX, i.e., when the page is still mapped in the S-EPT but the TD is bugged and
>> >> > about to be torn down.
>> >> >
>> >> > So, it could be due to KVM or TDX module bugs, which retries can't help with.
>> >> >
>> >> > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
>> >> > > like the ones that can be fixed with retries, then I think HWPoison is not a
>> >> > > good option though.
>> >> > >
>> >> > > > there is a way to make 100%
>> >> > > > sure all memory becomes re-usable by the rest of the host, using
>> >> > > > tdx_buggy_shutdown(), wbinvd, etc?
>> >> >
>> >> > Not sure about this approach. When the TDX module is buggy and the page is still
>> >> > accessible to the guest as a private page, even with the no-more-SEAMCALLs flag,
>> >> > is it safe enough for guest_memfd/hugetlb to re-assign the page, allowing
>> >> > simultaneous access as shared memory alongside potential private access from the
>> >> > TD or the TDX module?
>> >>
>> >> If no more SEAMCALLs are allowed and all CPUs are made to exit SEAM
>> >> mode, then how can there be potential private access from the TD or
>> >> the TDX module?
>> > Not sure. As Kirill said "TDX module has creative ways to corrupt it"
>> > https://lore.kernel.org/all/zlxgzuoqwrbuf54wfqycnuxzxz2yduqtsjinr5uq4ss7iuk2rt@qaaolzwsy6ki/.
>> >
>> > Or, could TDX just set a page flag, like what is done for Xen:
>> >
>> > /* XEN */
>> > /* Pinned in Xen as a read-only pagetable page. */
>> > PG_pinned = PG_owner_priv_1,
>> >
>> > e.g.
>> > PG_tdx_firmware_access = PG_owner_priv_1,
>> >
>> > Then, guest_memfd checks this flag on every zap and replaces it with
>> > PG_hwpoison on behalf of TDX?
>>
>> I think this question probably arose because of a misunderstanding I
>> might have caused. I meant to set the HWpoison flag from the kernel, not
>> from within the TDX module. Please see [1].
> I understood.
> But as Rick pointed out in
> https://lore.kernel.org/all/04d3e455d07042a0ab8e244e6462d9011c914581.camel@xxxxxxxxx/,
> manually setting the poison flag in KVM's TDX code (in the host kernel) seems risky.
>
I will address this in a reply to Rick's email; there's more context
there that I'd like to clarify.
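
In the meantime, for reference, what I meant by setting HWpoison from the
kernel is roughly the sketch below. This is not code from this series: the
function name and call site are made up for illustration, only
SetPageHWPoison()/PageHWPoison() are existing helpers, and whether doing
this from KVM's TDX code is acceptable is exactly what still needs to be
settled in the other sub-thread.

/*
 * Sketch only, assuming CONFIG_MEMORY_FAILURE so the HWPoison page flag
 * helpers actually do something. tdx_mark_page_unreclaimable() is a
 * hypothetical name for illustration.
 */
#include <linux/page-flags.h>	/* SetPageHWPoison(), PageHWPoison() */

static void tdx_mark_page_unreclaimable(struct page *page)
{
	/*
	 * Mark the page so nothing on the host reuses it after KVM
	 * failed to reclaim it from the S-EPT (e.g. after a KVM_BUG_ON()).
	 */
	if (!PageHWPoison(page))
		SetPageHWPoison(page);
}

The open questions (per-folio vs per-page marking, and how this interacts
with vmemmap-optimized folios) apply to this sketch as well.
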
>> In addition, if the TDX module (now referring specifically to the TDX
>> module and not the kernel) sets page flags, that won't work with
> Marking at the per-folio level seems acceptable to me.
>
I will address this in a reply to Rick's email; there's more context there
that I'd like to clarify.
>> vmemmap-optimized folios. Setting a page flag on a vmemmap-optimized
>> folio would effectively set the flag on several pages at once, since
>> the tail struct pages are shared after the optimization.
> BTW, I have a concern regarding the overhead of vmemmap optimization.
>
> In my system,
> with hugetlb_free_vmemmap=false, the TD boot time is around 30s;
> with hugetlb_free_vmemmap=true, the TD boot time is around 1m20s;
>
>
I'm aware of this; I was investigating something similar internally. In
your system and test, were you working with 1G pages or 2M pages?
>> [1] https://lore.kernel.org/all/diqzplej4llh.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
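
P.S. To check that I'm reading the tdx_buggy_shutdown() suggestion quoted
above correctly, this is roughly the shape I have in mind. It is only a
sketch restating Rick's description: every name below is made up (including
tdx_exit_seam_ipi()), and the actual SEAM-exit mechanism is hand-waved.

#include <linux/smp.h>	/* on_each_cpu() */
#include <asm/smp.h>	/* wbinvd_on_all_cpus() */

static bool tdx_no_more_seamcalls;

static void tdx_exit_seam_ipi(void *unused)
{
	/* Kick this CPU out of SEAM mode; the exact mechanism is TBD. */
}

void tdx_buggy_shutdown(void)
{
	/* Refuse any SEAMCALL issued after this point. */
	WRITE_ONCE(tdx_no_more_seamcalls, true);

	/* Kick every CPU out of SEAM mode, then flush caches. */
	on_each_cpu(tdx_exit_seam_ipi, NULL, 1);
	wbinvd_on_all_cpus();
}

SEAMCALL wrappers would then check tdx_no_more_seamcalls first and return a
TDX_BUGGY_SHUTDOWN-style error, with zap/cleanup paths treating that error
as success, as described in the quoted text.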