Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd

From: Yan Zhao
Date: Tue Jul 01 2025 - 01:27:24 EST


On Mon, Jun 30, 2025 at 07:14:07AM -0700, Vishal Annapurve wrote:
> On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> >
> > On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> > > On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@xxxxxxxxx> wrote:
> > > >
> > > > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > > >> Hello,
> > > > >>
> > > > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > > >> upstream calls to provide 1G page support for guest_memfd by taking
> > > > >> pages from HugeTLB.
> > > > >>
> > > > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > > > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > > > >>
> > > > >> For ease of testing, this series is also available, stitched together,
> > > > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > > > >
> > > > > Just to record a found issue -- not one that must be fixed.
> > > > >
> > > > > In TDX, the initial memory region is added as private memory during TD's build
> > > > > time, with its initial content copied from source pages in shared memory.
> > > > > The copy operation requires simultaneous access to both shared source memory
> > > > > and private target memory.
> > > > >
> > > > > Therefore, userspace cannot store the initial content in shared memory at the
> > > > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > > > page tables and then check for any extra refcount held for the shared PFN before
> > > > > converting it to private.
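
(To make the conflict concrete, below is roughly the userspace sequence
that cannot work when the initial payload is staged at the mmap-ed VA of
the very range being converted. The ioctl and struct names follow the
in-place conversion RFC and the TDX series; the details are illustrative
only, not a settled ABI.)

  /* Userspace sketch, not a complete program. */
  void *src = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, offset);
  memcpy(src, payload, len);        /* stage the payload in the shared view */

  /*
   * The target range must be private before TDH.PAGE.ADD, but the
   * conversion first unmaps the PFNs from the shared page tables and
   * refuses to proceed if extra refcounts are held on them...
   */
  ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &range);   /* range = offset + len */

  /*
   * ...so by the time KVM_TDX_INIT_MEM_REGION would copy from the shared
   * source, 'src' is no longer backed by a shared mapping and cannot be
   * used as source_addr:
   */
  struct kvm_tdx_init_mem_region region = {
          .source_addr = (__u64)(uintptr_t)src,       /* <-- the problem */
          .gpa         = gpa,
          .nr_pages    = len / 4096,                  /* assuming 4K pages */
  };
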
> > > >
> > > > I have an idea.
> > > >
> > > > If I understand correctly, KVM_GMEM_CONVERT_PRIVATE for in-place
> > > > conversion unmaps the PFN from the shared page tables while keeping the
> > > > content of the page unchanged, right?
> > >
> > > That's correct.
> > >
> > > >
> > > > So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the
> > > > private memory for the non-CoCo case: userspace first mmap()s the range,
> > > > ensures it's shared, writes the initial content to it, and then converts
> > > > it to private with KVM_GMEM_CONVERT_PRIVATE.
> > >
> > > I think by non-CoCo VMs that care about private memory you mean pKVM.
> > > Yes, initial memory regions can start as shared which userspace can
> > > populate and then convert the ranges to private.
> > >
> > > >
> > > > For the CoCo case, like TDX, it can hook into KVM_GMEM_CONVERT_PRIVATE
> > > > if it wants the private memory to be initialized with the initial
> > > > content, and just do an in-place TDH.PAGE.ADD in the hook.
> > >
> > > I think this scheme will be cleaner:
> > > 1) Userspace marks the guest_memfd ranges corresponding to initial
> > > payload as shared.
> > > 2) Userspace mmaps and populates the ranges.
> > > 3) Userspace converts those guest_memfd ranges to private.
> > > 4) For both SNP and TDX, userspace continues to invoke corresponding
> > > initial payload preparation operations via existing KVM ioctls e.g.
> > > KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
> > > - SNP/TDX KVM logic fetches the right pfns for the target gfns
> > > using the normal paths supported by KVM and passes those pfns directly
> > > to the right trusted module to initialize the "encrypted" memory
> > > contents.
> > > - Avoiding any GUP or memcpy from source addresses.
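
For TDX, the userspace side of that scheme would look roughly like the
sketch below. The conversion ioctls follow the in-place conversion RFC and
the launch ioctl follows the TDX series; the exact struct layouts, the fd
the ioctl is issued on, and the wrapper name are assumptions here, not a
settled ABI.

  /* 1) + 2) make the range shared, map it, and stage the initial payload */
  ioctl(gmem_fd, KVM_GMEM_CONVERT_SHARED, &range);    /* if not already shared */
  void *va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, offset);
  memcpy(va, payload, len);

  /* 3) convert the populated range to private; contents are preserved in place */
  ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &range);

  /*
   * 4) reuse the existing launch ioctls (KVM_SEV_SNP_LAUNCH_UPDATE for SNP,
   *    KVM_TDX_INIT_MEM_REGION via KVM_MEMORY_ENCRYPT_OP for TDX).  Under
   *    this scheme KVM resolves the PFNs for the target GFNs from
   *    guest_memfd itself and hands them to the trusted module for the
   *    in-place TDH.PAGE.ADD, so no GUP or memcpy from a separate source
   *    buffer is involved.
   */
  struct kvm_tdx_init_mem_region region = {
          .gpa      = gpa,
          .nr_pages = len / 4096,   /* assuming 4K pages; source_addr handling TBD */
  };
  tdx_launch_ioctl(KVM_TDX_INIT_MEM_REGION, &region); /* hypothetical wrapper
                                                         around KVM_MEMORY_ENCRYPT_OP */
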
> > One caveat:
> >
> > when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> > Then kvm_gmem_prepare_folio() is further invoked to zero the folio.
>
> Given that confidential VMs have their own way of initializing private
> memory, I think zeroing makes sense for only shared memory ranges.
> i.e. something like below:
> 1) Don't zero at allocation time.
> 2) If faulting in a shared page and it's not uptodate, then zero the
> page and mark the page as uptodate.
> 3) Clear uptodate flag on private to shared conversion.
> 4) For faults on private ranges, don't zero the memory.
>
> There might be some other considerations here, e.g. pKVM needs a
> non-destructive conversion operation, which might need a way to enable
> zeroing at allocation time only.
>
> On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
> pages on future platforms [1].
Yes, TDX does not need to clear pages on private page allocation.
But the current kvm_gmem_prepare_folio() clears private pages in the common
path used by both TDX and SEV-SNP.

I just wanted to point out that this is an obstacle that needs to be removed
in order to implement the proposed approach.
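
For reference, the shape I have in mind is something like the sketch below:
keep the existing clear_highpage() loop, but drive it from the shared-fault
path via the uptodate flag instead of from the common private-page
preparation path. The hook points into the conversion series are
assumptions; only the clear_highpage()/folio uptodate helpers are existing
code.

  /* guest_memfd sketch, not actual patches */
  static void kvm_gmem_zero_folio(struct folio *folio)
  {
          unsigned long i;

          for (i = 0; i < folio_nr_pages(folio); i++)
                  clear_highpage(folio_page(folio, i));
  }

  /* Shared fault path: zero on first use only. */
  if (!folio_test_uptodate(folio)) {
          kvm_gmem_zero_folio(folio);
          folio_mark_uptodate(folio);
  }

  /*
   * Private fault / preparation path (what kvm_gmem_prepare_folio() covers
   * today): skip the clearing; TDX and SNP initialize private contents via
   * their own mechanisms (TDH.PAGE.ADD / SNP launch update), and per [1]
   * TDX does not need the host to clear pages.
   */

  /*
   * Private -> shared conversion: clear uptodate so the next shared fault
   * re-zeroes the range.  pKVM's non-destructive conversion would need to
   * opt out of this, e.g. by zeroing at allocation time instead.
   */
  folio_clear_uptodate(folio);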


> [1] https://lore.kernel.org/lkml/6de76911-5007-4170-bf74-e1d045c68465@xxxxxxxxx/
>
> >
> > > i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
> > So, at this point, the pages should not contain the original content?
> >
>
> Pages should contain the original content. Michael is already
> experimenting with similar logic [2] for SNP.
>
> [2] https://lore.kernel.org/lkml/20250613005400.3694904-6-michael.roth@xxxxxxx/