Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls

From: Alexey Kardashevskiy
Date: Tue Jun 24 2025 - 04:24:40 EST


On 21/5/25 00:11, Vishal Annapurve wrote:
On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:

Hi Vishal,

On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:

On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:

Hi Ackerley,

On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote:

The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
and private respectively.
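For illustration, userspace would drive a conversion roughly like
this (the struct layout below is a sketch of this series' intent, not
a final ABI; the ioctl definitions come from the series' headers):

  #include <err.h>
  #include <sys/ioctl.h>
  #include <linux/types.h>

  /* Sketch only: field names are illustrative, not the final ABI. */
  struct kvm_gmem_convert {
          __u64 offset;           /* byte offset into the guest_memfd */
          __u64 size;             /* length of the range to convert */
          __u64 error_offset;     /* out: where conversion failed */
  };

  static void convert_private(int gmem_fd, __u64 offset, __u64 size)
  {
          struct kvm_gmem_convert param = {
                  .offset = offset,
                  .size = size,
          };

          /* On failure (e.g. unexpected refcounts on the range),
           * userspace can release its users of the range and retry. */
          if (ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &param) < 0)
                  err(1, "convert failed at 0x%llx",
                      (unsigned long long)param.error_offset);
  }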

I have a high level question about this particular patch and this
approach for conversion: why do we need IOCTLs to manage conversion
between private and shared?

In the presentations I gave at LPC [1, 2], and in my latest patch
series that performs in-place conversion [3] and the associated (by
now outdated) state diagram [4], I didn't see the need to have a
userspace-facing interface to manage that. KVM has all the information
it needs to handle conversions, which are triggered by the guest. To
me this seems like it adds additional complexity, as well as a
user-facing interface that we would need to maintain.

There are various ways we could handle conversion without explicit
interference from userspace. What I had in mind is the following (as
an example, details can vary according to VM type). I will use the
case of conversion from shared to private because that is the more
complicated (interesting) case:

- Guest issues a hypercall to request that a shared folio become private.

- The hypervisor receives the call, and passes it to KVM.

- KVM unmaps the folio from the guest stage-2 (EPT I think in x86
parlance), and unmaps it from the host. The host, however, could still
have references (e.g., GUP).

- KVM exits to the host (hypervisor call exit), with the information
that the folio has been unshared from it.

- A well-behaved host would now get rid of all of its references
(e.g., release GUPs), perform a VCPU run, and the guest continues
running as normal. I expect this to be the common case.

But to handle the more interesting situation, let's say that the host
doesn't do it immediately, and for some reason it holds on to some
references to that folio.

- Even if that's the case, the guest can still run *. If the guest
tries to access the folio, KVM detects that access when it tries to
fault it into the guest, sees that the host still has references to
that folio, and exits back to the host with a memory fault exit. At
this point, the VCPU that has tried to fault in that particular folio
cannot continue running as long as it cannot fault in that folio.
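In VMM terms, the run loop would handle that exit roughly like this
(a sketch against the existing KVM_EXIT_MEMORY_FAULT exit;
drop_local_references() and handle_other_exit() are placeholders for
VMM-specific logic):

  #include <err.h>
  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* run is the mmap'ed struct kvm_run of vcpu_fd. */
  static void run_vcpu(int vcpu_fd, struct kvm_run *run)
  {
          for (;;) {
                  if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno != EINTR)
                          err(1, "KVM_RUN");

                  switch (run->exit_reason) {
                  case KVM_EXIT_MEMORY_FAULT:
                          /* The guest touched a folio we still hold
                           * references to: release them for
                           * [gpa, gpa + size) and re-enter. */
                          drop_local_references(run->memory_fault.gpa,
                                                run->memory_fault.size);
                          continue;
                  default:
                          handle_other_exit(run);
                  }
          }
  }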

Are you talking about the following scheme?
1) guest_memfd checks shareability on each get_pfn and, if there is a
mismatch, exits to the host.

I think we are not really on the same page here (no pun intended :) ).
I'll try to answer your questions anyway...

Which get_pfn? Are you referring to get_pfn when faulting the page
into the guest or into the host?

I am referring to guest fault handling in KVM.


2) host userspace has to guess whether it's a pending refcount or an
actual mismatch.

No need to guess. VCPU run will let it know exactly why it's exiting.

3) guest_memfd will maintain a third state,
"pending_private_conversion" or equivalent, which will transition to
private upon the last refcount drop of each page.
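Something like the following, with hypothetical naming:

  /* Hypothetical per-page states under scheme 3 above. */
  enum gmem_shareability {
          GMEM_SHARED,
          GMEM_PENDING_PRIVATE,   /* waits for the last refcount drop */
          GMEM_PRIVATE,
  };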

If conversion is triggered by userspace (in the case of pKVM, it would
be triggered from within KVM (?)):

Why would conversion be triggered by userspace? As far as I know, it's
the guest that triggers the conversion.

* Conversion will just fail if there are extra refcounts, and userspace
can try to get rid of the extra refcounts on the range while it still
has enough context, without hitting any ambiguity with a memory fault
exit.
* guest_memfd will not have to deal with the extra state from 3 above,
and overall guest_memfd conversion handling becomes relatively simpler.

That's not really related. The extra state isn't necessary anymore,
since we agreed in the previous discussion that we would retry instead.

Who is *we* here? Which entity will retry conversion?


Note that for the x86 CoCo cases, memory conversion is already
triggered by userspace using a KVM ioctl; this series proposes using a
guest_memfd ioctl to do the same.
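For reference, that existing flow is KVM_SET_MEMORY_ATTRIBUTES on the
VM fd, e.g.:

  #include <err.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Existing x86 CoCo flow: flip a GPA range to private via the VM
   * fd; clearing KVM_MEMORY_ATTRIBUTE_PRIVATE makes it shared again. */
  static void set_private(int vm_fd, __u64 gpa, __u64 size)
  {
          struct kvm_memory_attributes attrs = {
                  .address = gpa,
                  .size = size,
                  .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
          };

          if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) < 0)
                  err(1, "KVM_SET_MEMORY_ATTRIBUTES");
  }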

The reason conversion is already triggered by userspace via a KVM
ioctl in the x86 CoCo cases is that it has to be: shared memory and
private memory are two separate pages, and userspace needs to manage
that. Sharing memory in place removes the need for that.

Userspace still needs to clean up memory usage before conversion can
succeed, e.g., removing IOMMU mappings for a shared-to-private
conversion. I would think that memory conversion should not succeed
before all existing users let go of the guest_memfd pages for the
range being converted.
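For a VFIO type1 container, for example, that cleanup would be a
plain DMA unmap before requesting the conversion:

  #include <err.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Drop the IOMMU mapping (and its page pins) for the range first. */
  static void unmap_before_convert(int container_fd, __u64 iova,
                                   __u64 size)
  {
          struct vfio_iommu_type1_dma_unmap unmap = {
                  .argsz = sizeof(unmap),
                  .iova = iova,
                  .size = size,
          };

          if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap) < 0)
                  err(1, "VFIO_IOMMU_UNMAP_DMA");
  }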


Ah, about that. Actually, IOMMU mappings can remain the same in a case
like my TSM+VFIO RFC, based on Fuad's older patches, here in
particular:

https://lore.kernel.org/r/20250218111017.491719-13-aik@xxxxxxx

which works nicely - map it once and forget about it.

Now I am rebasing my RFC on top of this patchset, and it fails in
kvm_gmem_has_safe_refcount() because the IOMMU holds references to all
these folios in my RFC.
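(My reading of that check, paraphrased rather than quoted from the
patch: a folio is only "safe" to convert if nothing beyond the
expected users holds it, conceptually

  /* Paraphrase of the 04/51 check, not the actual code: any
   * reference beyond the expected ones - e.g. an IOMMU mapping -
   * makes the folio "unsafe" to convert. */
  static bool gmem_refcount_is_safe(struct folio *folio, int expected)
  {
          return folio_ref_count(folio) <= expected;
  }

and IOMMU mappings push the count past "expected".)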

So what is the expected sequence here? Does userspace unmap a DMA page
and map it back right away, all from userspace? The end result would
be exactly the same, which seems useless. And the IOMMU TLB is going
to be flushed on a page conversion anyway (the RMPUPDATE instruction
does that). All this is about AMD's x86, though.

For now (and for fun^wexperiment) I disabled
kvm_gmem_has_safe_refcount() (which 04/51 adds) and it seems to have
no effect until the memfd is closed - then folios_put_refs() crashes
in list_del(&folio->lru). I wonder now what direction to take from
here.

My TSM+VFIO RFC uses the hardware's ability to DMA to/from a CoCo VM
(== an AMD SEV-SNP VM); both private and shared DMA at the same time
is going to be allowed. Thanks,



In x86 CoCo use cases, userspace can also decide not to allow
conversion in scenarios where ranges are still under active use by the
host and the guest is erroneously trying to take memory away. Both the
SNP and TDX specs allow conversion to fail due to in-use memory.


This series isn't using the same ioctl; it's introducing new ones to
perform a task that, as far as I can tell so far, KVM can handle by
itself.

I would like to understand this better. How will KVM handle the
conversion process for guest_memfd pages? Can you help walk through an
example sequence for shared-to-private conversion, specifically around
guest_memfd offset states?


- Allows not having to keep track of separate shared/private range
information in KVM.

This patch series is already tracking shared/private range information in KVM.

- Simpler handling of the conversion process, done per guest_memfd
rather than for the full range.
- Userspace can handle the rollback as needed, simplifying error
handling in guest_memfd.
- guest_memfd is the single source of truth and notifies its users of
shareability changes.
- e.g., the IOMMU, userspace, and the KVM MMU can all register for
notifications from guest_memfd directly and will be notified of
invalidations upon shareability attribute updates (see the sketch
below).
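A hypothetical shape for that registration (names invented purely for
illustration):

  /* Hypothetical notifier API - not in this series as posted. */
  struct gmem_invalidate_ops {
          void (*invalidate)(void *priv, pgoff_t start, pgoff_t end);
  };

  /* The KVM MMU, IOMMU drivers, etc. would each register: */
  int kvm_gmem_register_invalidate(struct file *gmem_file,
                                   const struct gmem_invalidate_ops *ops,
                                   void *priv);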

All of these can still be done without introducing a new ioctl.

Cheers,
/fuad

--
Alexey