Re: [PATCH v14 14/44] arm64: RMI: Basic infrastructure for creating a realm.
From: Steven Price
Date: Thu Jun 04 2026 - 12:04:08 EST
On 02/06/2026 15:49, Suzuki K Poulose wrote:
> Hi Marc
>
> On 28/05/2026 08:10, Marc Zyngier wrote:
>> On Wed, 13 May 2026 14:17:22 +0100,
>> Steven Price <steven.price@xxxxxxx> wrote:
>>>
>>> Introduce the skeleton functions for creating and destroying a realm.
>>> The IPA size requested is checked against what the RMM supports.
>>>
>>> The actual work of constructing the realm will be added in future
>>> patches.
>>
>> Again, $SUBJECT doesn't reflect that this is purely a KVM patch.
Indeed - "KVM: arm64: CCA" is a better prefix.
>>>
>>> Signed-off-by: Steven Price <steven.price@xxxxxxx>
>>> ---
>>> Changes since v13:
>>> * Rebased and updated to RMM-v2.0-bet1.
>>> * Auxiliary granules have been removed in RMM-v2.0-bet1
>>> Changes since v12:
>>> * Drop the RMM_PAGE_{SHIFT,SIZE} defines - the RMM is now configured to
>>> be the same as the host's page size.
>>> * Rework delegate/undelegate functions to use the new RMI range based
>>> operations.
>>> Changes since v11:
>>> * Major rework to drop the realm configuration and make the
>>> construction of realms implicit rather than driven by the VMM
>>> directly.
>>> * The code to create RDs, handle VMIDs etc is moved to later patches.
>>> Changes since v10:
>>> * Rename from RME to RMI.
>>> * Move the stage2 cleanup to a later patch.
>>> Changes since v9:
>>> * Avoid walking the stage 2 page tables when destroying the realm -
>>> the real ones are not accessible to the non-secure world, and the RMM
>>> may leave junk in the physical pages when returning them.
>>> * Fix an error path in realm_create_rd() to actually return an
>>> error value.
>>> Changes since v8:
>>> * Fix free_delegated_granule() to not call kvm_account_pgtable_pages();
>>> a separate wrapper will be introduced in a later patch to deal with
>>> RTTs.
>>> * Minor code cleanups following review.
>>> Changes since v7:
>>> * Minor code cleanup following Gavin's review.
>>> Changes since v6:
>>> * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>>> host page size to be larger than 4k while still communicating with an
>>> RMM which uses 4k granules.
>>> Changes since v5:
>>> * Introduce free_delegated_granule() to replace many
>>> undelegate/free_page() instances and centralise the comment on
>>> leaking when the undelegate fails.
>>> * Several other minor improvements suggested by reviews - thanks for
>>> the feedback!
>>> Changes since v2:
>>> * Improved commit description.
>>> * Improved return failures for rmi_check_version().
>>> * Clear contents of PGD after it has been undelegated in case the RMM
>>> left stale data.
>>> * Minor changes to reflect changes in previous patches.
>>> ---
>>> arch/arm64/include/asm/kvm_emulate.h | 29 ++++++++++++++
>>> arch/arm64/include/asm/kvm_rmi.h | 51 +++++++++++++++++++++++++
>>> arch/arm64/kvm/arm.c | 12 ++++++
>>> arch/arm64/kvm/mmu.c | 12 +++++-
>>> arch/arm64/kvm/rmi.c | 57 ++++++++++++++++++++++++++++
>>> 5 files changed, 159 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>>> index 5bf3d7e1d92c..82fd777bd9bb 100644
>>> --- a/arch/arm64/include/asm/kvm_emulate.h
>>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>>> @@ -688,4 +688,33 @@ static inline void vcpu_set_hcrx(struct kvm_vcpu *vcpu)
>>> vcpu->arch.hcrx_el2 |= HCRX_EL2_EnASR;
>>> }
>>> }
>>> +
>>> +static inline bool kvm_is_realm(struct kvm *kvm)
>>> +{
>>> + if (static_branch_unlikely(&kvm_rmi_is_available))
>>> + return kvm->arch.is_realm;
>>> + return false;
>>> +}
>>> +
>>> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>>> +{
>>> + return READ_ONCE(kvm->arch.realm.state);
>>> +}
>>> +
>>> +static inline void kvm_set_realm_state(struct kvm *kvm,
>>> + enum realm_state new_state)
>>> +{
>>> + WRITE_ONCE(kvm->arch.realm.state, new_state);
>>> +}
>>> +
>>> +static inline bool kvm_realm_is_created(struct kvm *kvm)
>>> +{
>>> + return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
>>> +}
>>> +
>>> +static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>>> +{
>>> + return false;
>>> +}
>>> +
>>> #endif /* __ARM64_KVM_EMULATE_H__ */
>>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>>> index 4936007947fd..9de34983ee52 100644
>>> --- a/arch/arm64/include/asm/kvm_rmi.h
>>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>>> @@ -6,12 +6,63 @@
>>> #ifndef __ASM_KVM_RMI_H
>>> #define __ASM_KVM_RMI_H
>>> +#include <asm/rmi_smc.h>
>>> +
>>> +/**
>>> + * enum realm_state - State of a Realm
>>> + */
>>> +enum realm_state {
>>> + /**
>>> + * @REALM_STATE_NONE:
>>> + * Realm has not yet been created. rmi_realm_create() has not
>>> + * yet been called.
>>> + */
>>> + REALM_STATE_NONE,
>>> + /**
>>> + * @REALM_STATE_NEW:
>>> + * Realm is under construction, rmi_realm_create() has been
>>> + * called, but it is not yet activated. Pages may be populated.
>>> + */
>>> + REALM_STATE_NEW,
>>> + /**
>>> + * @REALM_STATE_ACTIVE:
>>> + * Realm has been created and is eligible for execution with
>>> + * rmi_rec_enter(). Pages may no longer be populated with
>>> + * rmi_data_create().
>>> + */
>>> + REALM_STATE_ACTIVE,
>>> + /**
>>> + * @REALM_STATE_DYING:
>>> + * Realm is in the process of being destroyed or has already been
>>> + * destroyed.
>>> + */
>>> + REALM_STATE_DYING,
>>> + /**
>>> + * @REALM_STATE_DEAD:
>>> + * Realm has been destroyed.
>>> + */
>>> + REALM_STATE_DEAD
>>> +};
>>
>> What is the ABI status of this state? Is it purely internal to KVM? Or
>> is it something that the RMM actively tracks?
>
> The states are in line with what the RMM maintains for the Realm state
> (Section A2.2.5 Realm Lifecycle), except that:
>
> 1. REALM_STATE_DYING is really a KVM-internal state indicating that we
> are in the process of destroying the Realm and that no further requests
> need to be serviced.
>
> 2. We don't track the REALM_SYSTEM_OFF and REALM_ZOMBIE states separately
> because:
> a) we always TERMINATE the Realm just before the DESTROY, and
> b) SYSTEM_OFF naturally triggers the teardown path, leading to
> DYING.
>
I'll add a comment:
+ * Mirrors the RMM's Realm lifecycle states where they are meaningful to KVM,
+ * with REALM_STATE_DYING being a KVM-internal state used to prevent further
+ * requests while teardown is in progress. KVM does not track REALM_SYSTEM_OFF
+ * or REALM_ZOMBIE separately as they naturally lead to teardown.
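To make the intent concrete, here is a rough standalone sketch of the
lifecycle as KVM might view it. It is not code from the series: the enum
matches the patch, but realm_transition_ok() and realm_accepts_requests()
are hypothetical helpers, and the set of allowed transitions is my
assumption based on the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as the enum in the patch. */
enum realm_state {
	REALM_STATE_NONE,
	REALM_STATE_NEW,
	REALM_STATE_ACTIVE,
	REALM_STATE_DYING,
	REALM_STATE_DEAD
};

/*
 * Hypothetical helper: is 'from' -> 'to' a legal step?
 * The transition table is an assumption, not taken from the patch.
 */
static inline bool realm_transition_ok(enum realm_state from,
				       enum realm_state to)
{
	switch (from) {
	case REALM_STATE_NONE:
		return to == REALM_STATE_NEW;
	case REALM_STATE_NEW:
		/* A realm may be torn down before it is ever activated. */
		return to == REALM_STATE_ACTIVE || to == REALM_STATE_DYING;
	case REALM_STATE_ACTIVE:
		return to == REALM_STATE_DYING;
	case REALM_STATE_DYING:
		return to == REALM_STATE_DEAD;
	default:
		/* DEAD is terminal. */
		return false;
	}
}

/*
 * Hypothetical helper: DYING is the KVM-internal state that stops
 * further requests while teardown is in progress.
 */
static inline bool realm_accepts_requests(enum realm_state s)
{
	return s == REALM_STATE_NEW || s == REALM_STATE_ACTIVE;
}
```

The point is only that DYING sits between the RMM-visible states and
DEAD, and that nothing is serviced once it has been entered.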
>
>
>>
>>> +
>>> /**
>>> * struct realm - Additional per VM data for a Realm
>>> + *
>>> + * @state: The lifetime state machine for the realm
>>> + * @rd: Kernel mapping of the Realm Descriptor (RD)
>>> + * @params: Parameters for the RMI_REALM_CREATE command
>>> + * @ia_bits: Number of valid Input Address bits in the IPA
>>> */
>>> struct realm {
>>> + enum realm_state state;
>>> + void *rd;
>>
>> Why is this void? Doesn't it have a proper type?
>
> Not really. This is an object that RMM manages (Realm Descriptor)
> in the Realm world. We use it as a parameter to address the Realm.
>
>
>>
>>> + struct realm_params *params;
>>> + unsigned int ia_bits;
>>
>> Consider reordering this structure to avoid holes.
Sure
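For reference, a small standalone sketch of where the holes come from.
The struct names, the helper functions and the particular reordering are
hypothetical; the figures in the comments assume a 4-byte enum and 8-byte
pointers, as on a typical LP64 arm64 build.

```c
#include <assert.h>
#include <stddef.h>

enum realm_state { REALM_STATE_NONE };	/* placeholder: one value is enough */
struct realm_params;			/* opaque for this sketch */

/* Field order as posted: 4 bytes of padding after 'state' and after 'ia_bits'. */
struct realm_as_posted {
	enum realm_state state;
	void *rd;
	struct realm_params *params;
	unsigned int ia_bits;
};

/* One possible reordering (an assumption): pointers first, 32-bit fields last. */
struct realm_reordered {
	void *rd;
	struct realm_params *params;
	enum realm_state state;
	unsigned int ia_bits;
};

/* Sum of the member sizes; anything beyond this in sizeof() is padding. */
#define REALM_MEMBER_BYTES \
	(sizeof(enum realm_state) + sizeof(void *) + \
	 sizeof(struct realm_params *) + sizeof(unsigned int))

static inline size_t realm_padding_as_posted(void)
{
	return sizeof(struct realm_as_posted) - REALM_MEMBER_BYTES;
}

static inline size_t realm_padding_reordered(void)
{
	return sizeof(struct realm_reordered) - REALM_MEMBER_BYTES;
}
```

Under those assumptions the posted layout is 32 bytes with 8 bytes of
padding and the reordered one is 24 bytes with none. pahole on the built
object would give the authoritative answer.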
>>> };
>>> void kvm_init_rmi(void);
>>> +u32 kvm_realm_ipa_limit(void);
>>
>> The use of 'realm' is confusing. This is not a per-realm property, but
>> something global. I'd rather reserve the term 'realm' for CCA VMs (cue
>> the two prototypes below).
>
> Agreed. Perhaps kvm_rmm_ipa_limit()?
Sounds good to me.
>
>>
>>> +
>>> +int kvm_init_realm(struct kvm *kvm);
>>> +void kvm_destroy_realm(struct kvm *kvm);
>>> #endif /* __ASM_KVM_RMI_H */
>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>> index 247e03b33035..18251e561524 100644
>>> --- a/arch/arm64/kvm/arm.c
>>> +++ b/arch/arm64/kvm/arm.c
>>> @@ -264,6 +264,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>> bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>>> + /* Initialise the realm bits after the generic bits are enabled */
>>> + if (kvm_is_realm(kvm)) {
>>> + ret = kvm_init_realm(kvm);
>>> + if (ret)
>>> + goto err_uninit_mmu;
>>> + }
>>> +
>>> return 0;
>>> err_uninit_mmu:
>>> @@ -326,6 +333,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>>> kvm_unshare_hyp(kvm, kvm + 1);
>>> kvm_arm_teardown_hypercalls(kvm);
>>> + if (kvm_is_realm(kvm))
>>> + kvm_destroy_realm(kvm);
>>> }
>>> static bool kvm_has_full_ptr_auth(void)
>>> @@ -486,6 +495,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>> else
>>> r = kvm_supports_cacheable_pfnmap();
>>> break;
>>> + case KVM_CAP_ARM_RMI:
>>> + r = static_key_enabled(&kvm_rmi_is_available);
>>> + break;
>>> default:
>>> r = 0;
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index d089c107d9b7..ba8286472286 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -877,10 +877,14 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>> static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
>>> {
>>> + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>>> u32 kvm_ipa_limit = get_kvm_ipa_limit();
>>> u64 mmfr0, mmfr1;
>>> u32 phys_shift;
>>> + if (kvm_is_realm(kvm))
>>> + kvm_ipa_limit = kvm_realm_ipa_limit();
>>> +
>>> phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>>> if (is_protected_kvm_enabled()) {
>>> phys_shift = kvm_ipa_limit;
>>> @@ -974,6 +978,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>> return -EINVAL;
>>> }
>>> + mmu->arch = &kvm->arch;
>>> +
>>> err = kvm_init_ipa_range(mmu, type);
>>> if (err)
>>> return err;
>>> @@ -982,7 +988,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>> if (!pgt)
>>> return -ENOMEM;
>>> - mmu->arch = &kvm->arch;
>>
>> Why moving this init?
>
> Because kvm_init_ipa_range() needs to know the "kvm" instance to
> detect the limit that applies to Realms.
>
>>
>>> err = KVM_PGT_FN(kvm_pgtable_stage2_init)(pgt, mmu, &kvm_s2_mm_ops);
>>> if (err)
>>> goto out_free_pgtable;
>>> @@ -1114,7 +1119,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>> write_unlock(&kvm->mmu_lock);
>>> if (pgt) {
>>> - kvm_stage2_destroy(pgt);
>>> + if (!kvm_is_realm(kvm))
>>> + kvm_stage2_destroy(pgt);
>>> + else
>>> + kvm_pgtable_stage2_destroy_pgd(pgt);
>>
>> Why can't you make kvm_stage2_destroy() do the right thing? Surely the
>> PTs have to be reclaimed one way or another.
>
> Actually yes, we could make it work. We need to skip walking the page
> table for Realms. We may be able to do the check via
> pgt->mmu->arch->kvm and skip the walk for Realms. (The S2 is unmapped
> and torn down before the RD is destroyed in kvm_destroy_realm(). We
> can't rely on the contents of the PGDs being zero, e.g. with MEC.)
Yes I'll move the check into kvm_stage2_destroy() instead with a comment
explaining what's going on.
>>
>>> kfree(pgt);
>>> }
>>> }
>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index 6e28b669ded2..f51ec667445e 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -5,6 +5,8 @@
>>> #include <linux/kvm_host.h>
>>> +#include <asm/kvm_emulate.h>
>>> +#include <asm/kvm_mmu.h>
>>> #include <asm/kvm_pgtable.h>
>>> #include <asm/rmi_cmds.h>
>>> #include <asm/virt.h>
>>> @@ -14,6 +16,61 @@ static bool rmi_has_feature(unsigned long feature)
>>> return !!u64_get_bits(rmm_feat_reg0, feature);
>>> }
>>> +u32 kvm_realm_ipa_limit(void)
>>> +{
>>> + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>>> +}
>>> +
>>> +void kvm_destroy_realm(struct kvm *kvm)
>>> +{
>>> + struct realm *realm = &kvm->arch.realm;
>>> + size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>>> +
>>> + if (realm->params) {
>>> + free_page((unsigned long)realm->params);
>>> + realm->params = NULL;
>>> + }
>>> +
>>> + if (!kvm_realm_is_created(kvm))
>>> + return;
>>> +
>>> + kvm_set_realm_state(kvm, REALM_STATE_DYING);
>>> +
>>> + write_lock(&kvm->mmu_lock);
>>> + kvm_stage2_unmap_range(&kvm->arch.mmu, 0,
>>> + BIT(realm->ia_bits - 1), true);
>>> + write_unlock(&kvm->mmu_lock);
>>> +
>>> + if (realm->rd) {
>>> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
>>> +
>>> + if (WARN_ON(rmi_realm_terminate(rd_phys)))
>>> + return;
>>> +
>>> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
>>> + return;
>>> + free_delegated_page(rd_phys);
>>> + realm->rd = NULL;
>>> + }
>>> +
>>> + if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size)))
>>> + return;
>>> +
>>> + kvm_set_realm_state(kvm, REALM_STATE_DEAD);
>>> +
>>> + /* Now that the Realm is destroyed, free the entry level RTTs */
>>> + kvm_free_stage2_pgd(&kvm->arch.mmu);
>>> +}
>>
>> This really needs documentation: what happens at each stage? What
>> memory is reclaimed when?
>
> Agreed.
>
>>
>> But even more importantly, why is this built in a completely parallel
>> way, potentially deviating from the existing KVM S2 management?
>
>
> The RMM requires that a Realm is not live at the time of REALM_DESTROY
> (see section A2.2.4 Realm Liveness),
> i.e. all RECs are destroyed and the root RTTs are wiped clean (no live
> mappings) before the RD is destroyed. So we need to make sure all of
> this is done at Realm destroy. Hence we delay the kvm_free_stage2_pgd()
> until we destroy the RD.
>
> Does that help? Maybe we could improve the comments around it.
I'll add a comment in kvm_destroy_realm().
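As a rough illustration of the ordering that comment needs to capture,
here is a toy model of the liveness rule Suzuki describes above. Nothing
in it is real kernel or RMM code: struct mock_realm and all the mock_*()
helpers are hypothetical, the terminate step is left out for brevity, and
the failure handling is only an assumption that mirrors the early returns
in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the state an RMM would check; all names are hypothetical. */
struct mock_realm {
	unsigned int live_mappings;	/* entries still present in the root RTTs */
	unsigned int recs;		/* RECs not yet destroyed */
	bool rd_valid;			/* Realm Descriptor still exists */
	bool pgd_delegated;		/* root RTT pages still owned by the Realm world */
	bool pgd_freed;			/* host has freed the root RTT pages */
};

static inline void mock_unmap_all(struct mock_realm *r)
{
	r->live_mappings = 0;
}

static inline void mock_destroy_recs(struct mock_realm *r)
{
	r->recs = 0;
}

/* Fails while the realm is still live, as REALM_DESTROY is described to. */
static inline int mock_realm_destroy(struct mock_realm *r)
{
	if (r->live_mappings || r->recs)
		return -1;
	r->rd_valid = false;
	return 0;
}

/* The root RTTs only go back to the host once the RD is gone. */
static inline int mock_undelegate_pgd(struct mock_realm *r)
{
	if (r->rd_valid)
		return -1;
	r->pgd_delegated = false;
	return 0;
}

/* Teardown order: unmap, destroy RECs, destroy RD, undelegate, then free. */
static inline int mock_teardown(struct mock_realm *r)
{
	mock_unmap_all(r);
	mock_destroy_recs(r);
	if (mock_realm_destroy(r))
		return -1;	/* leak rather than free pages the RMM still owns */
	if (mock_undelegate_pgd(r))
		return -1;
	r->pgd_freed = true;
	return 0;
}
```

Calling mock_realm_destroy() or mock_undelegate_pgd() out of order fails
in the model, which is the reason kvm_free_stage2_pgd() has to wait until
the RD has been destroyed.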
Thanks,
Steve
>
> Suzuki
>
>
>
>> Thanks,
>> M.
>>
>