Re: [PATCH v14 14/44] arm64: RMI: Basic infrastructure for creating a realm.

From: Steven Price

Date: Thu Jun 04 2026 - 12:04:08 EST


On 02/06/2026 15:49, Suzuki K Poulose wrote:
> Hi Marc
>
> On 28/05/2026 08:10, Marc Zyngier wrote:
>> On Wed, 13 May 2026 14:17:22 +0100,
>> Steven Price <steven.price@xxxxxxx> wrote:
>>>
>>> Introduce the skeleton functions for creating and destroying a realm.
>>> The IPA size requested is checked against what the RMM supports.
>>>
>>> The actual work of constructing the realm will be added in future
>>> patches.
>>
>> Again, $SUBJECT doesn't reflect that this is purely a KVM patch.

Indeed - "KVM: arm64: CCA" is a better prefix.

>>>
>>> Signed-off-by: Steven Price <steven.price@xxxxxxx>
>>> ---
>>> Changes since v13:
>>>   * Rebased and updated to RMM-v2.0-bet1.
>>>   * Auxiliary granules have been removed in RMM-v2.0-bet1
>>> Changes since v12:
>>>   * Drop the RMM_PAGE_{SHIFT,SIZE} defines - the RMM is now configured
>>>     to be the same as the host's page size.
>>>   * Rework delegate/undelegate functions to use the new RMI range based
>>>     operations.
>>> Changes since v11:
>>>   * Major rework to drop the realm configuration and make the
>>>     construction of realms implicit rather than driven by the VMM
>>>     directly.
>>>   * The code to create RDs, handle VMIDs etc is moved to later patches.
>>> Changes since v10:
>>>   * Rename from RME to RMI.
>>>   * Move the stage2 cleanup to a later patch.
>>> Changes since v9:
>>>   * Avoid walking the stage 2 page tables when destroying the realm -
>>>     the real ones are not accessible to the non-secure world, and the
>>>     RMM may leave junk in the physical pages when returning them.
>>>   * Fix an error path in realm_create_rd() to actually return an error
>>>     value.
>>> Changes since v8:
>>>   * Fix free_delegated_granule() to not call
>>>     kvm_account_pgtable_pages(); a separate wrapper will be introduced
>>>     in a later patch to deal with RTTs.
>>>   * Minor code cleanups following review.
>>> Changes since v7:
>>>   * Minor code cleanup following Gavin's review.
>>> Changes since v6:
>>>   * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>>>     host page size to be larger than 4k while still communicating with
>>>     an RMM which uses 4k granules.
>>> Changes since v5:
>>>   * Introduce free_delegated_granule() to replace many
>>>     undelegate/free_page() instances and centralise the comment on
>>>     leaking when the undelegate fails.
>>>   * Several other minor improvements suggested by reviews - thanks for
>>>     the feedback!
>>> Changes since v2:
>>>   * Improved commit description.
>>>   * Improved return failures for rmi_check_version().
>>>   * Clear contents of PGD after it has been undelegated in case the RMM
>>>     left stale data.
>>>   * Minor changes to reflect changes in previous patches.
>>> ---
>>>   arch/arm64/include/asm/kvm_emulate.h | 29 ++++++++++++++
>>>   arch/arm64/include/asm/kvm_rmi.h     | 51 +++++++++++++++++++++++++
>>>   arch/arm64/kvm/arm.c                 | 12 ++++++
>>>   arch/arm64/kvm/mmu.c                 | 12 +++++-
>>>   arch/arm64/kvm/rmi.c                 | 57 ++++++++++++++++++++++++++++
>>>   5 files changed, 159 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>>> index 5bf3d7e1d92c..82fd777bd9bb 100644
>>> --- a/arch/arm64/include/asm/kvm_emulate.h
>>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>>> @@ -688,4 +688,33 @@ static inline void vcpu_set_hcrx(struct kvm_vcpu *vcpu)
>>>               vcpu->arch.hcrx_el2 |= HCRX_EL2_EnASR;
>>>       }
>>>   }
>>> +
>>> +static inline bool kvm_is_realm(struct kvm *kvm)
>>> +{
>>> +    if (static_branch_unlikely(&kvm_rmi_is_available))
>>> +        return kvm->arch.is_realm;
>>> +    return false;
>>> +}
>>> +
>>> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>>> +{
>>> +    return READ_ONCE(kvm->arch.realm.state);
>>> +}
>>> +
>>> +static inline void kvm_set_realm_state(struct kvm *kvm,
>>> +                       enum realm_state new_state)
>>> +{
>>> +    WRITE_ONCE(kvm->arch.realm.state, new_state);
>>> +}
>>> +
>>> +static inline bool kvm_realm_is_created(struct kvm *kvm)
>>> +{
>>> +    return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
>>> +}
>>> +
>>> +static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>>> +{
>>> +    return false;
>>> +}
>>> +
>>>   #endif /* __ARM64_KVM_EMULATE_H__ */
>>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>>> index 4936007947fd..9de34983ee52 100644
>>> --- a/arch/arm64/include/asm/kvm_rmi.h
>>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>>> @@ -6,12 +6,63 @@
>>>   #ifndef __ASM_KVM_RMI_H
>>>   #define __ASM_KVM_RMI_H
>>>   +#include <asm/rmi_smc.h>
>>> +
>>> +/**
>>> + * enum realm_state - State of a Realm
>>> + */
>>> +enum realm_state {
>>> +    /**
>>> +     * @REALM_STATE_NONE:
>>> +     *      Realm has not yet been created. rmi_realm_create() has not
>>> +     *      yet been called.
>>> +     */
>>> +    REALM_STATE_NONE,
>>> +    /**
>>> +     * @REALM_STATE_NEW:
>>> +     *      Realm is under construction, rmi_realm_create() has been
>>> +     *      called, but it is not yet activated. Pages may be populated.
>>> +     */
>>> +    REALM_STATE_NEW,
>>> +    /**
>>> +     * @REALM_STATE_ACTIVE:
>>> +     *      Realm has been created and is eligible for execution with
>>> +     *      rmi_rec_enter(). Pages may no longer be populated with
>>> +     *      rmi_data_create().
>>> +     */
>>> +    REALM_STATE_ACTIVE,
>>> +    /**
>>> +     * @REALM_STATE_DYING:
>>> +     *      Realm is in the process of being destroyed or has already
>>> +     *      been destroyed.
>>> +     */
>>> +    REALM_STATE_DYING,
>>> +    /**
>>> +     * @REALM_STATE_DEAD:
>>> +     *      Realm has been destroyed.
>>> +     */
>>> +    REALM_STATE_DEAD
>>> +};
>>
>> What is the ABI status of this state? Is it purely internal to KVM? Or
>> is it something that the RMM actively tracks?
>
> The states are in line with what the RMM maintains for the Realm state,
> (Section A2.2.5 Realm Lifecycle)
> except for :
>
> 1. REALM_STATE_DYING is really a KVM internal state to indicate that we
> are in the process of destroying the Realm and no further requests
> need to be serviced
>
> 2. We don't track the REALM_SYSTEM_OFF, REALM_ZOMBIE states separately
> as we :
>  a) Always TERMINATE the Realm, just before the DESTROY
>  b) SYSTEM_OFF is naturally triggering the tear down path, leading to
> DYING.
>

I'll add a comment:

+ * Mirrors the RMM's Realm lifecycle states where they are meaningful to KVM,
+ * with REALM_STATE_DYING being a KVM-internal state used to prevent further
+ * requests while teardown is in progress. KVM does not track REALM_SYSTEM_OFF
+ * or REALM_ZOMBIE separately as they naturally lead to teardown.

>
>
>>
>>> +
>>>   /**
>>>    * struct realm - Additional per VM data for a Realm
>>> + *
>>> + * @state: The lifetime state machine for the realm
>>> + * @rd: Kernel mapping of the Realm Descriptor (RD)
>>> + * @params: Parameters for the RMI_REALM_CREATE command
>>> + * @ia_bits: Number of valid Input Address bits in the IPA
>>>    */
>>>   struct realm {
>>> +    enum realm_state state;
>>> +    void *rd;
>>
>> Why is this void? Doesn't it have a proper type?
>
> Not really. This is an object that RMM manages (Realm Descriptor)
> in the Realm world. We use it as a parameter to address the Realm.
>
>
>>
>>> +    struct realm_params *params;
>>> +    unsigned int ia_bits;
>>
>> Consider reordering this structure to avoid holes.

Sure
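
For illustration only, here is a userspace sketch of what the reordering buys. The struct names realm_v14 and realm_reordered and the helper realm_layout_bytes_saved() are made up for this mail, and the exact numbers depend on the ABI (on a typical LP64 build the posted layout has a 4-byte hole after 'state' plus tail padding). The final field order in the next version may differ.

```c
#include <stddef.h>

enum realm_state {
	REALM_STATE_NONE,
	REALM_STATE_NEW,
	REALM_STATE_ACTIVE,
	REALM_STATE_DYING,
	REALM_STATE_DEAD
};

struct realm_params;	/* opaque here: only a pointer to it is stored */

/* Field order as posted in v14 (hypothetical name for this sketch). */
struct realm_v14 {
	enum realm_state state;
	void *rd;
	struct realm_params *params;
	unsigned int ia_bits;
};

/* One possible reordering: pointers first, then the two 4-byte members. */
struct realm_reordered {
	void *rd;
	struct realm_params *params;
	enum realm_state state;
	unsigned int ia_bits;
};

/* Bytes saved by the reordering on the ABI this is compiled for. */
size_t realm_layout_bytes_saved(void)
{
	return sizeof(struct realm_v14) - sizeof(struct realm_reordered);
}
```

Running pahole on the real struct is of course the better check; the above is just a quick way to see the effect.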

>>>   };
>>>     void kvm_init_rmi(void);
>>> +u32 kvm_realm_ipa_limit(void);
>>
>> The use of 'realm' is confusing. This is not a per-realm property, but
>> something global. I'd rather reserve the term 'realm' for CCA VMs (cue
>> the two prototypes below).
>
> Agreed. Perhaps, kvm_rmm_ipa_limit() ?

Sounds good to me.

>
>>
>>> +
>>> +int kvm_init_realm(struct kvm *kvm);
>>> +void kvm_destroy_realm(struct kvm *kvm);
>>>     #endif /* __ASM_KVM_RMI_H */
>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>> index 247e03b33035..18251e561524 100644
>>> --- a/arch/arm64/kvm/arm.c
>>> +++ b/arch/arm64/kvm/arm.c
>>> @@ -264,6 +264,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>>         bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>>>   +    /* Initialise the realm bits after the generic bits are enabled */
>>> +    if (kvm_is_realm(kvm)) {
>>> +        ret = kvm_init_realm(kvm);
>>> +        if (ret)
>>> +            goto err_uninit_mmu;
>>> +    }
>>> +
>>>       return 0;
>>>     err_uninit_mmu:
>>> @@ -326,6 +333,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>>>       kvm_unshare_hyp(kvm, kvm + 1);
>>>         kvm_arm_teardown_hypercalls(kvm);
>>> +    if (kvm_is_realm(kvm))
>>> +        kvm_destroy_realm(kvm);
>>>   }
>>>     static bool kvm_has_full_ptr_auth(void)
>>> @@ -486,6 +495,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>           else
>>>               r = kvm_supports_cacheable_pfnmap();
>>>           break;
>>> +    case KVM_CAP_ARM_RMI:
>>> +        r = static_key_enabled(&kvm_rmi_is_available);
>>> +        break;
>>>         default:
>>>           r = 0;
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index d089c107d9b7..ba8286472286 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -877,10 +877,14 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>>     static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
>>>   {
>>> +    struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>>>       u32 kvm_ipa_limit = get_kvm_ipa_limit();
>>>       u64 mmfr0, mmfr1;
>>>       u32 phys_shift;
>>>   +    if (kvm_is_realm(kvm))
>>> +        kvm_ipa_limit = kvm_realm_ipa_limit();
>>> +
>>>       phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>>>       if (is_protected_kvm_enabled()) {
>>>           phys_shift = kvm_ipa_limit;
>>> @@ -974,6 +978,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>>           return -EINVAL;
>>>       }
>>>   +    mmu->arch = &kvm->arch;
>>> +
>>>       err = kvm_init_ipa_range(mmu, type);
>>>       if (err)
>>>           return err;
>>> @@ -982,7 +988,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>>       if (!pgt)
>>>           return -ENOMEM;
>>>   -    mmu->arch = &kvm->arch;
>>
>> Why moving this init?
>
> Because we need to know the "kvm" instance for kvm_init_ipa_range() to
> detect the limit that applies to Realms.
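
To spell out the dependency: as far as I recall, kvm_s2_mmu_to_kvm() derives the kvm pointer from mmu->arch with container_of(), so mmu->arch has to be set before kvm_init_ipa_range() can call it. A minimal userspace sketch of that pattern follows; every demo_* name is a stand-in invented for this mail, not a kernel definition.

```c
#include <stddef.h>

/* Stand-ins for struct kvm_arch, struct kvm_s2_mmu and struct kvm. */
struct demo_kvm_arch {
	int is_realm;
};

struct demo_s2_mmu {
	struct demo_kvm_arch *arch;	/* back-pointer, NULL until initialised */
};

struct demo_kvm {
	int id;
	struct demo_kvm_arch arch;	/* embedded, as arch is in struct kvm */
};

/* Userspace equivalent of the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Recover the owning VM from the MMU. This is only meaningful once
 * mmu->arch points at the 'arch' member embedded in a demo_kvm.
 */
struct demo_kvm *demo_s2_mmu_to_kvm(struct demo_s2_mmu *mmu)
{
	return demo_container_of(mmu->arch, struct demo_kvm, arch);
}
```

With mmu->arch still unset the lookup would compute a bogus pointer, which is why the assignment has to come before the IPA range setup.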
>
>>
>>>       err = KVM_PGT_FN(kvm_pgtable_stage2_init)(pgt, mmu, &kvm_s2_mm_ops);
>>>       if (err)
>>>           goto out_free_pgtable;
>>> @@ -1114,7 +1119,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>>       write_unlock(&kvm->mmu_lock);
>>>         if (pgt) {
>>> -        kvm_stage2_destroy(pgt);
>>> +        if (!kvm_is_realm(kvm))
>>> +            kvm_stage2_destroy(pgt);
>>> +        else
>>> +            kvm_pgtable_stage2_destroy_pgd(pgt);
>>
>> Why can't you make kvm_stage2_destroy() do the right thing? Surely the
>> PTs have to be reclaimed one way or another.
>
> Actually yes, we could make it work. We need to skip walking the page
> table for Realms. We may be able to do the checks via
> pgt->mmu->arch->kvm and skip the walking for Realms. (The S2 is unmapped
> and torn down before the RD is destroyed in kvm_destroy_realm(). We
> can't rely on the contents of the PGDs to be zero - e.g., with MEC.)

Yes, I'll move the check into kvm_stage2_destroy() instead, with a
comment explaining what's going on.
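
Roughly the shape I have in mind, modelled in userspace so it can be compiled standalone. All of the mock_* names are hypothetical stand-ins, and the real function will need to look up the VM from the page table rather than through a direct pointer as done here.

```c
#include <stdbool.h>

/* Stand-ins only; none of these are the real kernel structures. */
struct mock_kvm {
	bool is_realm;
};

struct mock_pgtable {
	struct mock_kvm *kvm;
	bool walked;		/* set if the table walk ran */
	bool pgd_freed;		/* set once the PGD pages were released */
};

/* Stand-in for walking the stage 2 tables and freeing each level. */
static void mock_walk_and_free_tables(struct mock_pgtable *pgt)
{
	pgt->walked = true;
}

/* Stand-in for kvm_pgtable_stage2_destroy_pgd(). */
static void mock_destroy_pgd(struct mock_pgtable *pgt)
{
	pgt->pgd_freed = true;
}

/*
 * For a realm the real stage 2 tables are not accessible to the host,
 * and the PGD pages may contain junk once the RMM hands them back, so
 * skip the walk and only release the PGD. Normal VMs are unchanged.
 */
void mock_stage2_destroy(struct mock_pgtable *pgt)
{
	if (!pgt->kvm->is_realm)
		mock_walk_and_free_tables(pgt);

	mock_destroy_pgd(pgt);
}
```

That keeps kvm_free_stage2_pgd() free of realm special cases, which I think is what Marc was after.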

>>
>>>           kfree(pgt);
>>>       }
>>>   }
>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index 6e28b669ded2..f51ec667445e 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -5,6 +5,8 @@
>>>     #include <linux/kvm_host.h>
>>>   +#include <asm/kvm_emulate.h>
>>> +#include <asm/kvm_mmu.h>
>>>   #include <asm/kvm_pgtable.h>
>>>   #include <asm/rmi_cmds.h>
>>>   #include <asm/virt.h>
>>> @@ -14,6 +16,61 @@ static bool rmi_has_feature(unsigned long feature)
>>>       return !!u64_get_bits(rmm_feat_reg0, feature);
>>>   }
>>>   +u32 kvm_realm_ipa_limit(void)
>>> +{
>>> +    return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>>> +}
>>> +
>>> +void kvm_destroy_realm(struct kvm *kvm)
>>> +{
>>> +    struct realm *realm = &kvm->arch.realm;
>>> +    size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>>> +
>>> +    if (realm->params) {
>>> +        free_page((unsigned long)realm->params);
>>> +        realm->params = NULL;
>>> +    }
>>> +
>>> +    if (!kvm_realm_is_created(kvm))
>>> +        return;
>>> +
>>> +    kvm_set_realm_state(kvm, REALM_STATE_DYING);
>>> +
>>> +    write_lock(&kvm->mmu_lock);
>>> +    kvm_stage2_unmap_range(&kvm->arch.mmu, 0,
>>> +                   BIT(realm->ia_bits - 1), true);
>>> +    write_unlock(&kvm->mmu_lock);
>>> +
>>> +    if (realm->rd) {
>>> +        phys_addr_t rd_phys = virt_to_phys(realm->rd);
>>> +
>>> +        if (WARN_ON(rmi_realm_terminate(rd_phys)))
>>> +            return;
>>> +
>>> +        if (WARN_ON(rmi_realm_destroy(rd_phys)))
>>> +            return;
>>> +        free_delegated_page(rd_phys);
>>> +        realm->rd = NULL;
>>> +    }
>>> +
>>> +    if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size)))
>>> +        return;
>>> +
>>> +    kvm_set_realm_state(kvm, REALM_STATE_DEAD);
>>> +
>>> +    /* Now that the Realm is destroyed, free the entry level RTTs */
>>> +    kvm_free_stage2_pgd(&kvm->arch.mmu);
>>> +}
>>
>> This really needs documentation: what happens at each stage? What
>> memory is reclaimed when?
>
> Agreed.
>
>>
>> But even more importantly, why is this built in a completely parallel
>> way, potentially deviating from the existing KVM S2 management?
>
>
> The RMM requires that a Realm is not live at the time of REALM_DESTROY.
> (See section A2.2.4 Realm Liveness).
> i.e., All RECs are destroyed, Root RTTs wiped clean (no live mappings)
> before the RD is destroyed. So, we need to make sure all of this is
> done at Realm Destroy. Hence we delay the kvm_free_stage2_pgd() until
> we destroy the RD.
>
> Does that help? Maybe we could improve the comments around it.

I'll add a comment in kvm_destroy_realm().
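
As a starting point for that comment, here is the ordering from the posted function written out as a small userspace model. The sketch_* names are invented for this mail, the RMM is reduced to a single mock call that refuses to destroy a live realm, and REC teardown and the rmi_realm_terminate() call are left out for brevity.

```c
#include <stdbool.h>

enum sketch_state { SK_NONE, SK_NEW, SK_ACTIVE, SK_DYING, SK_DEAD };

struct sketch_realm {
	enum sketch_state state;
	int live_mappings;	/* stand-in for live entries in the root RTTs */
	bool rd_present;	/* stand-in for the delegated Realm Descriptor */
	bool pgd_delegated;	/* stand-in for the delegated entry-level RTTs */
	bool pgd_freed;		/* set once the host has freed the PGD pages */
};

/* Mock of REALM_DESTROY: fails while the realm is still live. */
static int sketch_rmi_realm_destroy(struct sketch_realm *r)
{
	if (r->live_mappings)
		return -1;
	r->rd_present = false;
	return 0;
}

int sketch_destroy_realm(struct sketch_realm *r)
{
	if (r->state == SK_NONE)
		return 0;	/* never created: nothing to reclaim */

	/* 1. Mark the realm as dying so no further requests are serviced. */
	r->state = SK_DYING;

	/* 2. Unmap stage 2 so the realm has no live mappings. */
	r->live_mappings = 0;

	/* 3. Destroy the RD; on failure return early, as the posted code does. */
	if (sketch_rmi_realm_destroy(r))
		return -1;

	/* 4. Undelegate the entry-level RTTs back to the host. */
	r->pgd_delegated = false;
	r->state = SK_DEAD;

	/* 5. Only now is it safe for the host to free the PGD pages. */
	r->pgd_freed = true;
	return 0;
}
```

The real comment will say which memory is reclaimed at each of those steps.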

Thanks,
Steve

>
> Suzuki
>
>
>
>> Thanks,
>>
>>     M.
>>
>