Re: [PATCH v10 12/21] gpu: nova-core: mm: Add unified page table entry wrapper enums

From: Joel Fernandes

Date: Wed Apr 08 2026 - 16:20:49 EST

Hi Alex, Eliot, Danilo,

Thanks for taking a look. Let me respond to the specific points below.

On Wed, 08 Apr 2026, Alexandre Courbot wrote:
> After a quick look I'd say that having a trait here would actually be
> *good* for correctness and maintainability.
>
> The current design implies that every operation on a page table (most
> likely using the walker) goes through a branching point. Just looking at
> `PtWalk::read_pte_at_level`, there are already at least 6
> `if version == 2 { } else { }` branches that all resolve to the same
> result. Include walking down the PDEs and you have at least a dozen of
> these just to resolve a virtual address. I know CPUs are fast, but this
> is still wasted cycles for no good reason.

I did some measurements and there is no notieceable difference in both
approaches. I ran perf and loaded nova with self-tests running. The extra
potential branching is lost in the noise. In both cases, loading nova and
running the self-tests has ~119.7M branch instructions on my Ampere. The total
instruction count is also identical (~615M).

I measured like this:
perf stat -e
branches,branch-misses,cache-references,cache-misses,instructions,cycles --
modprobe nova_core

So I think the branching argument is not a strong one. I also did more
measurements and the dominant time taken is MMIO. During the map prep and
execute, page table walks are done. A TLB flush alone costs ~1.4 microseconds.
And PRAMIN BAR0 writes to write the PTE is also about 1 microsecond. Considering
this, I don't think the extra branching argument holds (even without branch
prediction and speculation).

Also some branches cannot be eliminated even with parameterization:

if level == self.mmu_version.dual_pde_level() {
// 128-bit dual PDE read
} else {
// Regular 64-bit PDE read
}

This isn't really a version branch -- it's a structural branch that
distinguishes between 64-bit PDE and 128-bit dual PDE entries. Any MMU
version with a dual PDE level would need this same distinction.

I also did code-generation size analysis (see diff of code used below):

Code generation analysis:

Module .ko size: Before: 511,792 bytes After: 524,464 bytes (+2.5%)
.text section: Before: 112,620 bytes After: 116,628 bytes (+4,008 bytes)

The +4K .text growth is the monomorphization cost: every generic function
is compiled twice (once for MmuV2, once for MmuV3).

> If you use a trait here, and make `PtWalk` generic against it, you can
> optimize this away. We had a similar situation when we introduced Turing
> support and the v2 ucode header, and tried both approaches: the
> trait-based one was slightly shorter, and arguably more readable.

Actually I was the one who suggested traits for Falcon ucode descriptor if you
see this thread [1]. So basically you and Eliot are telling me to do what I
suggested in [1]. :-) However, I disagree that it is the right choice for this code.

[1] https://lore.kernel.org/all/20251117231028.GA1095236@joelbox2/

I think the two cases are quite different in complexity:

The falcon ucode descriptor is essentially a set of flat field accessors
and a few params (imem_sec_load_params, dmem_load_params).
The trait has ~10 simple getter methods. There's no multi-level hierarchy,
no walker, and no generic propagation.

The MMU page table case is structurally different. Making PtWalk generic
over an Mmu trait would require:

- PtWalk<M: Mmu> (the walker)
- Plus all the associated types: M::Pte, M::Pde, M::DualPde each
needing their own trait bounds

And we would also need:
- Vmm<M: Mmu> (which creates PtWalk)
- BarUser<M: Mmu> (which creates Vmm)

I am also against making Vmm an enum as Eliot suggested:
enum Vmm {
V2(VmmInner<MmuV2>),
V3(VmmInner<MmuV3>),
}

That moves the version complexity up to the reader. Code complexity IMO should
decrease as we go up abstractions, making it easier for users (Vmm/Bar).

If you look at the the changes in vmm.rs to handle version dispatch there [2]:
Added: +109
Removed: -28

[2]
https://github.com/Edgeworth/linux/commit/3627af550b61256184d589e7ec666c1108971f0e

The main benefit of my approach is version-specific dispatch complexity is
completely isolated inside MmuVersion thus making the code outside of
pagetable.rs much more readable, without having to parametrize anything, and
without code size increase. I think that is worth considering.

> But the main argument to use a trait here IMO is that it enables
> associated types and constants. That's particularly critical since some
> equivalent fields have different lengths between v2 and v3. An
> associated `Bounded` type for these would force the caller to validate
> the length of these fields before calling a non-fallible operation,
> which is exactly the level of caution that we want when dealing with
> page tables.

I think Bounded validation is orthogonal to the dispatch model.
We can add Bounded to the current design without restructuring
into traits. For example:

// In ver2::Pte
pub fn new_vram(pfn: Bounded<Pfn, 25>, writable: bool) -> Self { ... }

// In ver3::Pte
pub fn new_vram(pfn: Bounded<Pfn, 40>, writable: bool) -> Self { ... }

The unified Pte enum wrapper already dispatches to the correct
version-specific constructor, which would enforce the correct Bounded
constraint for that version.

> In order to fully benefit from it, we will need the bitfield macro from
> the `kernel` crate so the PDE/PTE fields can be `Bounded`, I will try to
> make it available quickly in a patch that you can depend on.

That would be great, and I'd be happy to integrate Bounded validation once
the macro is available. I just don't think we need to restructure the
dispatch model in order to benefit from it.

> But long story short, and although I need to dive deeper into the code,
> this looks like a good candidate for using a trait and associated types.

The walker code (walk.rs) is already version-agnostic and reads cleanly.
The version dispatch is encapsulated behind method calls, not exposed as
inline if/else blocks.

Generic propagation (or version-specific dispatch at higher levels) adds more
complexity at higher layers.

Enclosed below [3] is the diff I used for my testing with the data, I don't
really see a net readability win there (IMO, it is a net-loss in readability).

[3]
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=trait-pt-dispatch&id=5eb0e98af11ba608ff4d0f7a06065ee863f5066a

thanks,

--
Joel Fernandes