Re: [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time

From: David Hildenbrand (Arm)

Date: Fri Mar 27 2026 - 04:57:34 EST


On 3/27/26 03:08, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> page table sits unused in a deposit list for the lifetime of the THP
> mapping, only to be withdrawn when the PMD is split or zapped. Every
> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> servers where hundreds of gigabytes of memory are mapped as THPs, this
> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> could otherwise satisfy other allocations, including the very PTE page
> table allocations needed when splits eventually occur.
>
> This series removes the pre-deposit and allocates the PTE page table
> lazily — only when a PMD split actually happens. Since a large number
> of THPs are never split (they are zapped wholesale when processes exit or
> munmap the full range), the allocation is avoided entirely in the common
> case.
>
> The pre-deposit pattern exists because split_huge_pmd() was designed as an
> operation that must never fail: if the kernel decides to split, it needs
> a PTE page table, so one is deposited in advance. But "must never fail"
> is an unnecessarily strong requirement. A PMD split is typically triggered
> by a partial operation on a sub-PMD range — partial munmap, partial
> mprotect, COW on a pinned folio, GUP with FOLL_SPLIT_PMD, and similar.
> All of these operations already have well-defined error handling for
> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> fail and propagating the error through these existing paths is the natural
> thing to do. Furthermore, if the system cannot satisfy a single order-0
> allocation for a page table, it is under extreme memory pressure and
> failing the operation is the correct response.
>
> Designing functions like split_huge_pmd() as operations that cannot fail
> has a subtle but real cost to code quality. It forces a pre-allocation
> pattern: every THP creation path must deposit a page table, and every
> split or zap path must withdraw one, creating a hidden coupling between
> widely separated code paths.
>
> This also serves as a code cleanup. On every architecture except powerpc
> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> series removes the generic implementations in pgtable-generic.c and the
> s390/sparc overrides, replacing them with no-op stubs guarded by
> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> on all non-powerpc architectures.
>
> The series is structured as follows:
>
> Patches 1-2: Infrastructure — make split functions return int and
> propagate errors from vma_adjust_trans_huge() through
> __split_vma, vma_shrink, and commit_merge.
>
> Patches 3-15: Handle split failure at every call site — copy_huge_pmd,
> do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
> change_pmd_range (mprotect), follow_pmd_mask (GUP),
> walk_pmd_range (pagewalk), move_page_tables (mremap),
> move_pages (userfaultfd), device migration,
> pagemap_scan_thp_entry (proc), powerpc subpage_prot,
> and dax_iomap_pmd_fault (DAX). This handling only takes
> effect in Patch 17, when the split functions start
> returning -ENOMEM.
>
> Patch 16: Add __must_check to __split_huge_pmd(), split_huge_pmd()
> and split_huge_pmd_address() so the compiler warns on
> unchecked return values.
>
> Patch 17: The actual change — allocate PTE page tables lazily at
> split time instead of pre-depositing at THP creation.
> This is when split functions will actually start returning
> -ENOMEM.
>
> Patch 18: Remove the now-dead deposit/withdraw code on
> non-powerpc architectures.
>
> Patch 19: Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
> split failures.
>
> Patches 20-24: Selftests covering partial munmap, mprotect, mlock,
> mremap, and MADV_DONTNEED on THPs to exercise the
> split paths.
>
> The error handling patches are placed before the lazy allocation patch so
> that every call site is already prepared to handle split failures before
> the failure mode is introduced. This makes each patch independently safe
> to apply and bisect through.
>
> The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
> enabled. The test results are below:
>
> TAP version 13
> 1..5
> # Starting 5 tests from 1 test cases.
> # RUN thp_pmd_split.partial_munmap ...
> # thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
> # thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_munmap
> ok 1 thp_pmd_split.partial_munmap
> # RUN thp_pmd_split.partial_mprotect ...
> # thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
> # thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_mprotect
> ok 2 thp_pmd_split.partial_mprotect
> # RUN thp_pmd_split.partial_mlock ...
> # thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
> # thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_mlock
> ok 3 thp_pmd_split.partial_mlock
> # RUN thp_pmd_split.partial_mremap ...
> # thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
> # thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_mremap
> ok 4 thp_pmd_split.partial_mremap
> # RUN thp_pmd_split.partial_madv_dontneed ...
> # thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
> # thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
> # OK thp_pmd_split.partial_madv_dontneed
> ok 5 thp_pmd_split.partial_madv_dontneed
> # PASSED: 5 / 5 tests passed.
> # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> The patches are based on mm-unstable as of 25 Mar,
> git hash: d6f51e38433489eb22cb65d1bf72ac7993c5bdec
>
> RFC v2 -> v3: https://lore.kernel.org/all/de0dc7ec-7a8d-4b1a-a419-1d97d2e4d510@xxxxxxxxx/

Note that we usually go from RFC to v1.

I'll put this series on my review backlog, but it will take some time
until I get to it (it won't make the next release either way :) ).

--
Cheers,

David