Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()

From: Salunke, Hrushikesh

Date: Thu Apr 09 2026 - 05:28:57 EST

On 09-04-2026 14:30, David Hildenbrand (Arm) wrote:

> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On 4/9/26 10:55, Salunke, Hrushikesh wrote:
>> On 08-04-2026 21:02, Andrew Morton wrote:
>>
>>> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>>>
>>>
>>> On Wed, 8 Apr 2026 16:14:03 +0530 "Salunke, Hrushikesh" <hsalunke@xxxxxxx> wrote:
>>>
>>>> kernel_init_pages() runs inside the allocator (post_alloc_hook and
>>>> __free_pages_prepare), so it inherits whatever context the caller is in.
>>>> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
>>>> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
>>>> page allocation happens while the PTE lock and RCU read lock are held,
>>>> making the cond_resched() in the clearing loop illegal:
>>>>
>>>> [ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
>>>> [ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
>>>> [ 1997.353572] preempt_count: 1, expected: 0
>>>> [ 1997.353706] RCU nest depth: 1, expected: 0
>>>> [ 1997.353837] 3 locks held by bash/19725:
>>>> [ 1997.353839] #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
>>>> [ 1997.353850] #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
>>>> [ 1997.353855] #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
>>>> [ 1997.353868] Call Trace:
>>>> [ 1997.353870] <TASK>
>>>> [ 1997.353873] dump_stack_lvl+0x91/0xb0
>>>> [ 1997.353877] __might_resched+0x15f/0x290
>>>> [ 1997.353882] kernel_init_pages+0x4b/0xa0
>>>> [ 1997.353886] get_page_from_freelist+0x406/0x1e60
>>>> [ 1997.353895] __alloc_frozen_pages_noprof+0x1d8/0x1730
>>>> [ 1997.353912] alloc_pages_mpol+0xa4/0x190
>>>> [ 1997.353917] alloc_pages_noprof+0x59/0xd0
>>>> [ 1997.353919] get_free_pages_noprof+0x11/0x40
>>>> [ 1997.353921] __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
>>>> [ 1997.353923] __zap_vma_range+0x1bbd/0x1f40
>>>> [ 1997.353931] unmap_vmas+0xd9/0x1d0
>>>> [ 1997.353934] exit_mmap+0x10a/0x430
>>>> [ 1997.353943] __mmput+0x3d/0x130
>>>> [ 1997.353947] do_exit+0x2a7/0xae0
>>> tlb_next_batch() is (fortunately) using GFP_NOWAIT. Perhaps you can
>>> alter your patch to not call the cond_resched() if caller is attempting
>>> an atomic allocation.
>>
>> Thanks Vlastimil, David, Andrew, and Raghu for the reviews.
>>
>> After looking into this more, I think adding cond_resched() here was
>> overkill. I agree that dropping cond_resched() and
>> PROCESS_PAGES_NON_PREEMPT_BATCH entirely and just calling clear_pages()
>> is the right approach. There's no case where cond_resched() in
>> kernel_init_pages() is both necessary and safe:
>>
>> - It's unsafe in atomic context, as the BUG shows (tlb_next_batch()
>> allocates under PTE lock + RCU read lock via GFP_NOWAIT).
>> - It's unnecessary for common allocations (order-0, mTHP, 2MB) which
>> clear in well under 1ms.
>> - For 1 GiB hugepages, kernel_init_pages() only runs during the
>> initial admin-triggered allocation. When processes later fault on
>> those pages, clearing goes through folio_zero_user() ->
>> clear_contig_highpages(), not kernel_init_pages().
>>
>> So rather than guarding cond_resched() with GFP flags (as Andrew
>> suggested), I'll remove it entirely in v2 to keep things simple and
>> same scheduling characteristics as the original code, just with the
>> batch clearing performance benefit.
>>
>> Regarding the 512 MiB arm64 case that David mentioned the stall from
>> clearing that without cond_resched() under PREEMPT_NONE is acceptable,
>> or should it be handled differently?

Thanks David.

> I mean, it would already happen today, because there is no
> cond_resched(). So nothing to worry about I guess.

Right makes sense.

>
>> I can introduce clear_highpages_kasan_tagged() / clear_highpages()
>> helpers, or keep v2 minimal with the logic inline in
>> kernel_init_pages(). Any preference?
> I'd prefer not sprinkling IS_ENABLED(CONFIG_HIGHMEM) around and simply
> calling a clear_highpages_kasan_tagged() from kernel_init_pages().
>
Sounds good. I'll add a clear_highpages_kasan_tagged() helper in
highmem.h and have kernel_init_pages() call it directly. Will send
v2 shortly.

Regards,
Hrushikesh