Re: [RFC PATCH 4/5] mm/damon/paddr: skip free pageblocks in migration walk
From: Ravi Jonnalagadda
Date: Mon May 18 2026 - 01:39:26 EST
On Sun, May 17, 2026 at 4:38 PM SeongJae Park <sj@xxxxxxxxxx> wrote:
>
> On Sat, 16 May 2026 14:03:56 -0700 Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxx> wrote:
>
> > damon_pa_migrate() walks every PFN in a region linearly, calling
> > damon_get_folio() for each one. On sparse physical address spaces
> > (e.g., CXL-attached memory), a single DAMON region can span hundreds
> > of gigabytes where most memory is free and sitting in the buddy
> > allocator. Most page lookups are fruitless and dominate kdamond
> > tick time.
>
> On sparse address spaces, the problem would be large DAMON regions of offlined
> memory. Large DAMON regions that are nearly all free memory are a separate
> problem that doesn't require sparse address spaces. If I'm not wrong, the
> above paragraph could be clarified on this point.
>
> >
> > Check at pageblock boundaries (2MB on x86_64) whether the block is
> > entirely free. If the first page of a pageblock is a buddy page at
> > pageblock_order or higher, the entire block is free and can be
> > skipped.
> > Similarly skip pageblocks where pfn_to_online_page() returns
> > NULL.
> >
> > This reduces the iteration from O(region_sz / PAGE_SIZE) to
> > O(region_sz / pageblock_sz) + O(populated_pages).
> >
> > buddy_order_unsafe() is used without zone->lock. A transient false
> > positive (block becomes non-free between the PageBuddy and order
> > checks) costs at most one tick of missed candidates on that block;
> > the next tick re-scans. No correctness consequence as DAMON walks
> > are best-effort.
>
> I initially thought this was a good and reasonable optimization approach,
> but on second thought I have the questions below.
>
> For the large offlined memory space problem, couldn't we simply tune DAMON's
> monitoring region boundaries to ignore the holes?
>
> For the large free memory area case, is it reasonable to assume such
> situations? In production, users will try to utilize as much of the system's
> memory as possible. Then, would there really be such problematically large
> free memory areas?
>
> Could you please enlighten me?
>
Hi SJ,

You're right on the first point. For static offlined memory
holes (memory hotplug gaps, partial socket population, etc.) the
right answer is configuring the monitoring region boundaries to
exclude them upfront, not making the walk skip them at runtime.
The changelog would be clearer if I narrowed the patch to the
free-but-online case.

On the free-but-online case: I agree large free memory areas are
not the steady state on a fully-utilized system. The cases I
had in mind are more limited:

- A workload using a small part of a much larger range, with
  the rest left as headroom (e.g. 64 GB used of a 512 GB
  range).
- Shared tiers where workloads allocate and free memory on their
  own timelines. Any single piece of free memory doesn't last
  long, but on a busy system there's typically a meaningful
  free fraction in the range at any point, especially on a
  slower tier, where workloads prefer faster memory first
  when it's available.

The patch as written is a narrow optimization for those cases:
the pageblock-aligned check is one extra read per
pageblock_nr_pages PFNs (one per 512 on x86_64 with 4 KB pages
and 2 MB pageblocks), so it's effectively a no-op when the
region is fully populated.
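
To make the cost argument concrete, here is a rough userspace model of
the walk. It is only a sketch, not the patch itself: model_walk(),
struct walk_stats, MODEL_PAGEBLOCK_NR_PAGES and the block_free[] array
are hypothetical names invented for illustration. block_free[] stands
in for the "first page is a buddy page at pageblock_order or higher"
test, and page_lookups stands in for damon_get_folio() calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model constant: 2 MB pageblock / 4 KB pages on x86_64. */
#define MODEL_PAGEBLOCK_NR_PAGES 512UL

struct walk_stats {
	unsigned long block_checks;	/* reads done at pageblock boundaries */
	unsigned long page_lookups;	/* stand-in for damon_get_folio() calls */
};

/*
 * Walk [start_pfn, end_pfn). block_free[i] is true when model pageblock i
 * is entirely free; the caller must size it to cover the whole range.
 * This models the idea only; it does no locking and touches no real pages.
 */
static struct walk_stats model_walk(const bool *block_free,
				    unsigned long start_pfn,
				    unsigned long end_pfn)
{
	struct walk_stats st = { 0, 0 };
	unsigned long pfn = start_pfn;

	assert(start_pfn <= end_pfn);

	while (pfn < end_pfn) {
		/* One extra check per pageblock, only at aligned PFNs. */
		if (pfn % MODEL_PAGEBLOCK_NR_PAGES == 0) {
			st.block_checks++;
			if (block_free[pfn / MODEL_PAGEBLOCK_NR_PAGES]) {
				/* Whole block is free: skip it. */
				pfn += MODEL_PAGEBLOCK_NR_PAGES;
				continue;
			}
		}
		st.page_lookups++;
		pfn++;
	}
	return st;
}
```

With 8 model pageblocks of which only the first is populated (the same
1:8 ratio as 64 GB used of 512 GB), this does 8 block checks and 512
page lookups instead of 4096 lookups. With all 8 populated it still
does 4096 lookups plus only 8 extra checks.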

If you don't see those workloads as warranting the change, I'm
happy to drop the patch. If the framing is the issue more than
the change itself, I can respin a v2 with:

- the changelog narrowed to the free-but-online case (no
  offlined-memory framing);
- any suggestions you have on sashiko's review comments.

Thanks,
Ravi

> I will hold off on digging deeper until these high-level questions are answered.
>
>
> Thanks,
> SJ
>
> [...]