Re: [PATCH 1/2] mm/damon/core: optimize kdamond_apply_schemes() by inverting scheme and region loops

From: Josh Law

Date: Sun Mar 22 2026 - 18:00:09 EST




On 22 March 2026 21:44:18 GMT, SeongJae Park <sj@xxxxxxxxxx> wrote:
>Hello Josh,
>
>On Sun, 22 Mar 2026 18:46:40 +0000 Josh Law <objecting@xxxxxxxxxxxxx> wrote:
>
>> Currently, kdamond_apply_schemes() iterates over all targets, then over all
>> regions, and finally calls damon_do_apply_schemes() which iterates over
>> all schemes. This nested structure causes scheme-level invariants (such as
>> time intervals, activation status, and quota limits) to be evaluated inside
>> the innermost loop for every single region.
>>
>> If a scheme is inactive, has not reached its apply interval, or has already
>> fulfilled its quota (quota->charged_sz >= quota->esz), the kernel still
>> needlessly iterates through thousands of regions only to repeatedly
>> evaluate these same scheme-level conditions and continue.
>>
>> This patch inlines damon_do_apply_schemes() into kdamond_apply_schemes()
>> and inverts the loop ordering. It now iterates over schemes on the outside,
>> and targets/regions on the inside.
>>
>> This allows the code to evaluate scheme-level limits once per scheme.
>> If a scheme's quota is met or it is inactive, we completely bypass the
>> O(Targets * Regions) inner loop for that scheme. This drastically reduces
>> unnecessary branching, cache thrashing, and CPU overhead in the kdamond
>> hot path.
>
>That makes sense at a high level. But, this will make a kind of behavioral
>difference that could be user-visible. I am failing to find a clear use
>case that really depends on the old behavior. But, still, it feels like not a
>small change to me.
>
>So, I'd like to be conservative about this change, unless there is good
>evidence showing very clear and impactful real-world benefits. Can you share
>such evidence if you have it?
>
>
>Thanks,
>SJ
>
>[...]


My last email:

Hi SeongJae,

I've looked into this further and ran some extra benchmarks on the kdamond hot path to see if the gains were actually meaningful.

The main issue right now is that kdamond spends a lot of time "spinning" through regions even when there's no work to do. For example, if a user has 10,000 regions and a few schemes that have already hit their quotas or are disabled by watermarks, the current code still iterates through every single region just to check those same flags 10,000 times.

In my tests:

- Typical setup (10 schemes, 2k regions): ~3.4x faster.
- Large scale (10k regions, schemes at quota): ~7x faster.
- Idle schemes (deactivated by watermarks): ~7x faster.

It's also a cache locality win. Right now the CPU has to bounce between the metadata of different schemes inside the innermost loop, for every region. Inverting the loops lets us process one scheme to completion, which keeps its hot data in L1/L2 and gives about a 10% gain even when every scheme is active.

The goal isn't just to shave cycles, but to make DAMON scale better on high-memory systems (512GB+) where the region count is high. This keeps the background "CPU floor" much lower when DAMON is supposed to be idle or throttled.

V/R

Josh Law