Re: [PATCH] mm/damon: fix stale TLB young-state handling on arm64
From: Kunwu Chan
Date: Tue May 26 2026 - 05:04:26 EST
May 26, 2026 at 1:46 AM, "SeongJae Park" <sj@xxxxxxxxxx mailto:sj@xxxxxxxxxx?to=%22SeongJae%20Park%22%20%3Csj%40kernel.org%3E > wrote:
>
> On Mon, 25 May 2026 22:48:46 +0800 Kunwu Chan <kunwu.chan@xxxxxxxxx> wrote:
>
> >
> > From: Kunwu Chan <chentao@xxxxxxxxxx>
> >
> Thank you for this great patch!
Thanks for the reply.
>
> >
> > damon_ptep_mkold() clears the PTE Access Flag so that a later
> > access will set it again and damon_folio_young() can observe it
> > via pte_young().
> >
> > On arm64, however, ptep_test_and_clear_young() clears AF in the
> > page tables without invalidating the corresponding TLB entry.
> > Subsequent accesses can therefore continue hitting a stale TLB
> > entry without a page table walk. The PTE AF bit stays clear,
> > pte_young() reports false, and DAMON treats the region as
> > unaccessed.
> >
> > folio_set_idle() does not help here. It updates only software
> > state, and accesses through a stale TLB entry do not clear the
> > idle flag.
> >
> > As a result, nr_accesses stays low regardless of the real access
> > pattern. DAMOS schemes fail to match, WSS estimation reports
> > zero, and actions like pageout never trigger.
> >
> You are correct. Nonetheless, we intentionally designed DAMON in this way, to
> avoid the performance overhead from the added TLB flushes. Meanwhile, we
> believed this is ok in terms of the monitoring results accuracy on real world
> production environment, becasue such environments would have large amount of
> working set that flushes TLB buffers anyway. We decided to take this way after
> a measurement [1].
>
Understood, thanks for the detailed explanation and context. We were
too focused on the testing issue and overlooked the original tradeoff
between monitoring accuracy and the overhead of TLB flushing.
> And apparently you found this behavior as problematic on a test environment
> that having small size of working set. We found a similar issue in DAMON
> selftest, and we updated the test [2] to simulate the expected real world
> production environments, rather than changing DAMON.
>
That makes sense.
The selftest change in [2] also clarified why DAMON normally works fine
in production workloads. Large and active workloads naturally exceed
TLB coverage, so accessed-bit updates eventually become visible again.
So I agree that adjusting the test instead of forcing TLB flushes in
DAMON was the right tradeoff there.
> But, this kind of question is recurring. In addition to the previous
> discussion, there were a few private inqueries for this issue. And though the
> real world production environment is the priamry target of DAMON, I understand
> it is better to support testing environment, too. So, I think it is better to
> make some changes for this issue, if it doesn't make other problems.
>
At the same time, I also understand why this keeps coming up. Test
environments can differ a lot in workload size, THP usage, VM setup,
and hardware configuration, so stable and reproducible testing is not
always easy across different systems.
So I agree it makes sense to improve this area as long as it does not
add noticeable overhead or cause problems for the normal production
case.
> >
> > Fix this by switching to ptep_clear_flush_young() and
> > pmdp_clear_flush_young().
> >
> > On arm64 these perform the required TLB invalidation after
> > clearing AF. The invalidation is deferred, but still sufficient
> > for DAMON's sampling granularity.
> >
> > On x86, ptep_clear_flush_young() is equivalent to
> > ptep_test_and_clear_young() for base pages, so there is no
> > behavioral change. pmdp_clear_flush_young() additionally performs
> > a flush at PMD level, matching the existing x86 implementation.
> >
> > On powerpc, riscv, and s390, the clear_flush variants currently
> > map back to test_and_clear implementations, so this patch does not
> > change their behavior.
> >
> This change seems much nicer and might be more optimized than my simple
> implementation of tlb flush [1] that I tested before.
>
Thanks. Yes, we were mainly trying to reuse the existing
arch-optimized helpers as much as possible.
> >
> > Reproduced on arm64 (128 CPUs, 7.1.0-rc4):
> >
> > before:
> > WSS estimation: 50th percentile error 100% (reported as zero)
> > apply_interval: schemes never tried
> >
> > after:
> > WSS estimation: 50th percentile error 0.08%
> > apply_interval: passes
> >
> And nice test results. I guess you are referring to the tests in damon-tests?
> Clarifying the context would be nice.
>
Yes, those results are from: make -C tools/testing/selftests/damon run_tests
on the arm64 test machine mentioned above.
The before/after summary was extracted from the relevant failing tests
(sysfs_update_schemes_tried_regions_wss_estimation.py and
damos_apply_interval.py) for brevity.
> Also, have you had a chance to measure the performance impact?
We haven't done detailed performance measurements yet, but we can try to
collect some numbers for the flush overhead on a few different setups.
> So, I'd like to have this change. But, unless we have very clear evidence
> showing this change is not increasing the performance overhead, I'd prefer
> making this as an optional feature.
>
We agree that making it optional sounds safer unless we have solid
evidence showing the overhead is negligible. Keeping the current
default behavior for production workloads also makes sense to me.
> For the user interface, we could add a new sysfs file for the option, say,
> 'flush_sample_tlb' under 'monitoring_attrs' directory.
>
The proposed 'flush_sample_tlb' interface under monitoring_attrs sounds
reasonable to me as well.
> For long term, I'm planning [1] to extend the data attributes monitoring
> feature so that data access becomes just one of the attributes. Once it is
> done, we could control this tlb flush option using the probes interface.
>
> I was initially thinking about asking Kunwu to wait until the data attributes
> monitoring extension is done, and add this tlb flush option on top of that.
> Because, otherwise, we may need to deprecate 'flush_sample_tlb' after the
> extension is done. But, we will anyway need to deprecate a few interfaces
> including 'nr_accesses'. Doing the deprecation of 'flush_sample_tlb' together
> with it shouldn't be huge amount of overhead.
>
> So, unless Kunwu and Lian has other concerns, I'd suggest the
> 'flush_sample_tlb' path.
Yes, we also think deprecating it later together with interfaces such as
'nr_accesses' should be manageable once the data attributes monitoring
extension is ready.
Thanks, Kunwu
>
> [...]
>
> [1] https://lore.kernel.org/20200403103059.12762-1-sjpark@xxxxxxxxxx/
> [2] https://lore.kernel.org/20260117020731.226785-3-sj@xxxxxxxxxx/
>
> Thanks,
> SJ
>