Re: [RFC PATCH 0/7] mm/damon: hardware-sampled access reports + AMD IBS Op example
From: Ravi Jonnalagadda
Date: Mon May 25 2026 - 20:05:28 EST
On Wed, May 20, 2026 at 5:32 PM SeongJae Park <sj@xxxxxxxxxx> wrote:
>
> On Wed, 20 May 2026 12:01:43 -0700 Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxx> wrote:
>
> > On Mon, May 18, 2026 at 11:19 PM SeongJae Park <sj@xxxxxxxxxx> wrote:
> > >
> > > + Akinobu
> > >
> > > Hello Ravi,
> > >
> > > On Sat, 16 May 2026 15:34:25 -0700 Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxx> wrote:
> > >
> > > > Hi all,
> > > >
> > > > This is an RFC, not for merge. The series exercises and validates
> > > > damon_report_access() -- the consumer API SeongJae introduced in [1]
> > > > -- as a substrate for ingesting access reports from hardware-sampling
> > > > sources. The series includes one worked-example backend, an AMD IBS
> > > > Op module (damon_ibs.ko), that runs on Zen 3+ silicon via the
> > > > existing perf event subsystem.
> > >
> > > Thank you for sharing this great RFC series!
> > >
> > > [...]
> > > > Why a hardware-source primitive complements existing primitives
> > > > ===============================================================
> > > [...]
> > > > Both primitives produce a view of hotness that converges to the
> > > > true distribution over the aggregation interval. For systems where
> > > > the address space is small relative to the aggregation rate, this is
> > > > the right tool. On large heterogeneous-memory systems with goal-
> > > > driven schemes asking the closed-loop tuner to converge on a target
> > > > distribution, a complementary lower-latency view of accesses can
> > > > tighten the loop -- reducing the time DAMON's nr_accesses takes to
> > > > reflect the workload's actual access distribution, which in turn
> > > > reduces ramp duration and oscillation amplitude during convergence
> > > > of goal-driven schemes.
> > > >
> > > > A hardware-sampling primitive provides this complementary view:
> > > > hardware retirement records each access at its natural event rate,
> > > > with a physical address per sample, independent of TLB state and
> > > > independent of the unmap/fault path.
> > >
> > > > Yes, I fully agree. Different access check primitives have different
> > > > characteristics.
> > >
> > > [...]
> > >
> > > > Demonstration
> > > > =============
> > > [...]
> > > > In both regimes, convergence to target is quick, and the workload's
> > > > measured DRAM share then holds within 1.3 percentage points of
> > > > target with standard deviation under 1.3 percentage points, sustained
> > > > over runs of 15-30 minutes per target.
> > >
> > > I understand this demonstration shows your AMD IBS-based version of DAMON is
> > > functioning as expected. Thank you for sharing this!
> > >
> > > [...]
> > > > What's in this series
> > > > =====================
> > > >
> > > > Patch 1. mm/damon/core: refcount ops owner module to prevent
> > > > rmmod UAF
> > > > Patch 2. mm/damon/paddr: export damon_pa_* ops for IBS module
> > > > Patch 3. mm/damon/core: replace mutex-protected report buffer
> > > > with per-CPU lockless ring
> > > > Patch 4. mm/damon/core: flat-array snapshot + bsearch in ring-
> > > > drain loop
> > > > Patch 5. mm/damon: add sysfs binding and dispatch hookup for
> > > > paddr_ibs operations
> > > > Patch 6. mm/damon/core: accept paddr_ibs in node_eligible_mem_bp
> > > > ops check
> > > > Patch 7. mm/damon/damon_ibs: add AMD IBS-based access sampling
> > > > backend
> > > >
> > > > Patches 1, 3, and 4 are general infrastructure that benefits any
> > > > consumer of damon_report_access(). Patches 2, 5, 6, and 7 are the
> > > > worked-example backend (paddr_ibs ops, sysfs binding, IBS module).
> > >
> > > I didn't read the detailed code of each patch. But my high level understanding
> > > is as below.
> > >
> > > Patches 1 and 2 are needed for supporting loadable module-based DAMON operation
> > > sets (access sampling backend).
> > >
> > > Patch 3 is needed for supporting access check primitives that can provide the
> > > > access information only in NMI context. It can also speed up the access
> > > reporting in general, though.
> > >
> > > Patch 4 makes DAMON's internal reported access information retrieval faster, so
> > > will help any reporting-based DAMON operation set use case.
> > >
> > > Patches 5-7 are required for only the IBS-based DAMON operations set
> > > (paddr_ibs).
> > >
> > > So I agree patch 4 is a general infrastructure improvement that benefits
> > > multiple use cases.
> > >
> > > Patch 3 is also arguably general infrastructure improvement, as it will make
> > > the reporting faster in general.
> > >
> > > Patch 1 is not technically coupled with paddr_ibs, and will be needed for
> > > > general loadable-module-based access check primitives. But, should we support
> > > > loadable modules? If so, why?
> > >
> > > Patch 2 is also not technically coupled with paddr_ibs, to my understanding, so
> > > > should be categorized together with patch 1? In other words, if we agree we
> > > > should support loadable-module-based DAMON operation sets, this should be
> > > > useful for not only paddr_ibs but more general cases.
> > >
> > > Correct me if I'm wrong.
> > >
> > > >
> > > >
> > > > Patches worth folding into damon/next
> > > > =====================================
> > > >
> > > > Patches 1, 3, and 4 are not specific to IBS or to this RFC's
> > > > backend. Each is preparatory infrastructure that any consumer of
> > > > damon_report_access() will need:
> > > >
> > > > - Patch 1 (refcount ops owner) -- any modular ops set, including
> > > > out-of-tree backends, needs clean module unload to avoid UAF
> > > > on damon_unregister_ops.
> > > > - Patch 3 (per-CPU lockless ring) -- damon_report_access() cannot
> > > > be called from NMI context with the current mutex-protected
> > > > buffer. Hardware samplers all need NMI-safe submission.
> > > > - Patch 4 (flat-array snapshot + bsearch drain) -- the linear-
> > > > scan drain is O(reports x regions) and exceeds the sample
> > > > interval at high-CPU x large-region products. Bsearch brings
> > > > it to O(reports x log regions).
> > > >
> > > > If these belong directly on damon/next as preparatory patches for
> > > > damon_report_access() rather than living inside an IBS-specific
> > > > track, we are happy to rebase and resend them that way.
> > >
> > > > So I'm a bit unsure about patch 1. If we don't have a plan to support
> > > > loadable-module-based DAMON operation sets, we might not need it for now.
> > >
> > > > For patches 3 and 4, I agree those will be useful in general. Nonetheless, I'd
> > > > slightly prefer to do those optimizations in the later part of the long term
> > > > project.
> > >
> > > >
> > > >
> > > > Relation to prior and ongoing work
> > > > ==================================
> > > >
> > > > The IBS sampling pattern in patch 7 -- attr.config=0 to use IBS Op
> > > > default config, dc_phy_addr_valid filter, NMI-safe sample submission
> > > > -- is derived from concepts in Bharata B Rao's pghot RFC v5 [3].
> > > > The attribution header is in mm/damon/damon_ibs.c and the patch
> > > > carries a Suggested-by: trailer.
> > > >
> > > > Bharata's pghot v7 [4] introduces a different IBS driver targeting
> > > > the new IBS Memory Profiler (IBS-MProf) facility, which Bharata
> > > > describes as a facility "that will be present in future AMD
> > > > processors" -- a separate IBS instance from the one this RFC's
> > > > > backend uses. The driver in this series, based on v5 [3], is an
> > > > > example of how DAMON can benefit from the AMD IBS hardware
> > > > > source, and independently validates the importance of IBS information.
> > > > It is not meant to be merged in the current form.
> > > > @Bharata if you see a path where IBS samples can be consumed
> > > > by DAMON at some point, will be happy to collaborate.
> > > >
> > > > Akinobu Mita's perf-event-based access-check RFC [5] explores a
> > > > configurable perf-event-driven access source for DAMON. IBS has
> > > > vendor-specific MSR setup beyond what perf_event_attr alone
> > > > expresses (e.g. dc_phy_addr_valid filtering on the produced sample,
> > > > not on the perf attr), so the IBS path here appears complementary
> > > > to [5] -- operators choose based on whether their hardware sampler
> > > > fits stock perf or needs additional kernel-side setup.
> > >
> > > So apparently there are multiple approaches to develop and use h/w-based access
> > > monitoring. Akinobu and you are trying to do that using DAMON as the frontend,
> > > and already made the working prototypes. There were more people who showed
> > > interest and will to contribute to this project other than you, too. I 100%
> > > > agree h/w-based access monitoring can be useful, and I of course think using
> > > > DAMON as the frontend is the right approach. I'm all for making this
> > > upstreamed.
> > >
> > > I was therefore spending time on thinking about in what long-term maintainable
> > > shape this capability can successfully be upstreamed. I suggested
> > > damon_report_access() as the internal interface between DAMON and the h/w-based
> > > access check primitives, and apparently we all (I, Ravi and Akinobu in this
> > > > context) agreed. Akinobu thankfully revised his implementation based on
> > > damon_report_access() interface. Ravi also implemented this RFC based on the
> > > interface.
> > >
> > > After making the consensus with Akinobu, I was taking time on the user space
> > > interface. When I was discussing with Akinobu, my idea was extending the user
> > > interface for the page faults based monitoring v3 [1]. But, recently I decided
> > > to make this more general, so proposed data attributes monitoring extension [2]
> > > at LSFMMBPF. The patch series for the initial change [3] is merged into mm-new
> > > for more testing, today. The cover letter of the patch series is also sharing
> > > how it will be extended for h/w based access monitoring in long term.
> > >
> > > > I of course want us to go in this direction. I believe you already had chances
> > > > to take a look at the long term plan and didn't speak up because you
> > > > don't strongly disagree with the plan. If not, please speak up.
> > >
> > Hi SJ,
> >
> > One layering question I'd like to flag before the plan is written,
> > since it affects how this RFC's substrate slots in:
>
> To my understanding, this RFC reuses the damon_report_access() infrastructure
> that is shared with the per-CPUs/threads/writes/reads monitoring series [1]. My
> plan at the moment is to keep using it. So from a high level view, I think the
> final picture would not be really different from this RFC.
>
Hi SJ,

Sorry for the delayed reply; I was away for a couple of days.

This resolves the layering question for me.

> >
> > In [3], .apply_probes is a periodic per-region classifier driven
> > from kdamond_fn after .check_accesses, in process context, that
> > applies a (folio -> bool) predicate to each region's sampling_addr
> > and accounts the results in r->probe_hits[]. damon_report_access()
> > on the other hand is a per-event delivery callback into a per-CPU
> > buffer, called from the access source (NMI for IBS / PEBS / SPE,
> > process context for page-fault-based sources). These appear to
> > me to sit at different layers - delivery vs. classification.
> >
> > The reason I want to confirm this: NMI context for HW samplers
> > precludes the operations .apply_probes can do today (no mutex, no
> > kmalloc, no sleep, no folio lookup that touches pte_lock). And
> > the data shape is inverted - .apply_probes asks "does region R's
> > sampling_addr have attribute A?", evaluated on the kdamond-chosen
> > address; an HW sample announces "PA Y was accessed at retirement
> > time T", arriving asynchronously and needing to find the region
> > it falls into. If access events end up routed through
> > .apply_probes in the long-term plan, the IBS / PEBS / SPE
> > backends would each need a deferral path under it (per-CPU ring
> > for NMI-safe submission, region mapping at drain time).
>
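
The "region mapping at drain time" step mentioned above can be sketched in
userspace C as below. This is only a minimal sketch, not the code in patch 4:
struct region_snap, find_region() and drain_reports() are hypothetical names,
the regions are assumed to be sorted and non-overlapping, and every report is
counted as a hit instead of applying DAMON's per-interval accounting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flat snapshot entry covering the half-open range [start, end). */
struct region_snap {
	uint64_t start;
	uint64_t end;
	unsigned int hits;	/* simplified counter, not DAMON's nr_accesses */
};

/* bsearch() comparator: the key is an address, the element is a region. */
static int addr_vs_region(const void *key, const void *elem)
{
	uint64_t addr = *(const uint64_t *)key;
	const struct region_snap *r = elem;

	if (addr < r->start)
		return -1;
	if (addr >= r->end)
		return 1;
	return 0;
}

/*
 * Map one reported address to its region in O(log nr) comparisons.
 * regions[] must be sorted by start and must not overlap.
 * Returns NULL when the address falls into a gap or outside all regions.
 */
static struct region_snap *find_region(struct region_snap *regions,
				       size_t nr, uint64_t addr)
{
	return bsearch(&addr, regions, nr, sizeof(*regions), addr_vs_region);
}

/* Account each report to its region; return how many reports matched. */
static size_t drain_reports(struct region_snap *regions, size_t nr_regions,
			    const uint64_t *reports, size_t nr_reports)
{
	size_t i, matched = 0;

	for (i = 0; i < nr_reports; i++) {
		struct region_snap *r = find_region(regions, nr_regions,
						    reports[i]);

		if (r) {
			r->hits++;
			matched++;
		}
	}
	return matched;
}
```

With R regions and N reports this costs O(N x log R) comparisons, against
O(N x R) for a linear scan over the regions per report.
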
> It will not be routed through .apply_probes, but work in a way similar to the
> damon_report_access() based design. That is, each (sampled) access event will
> synchronously call damon_report_access() with the access information. The
> information is stored in DAMON's internal data structure. The information will
> contain the access destination address, the accessor CPU/thread, whether it was
> reads or writes etc, if available.
>
> Then kdamond will read the reports in the data structure once per sampling
> interval and assess if each region got accessed or not since the last sampling
> interval.
>
> So my plan is not to reuse .apply_probes, but in terms of who consumes the
> information, it is not very different. Accessor will produce the information
> (report), and kdamond will consume those. But this is how the
> damon_report_access() based structure works, so my understanding is
> that your RFC is also not very different. Am I missing something, or do you
> have any concern on this structure?
>
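
The produce/consume structure described above can be sketched in userspace C
as one single-producer, single-consumer ring per CPU. This is a minimal sketch
under stated assumptions, not the code in patch 3: struct access_report,
struct report_ring, ring_submit() and ring_drain() are hypothetical names,
C11 atomics stand in for the kernel primitives, and exactly one producer
context per ring is assumed (a producer interrupted by another producer on
the same CPU would need extra handling).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u	/* power of two, so index masking works */

/* Hypothetical report payload; a real report may carry more fields. */
struct access_report {
	uint64_t addr;
	bool is_write;
};

/* One ring per producer CPU: single producer, single consumer. */
struct report_ring {
	_Atomic unsigned int head;	/* written by the producer only */
	_Atomic unsigned int tail;	/* written by the consumer only */
	struct access_report slot[RING_SLOTS];
};

/*
 * Producer side: takes no lock, never sleeps, never allocates.
 * Returns false and drops the report when the ring is full.
 */
static bool ring_submit(struct report_ring *ring, struct access_report rep)
{
	unsigned int head = atomic_load_explicit(&ring->head,
						 memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&ring->tail,
						 memory_order_acquire);

	if (head - tail == RING_SLOTS)
		return false;
	ring->slot[head & (RING_SLOTS - 1)] = rep;
	/* Publish the slot contents before the new head becomes visible. */
	atomic_store_explicit(&ring->head, head + 1, memory_order_release);
	return true;
}

/*
 * Consumer side: meant to run once per sampling interval. Copies up to
 * max pending reports into out[] and returns how many were copied.
 */
static size_t ring_drain(struct report_ring *ring,
			 struct access_report *out, size_t max)
{
	unsigned int tail = atomic_load_explicit(&ring->tail,
						 memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&ring->head,
						 memory_order_acquire);
	size_t n = 0;

	while (tail != head && n < max)
		out[n++] = ring->slot[tail++ & (RING_SLOTS - 1)];
	/* Hand the consumed slots back to the producer. */
	atomic_store_explicit(&ring->tail, tail, memory_order_release);
	return n;
}
```

Dropping on a full ring keeps the producer path bounded; whether dropped
samples should be counted somewhere is a separate design question.
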
> >
> > Happy to be wrong here if you see a unified shape that handles
> > both - just want to surface the constraint before the plan is
> > written.
> >
> > On the loadable-module question for patches 1 and 2: agreed it's a
> > genuinely open architectural call, not just a paddr_ibs convenience.
> >
> > - paddr_ibs (this RFC) targets the existing IBS Op facility on
> > Zen 3+ silicon via the perf event subsystem and uses a
> > vendor-specific overflow-handler filter that perf_event_attr
> > cannot express (dc_phy_addr_valid in IBS_OP_DATA3). Bharata's
> > pghot v7 [pghot-v7] introduces a separate IBS driver targeting
> > the new IBS-MProf facility on future AMD silicon via direct MSR
> > programming - not perf at all. These are two AMD-specific HW
> > samplers with non-overlapping silicon coverage and
> > non-overlapping kernel paths. A distro shipping a single kernel
> > image to a fleet with mixed silicon needs runtime-selectable
> > backends, which obj-y can't do across exclusive `depends on`
> > chains.
> > - Akinobu's perf-event RFC v3 [akinobu-v3] is a useful contrast:
> > it stays builtin because it's a generic configurable
> > perf_event_attr passthrough, no vendor-specific code in the
> > overflow handler. The tristate case is specifically for the
> > backends that need vendor logic outside perf_event_attr
> > (IBS dc_phy_addr_valid, future ARM SPE record-format
> > handling, future Intel PEBS DLA quirks if they need
> > kernel-side filtering beyond what perf delivers).
>
> I'm still not familiar with IBS and perf events. Please bear with me.
> My understanding is that there are vendor-specific knobs for IBS that perf
> event is not supporting. So far, that makes sense. And are you saying that
> you have to write paddr_ibs as a loadable module if you want to support the
> vendor-specific knobs? If I'm understanding you correctly, could you further
> share why it cannot be done as a builtin module?
>
This RFC's IBS backend does not claim exclusive use of IBS.

The reason patch 7 ships as tristate is precedent and reuse: Bharata's
pghot v5 posted its IBS driver as a module, and I matched that shape
during development so the two consumers could potentially share IBS
plumbing instead of duplicating it. Patches 1 and 2 exist to support
loadable ops modules generally.

More clarity on the right shape may emerge once Bharata comments on
the questions below. If pghot ends up being the sole in-tree consumer
of legacy IBS Op, the loadable-ops question can be deferred until a
second primitive actually needs it: patches 1 and 2 can be dropped, and
patch 7 made obj-y or folded in. If both pghot and DAMON consume IBS,
the modular shape earns its keep.

Bharata - can you please weigh in on the long-term home for legacy IBS
Op (under pghot alongside IBS-MProf [pghot-v7] vs. a DAMON-side backend)?
Related: any thoughts on the operator-facing knob to pick between
consumers when both are present? Happy to align.

Thanks,
Ravi

> [1] https://lore.kernel.org/20251208062943.68824-1-sj@xxxxxxxxxx
>
>
> Thanks,
> SJ
>
> [...]