Re: [PATCH RFC v5 12/18] riscv_cbqri: resctrl: Add L3 cache occupancy monitoring

From: Drew Fustini

Date: Tue May 26 2026 - 07:50:42 EST

On Mon, May 25, 2026 at 12:46:06AM +0000, sashiko-bot@xxxxxxxxxx wrote:
> > resctrl_arch_reset_rmid() re-arms CONFIG_EVENT with EVT_ID=Occupancy.
> > CONFIG_EVENT both resets the counter to 0 and selects the event, so
> > re-arming with the same event keeps the MCID counting after reset rather
> > than relying on sticky-last-event semantics that the CBQRI register
> > definition does not guarantee.
>
> Does the resctrl core actually invoke resctrl_arch_reset_rmid() for Occupancy
> events?
>
> The resctrl core only invokes resctrl_arch_reset_rmid() individually for MBM
> events during mon_event_read() (when rr->first is true). For
> QOS_L3_OCCUP_EVENT_ID, mon_event_read() is never called with first = true,
> and the recycle path (__check_limbo()) only reads the counter without
> resetting.
>
> Will this cause the Occupancy counter to stop functioning when an RMID is
> recycled since it is never re-armed?

I will drop the Occupancy case from resctrl_arch_reset_rmid() and update
the commit message. resctrl_arch_reset_rmid_all() will still handles
bulk resets.

> > static struct cbqri_resctrl_res cbqri_resctrl_resources[RDT_NUM_RESOURCES];
> >
> > +/*
> > + * Per-event controller table. Only events CBQRI can back occupy a
> > + * slot, so other events do not bloat the array.
> > + */
> > +#define CBQRI_MAX_EVENT QOS_L3_OCCUP_EVENT_ID
> > +static struct cbqri_controller *cbqri_resctrl_counters[CBQRI_MAX_EVENT + 1];
>
> Is this array necessary?
>
> The patch introduces this per-event controller array and populates it in
> cbqri_resctrl_pick_counters(), but it appears to be unused.
> resctrl_arch_rmid_read() dynamically resolves the controller via
> cbqri_find_ctrl_domain() instead.

I will drop cbqri_resctrl_counters[] and the corresponding
cbqri_resctrl_pick_counters() write. resctrl_arch_rmid_read() finds the
controller via cbqri_find_ctrl_domain() and hw_dom->paired_bc.

> > @@ -500,6 +630,14 @@ static int cbqri_resctrl_control_init(struct cbqri_resctrl_res *cbqri_res)
> > res->alloc_capable = ctrl->alloc_capable;
> > INIT_LIST_HEAD(&res->ctrl_domains);
> > INIT_LIST_HEAD(&res->mon_domains);
> > +
> > + if (ctrl->mon_capable && res->rid == RDT_RESOURCE_L3) {
> > + res->mon_scope = RESCTRL_L3_CACHE;
> > + res->mon.num_rmid = ctrl->mcid_count;
>
> [Severity: Medium]
> Will this expose an inflated number of RMIDs to userspace if subsequent
> controllers have fewer MCIDs?
>
> cbqri_resctrl_accumulate_caps() calculates max_rmid as the system-wide
> minimum mcid_count across all controllers, which bounds internal
> resctrl allocations.
>
> However, res->mon.num_rmid is initialized here using only the first
> controller's mcid_count. If a subsequent controller has a smaller mcid_count,
> could this cause userspace to receive unexpected "Out of RMIDs" errors when
> creating groups?

I will fix this bug in the next revision.

> > +static void cbqri_detach_cpu_from_l3_mon(struct rdt_resource *res,
> > + unsigned int cpu)
> > +{
> > + struct rdt_l3_mon_domain *mon_dom, *tmp;
> > +
> > + lockdep_assert_held(&cbqri_domain_list_lock);
> > +
> > + list_for_each_entry_safe(mon_dom, tmp, &res->mon_domains, hdr.list) {
> > + if (!cpumask_test_cpu(cpu, &mon_dom->hdr.cpu_mask))
> > + continue;
> > + cpumask_clear_cpu(cpu, &mon_dom->hdr.cpu_mask);
> > + if (cpumask_empty(&mon_dom->hdr.cpu_mask)) {
> > + cancel_delayed_work(&mon_dom->cqm_limbo);
> > + if (resctrl_is_mon_event_enabled(QOS_L3_MBM_TOTAL_EVENT_ID))
> > + cancel_delayed_work(&mon_dom->mbm_over);
> > + resctrl_offline_mon_domain(res, &mon_dom->hdr);
> > + list_del(&mon_dom->hdr.list);
> > + kfree(mon_dom);
>
> [Severity: High]
> Can this cause a use-after-free if the delayed work is executing?

Yes, I need to fix this in the next rev.

Drew