Re: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed

From: Reinette Chatre

Date: Tue May 26 2026 - 17:05:56 EST

Hi Tony,

On 5/26/26 11:27 AM, Luck, Tony wrote:
> On Tue, May 26, 2026 at 10:53:59AM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 5/26/26 8:32 AM, Luck, Tony wrote:
>>
>> Instead of deleting the patch without any comments, could you please provide some
>> insight to what problems it has and why your solution is better?
>>
>>> On Fri, May 22, 2026 at 12:15:13PM -0700, Reinette Chatre wrote:
>>>> - Adding a reference count to the domain structure to avoid the worker
>>>> needing to take CPU hotplug lock. This ended up being very complicated
>>>> with the architecture needing new APIs to manage the reference count
>>>> which cannot cleanly integrate into MPAM since it uses a single
>>>> architecture domain structure to contain both the control and monitoring
>>>> domain structures. Managing the references across mount, unmount,
>>>> online, offline, as well as worker self exit resulted in several
>>>> asymmetrical and complicated paths that were error prone. Locking also
>>>> proved to be complicated since architecture would need to initiate
>>>> domain free that will need to call back into resctrl that will take
>>>> rdtgroup_mutex which means that references need to be taken/released
>>>> without locking.
>>>
>>> I'd been working on a reference count approach too. The MPAM combined
>>> domain for control and monitoring doesn't seem insurmountable. Mostly
>>
>> While technically possible I do not think it is a clean solution to have
>> the lifetime of the control domain be controlled with a reference in the
>> monitoring domain.
>>
>>> because it seems unlikely that the problem with worker threads would
>>> ever apply to control domains. Maybe I missed something, but just adding
>>> an architecture *release() function that can be used by file system code
>>> to drop reference counts on the domain when worker threads exit seems
>>> enough.
>>
>> Did you consider the locking implications that I mention in the description
>> you quoted? More below ...
>
> I didn't run into any lockdep splats during testing. But maybe didn't
> have enough code coverage to poke into corner cases.

Your patch avoids lockdep splats by using *resctrl fs* domain reference to protect
only the *architecture* domain data that can be removed without rdtgroup_mutex while
removing the resctrl fs domain data unconditionally. I do not find this to be a sane
usage of the reference count. Since the reference is attached to the resctrl fs domain
(rdt_l3_mon_domain::kref) then I find it reasonable to expect it to protect
struct rdt_l3_mon_domain's data. Could you please elaborate why you disagree?

>>
>>>
>>> My patch below.
>>
>> heh
>>
>>>
>>> -Tony
>>>
>>>
>>> From 611fd8ad816abd37ef9a65b39175ce05907a1d41 Mon Sep 17 00:00:00 2001
>>> From: Tony Luck <tony.luck@xxxxxxxxx>
>>> Date: Thu, 21 May 2026 15:14:27 -0700
>>> Subject: [PATCH] fs,mpam,x86/resctrl: Track reference count for L3 monitor
>>> domains
>>>
>>> There are race conditions[1] when the last CPU of a domain is taken offline
>>> and a worker thread may access the domain structure after it is freed.
>>>
>>> Add a rdt_l3_mon_domain::kref to track users of the domain. Don't try
>>> to cancel worker threads when CPUs are taken offline. Just set the
>>> target CPU for the thread to nr_cpu_ids to indicate the worker needs
>>> to take action next time it runs.
>>
>> One test I have found to be useful when digging into this is to offline all CPUs
>> of a domain starting with lowest number. Since overflow worker runs on lowest number
>> and then is moved to next CPU when it goes offline this stresses this new mechanics.
>> Have you tried something similar or could you try this test with this solution?
>
> Yes. My test case ran through all CPUs in the domain in order. That
> showed an interesting artifact that when CPU 36 goes offline, the worker
> next runs on CPU37, which is the place it needs to be, but it isn't
> bound to CPU37. But since d->mbm_work_cpu (or d->cqm_work_cpu) is set
> to nr_cpus_id the worker calls schedule_delayed_work_on() to get itself
> properly bound to CPU37.
>>
>> ...
>>
>>> @@ -680,18 +692,16 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
>>>
>>> switch (r->rid) {
>>> case RDT_RESOURCE_L3: {
>>> - struct rdt_hw_l3_mon_domain *hw_dom;
>>> struct rdt_l3_mon_domain *d;
>>>
>>> if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
>>> return;
>>>
>>> d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
>>> - hw_dom = resctrl_to_arch_mon_dom(d);
>>> resctrl_offline_mon_domain(r, hdr);
>>> list_del_rcu(&hdr->list);
>>> synchronize_rcu();
>>> - l3_mon_domain_free(hw_dom);
>>> + kref_put(&d->kref, resctrl_arch_l3_mon_domain_release);
>>> break;
>>> }
>>
>> To me the idea behind a "domain reference count" is to provide guarantee to any holder of
>> a reference that the domain *and* its data remains accessible while it holds the reference.
>> There is the domain structure itself and then the architecture specific, for example,
>> rdt_hw_l3_mon_domain::arch_mbm_states, and the fs state, for example rdt_l3_mon_domain::mbm_states.
>
> Agreed. I'm not guaranteeing that the whole of the domain structure
> (with all the substructures that it points to) are valid. This reference
> only promises that the mbm_over, cqm_limbo, mbm_work_cpu, and cqm_work_cpu
> fields are still valid. That's enough to solve this issue, but if we
> were to adopt this patch would need some comments so that future readers
> were not led astray.

Adding reference counting to the domain structure that does not actually protect the
domain structure but instead introduces additional corner cases created just to fix one
issue is not something I am comfortable with. I also do not see adding comments to
explain all the sharp corners as "fixing" it.

>
>> Above snippet moves the freeing of the *architecture* state to be called on kref_put()
>> while *always* freeing the fs state (resctrl_offline_mon_domain()->domain_destroy_l3_mon_state()).
>>
>> A worker may thus have a reference to the domain but when it runs it runs without
>> fs state which is just a new use-after-free.
>
> For this specific case, the worker sees "nr_cpus_id" and avoids use of
> other fields that may no longer be valid.

This does not sound like proper reference counting to me.

>
>> As I mentioned in the description the release managed by architecture implies that
>> reference needs to be dropped without rdtgroup_mutex held since the architecture
>> should also call resctrl_offline_mon_domain() as part of release. Locking needs
>> to be reworked and needs to adhere to kref rules on kref_get()/kref_put() without
>> locking.
>
> That's all handled in the existing (unchanged) offline path. The worker
> threads only need to call kref_put() to release their hold on the domain
> structure that has been mostly freed (and likely has no hold from the
> file system by this point).
>
> The release functions in both X86 and MPAM don't need locks, they are
> just calling kfree()

Right. Not needing locks is possible if the reference does not actually protect
domain state.

... and just ignore the design comments.

>>
>> Reinette
>
> I see that sashiko found no issues in parts 1-5 of your series (Yay!)
> Grumbled about some issues in part 6. And then gave up before getting to
> your latest solution to the domain offline problem.
>
> Can you repost just part 9 (or maybe parts 7-9) on top of latest -rc to
> see if sashiko is happy at last?

... and then will you take a look?

Reinette