Re: [RFC] mpam,x86,fs/resctrl: Generic schema description Proof of Concept

From: Reinette Chatre

Date: Thu Jun 04 2026 - 13:43:23 EST


Hi Ben,

On 6/3/26 8:15 AM, Ben Horgan wrote:
> Hi Reinette,
>
> On 5/29/26 19:06, Reinette Chatre wrote:
>> Hi Everybody,
>>
>> It has been a while since we discussed the resctrl changes required to support
>> hardware that has controls with fine granularity or hardware that has multiple
>> controls per resource. For reference, the most recent email discussion can
>> be found at [1] with a summary of discussions in last year's plumbers slides [2].
>>
>> I created a PoC that I believe supports what folks have agreed to so far. I
>> hope this can help us to restart the discussion with the goal that resctrl gains
>> support for upcoming hardware that require these features.
>
> Thank you very much for doing this work. I believe this will be very useful for
> MPAM and other architectures.

Thank you very much for reviewing this.

>
>>
>> Request regarding this PoC
>> ==========================
>>
>> Please consider this PoC as a "direction check" on the schema description and multiple
>> control discussions held thus far.
>>
>> Could folks working on enabling new hardware requiring this capability please consider
>> if this is something you can build on and how it should be improved to support these
>> upcoming capabilities?
>>
>> Opens
>> =====
>>
>> While the PoC aims to support what folks agreed on some opens remain:
>> - I attempted to make some MPAM supporting changes but these are all just compile
>> tested. While MPAM should benefit from the new control properties I did not
>> initialize them on MPAM and did not attempt refactor to separate out
>> the architecture specific control properties (more on what this means later).
>> I did attempt some MPAM refactoring that duplicates the MPAM domain to the
>> control domain and monitoring domain lists in support of there being multiple
>> controls each with its own list of control domains but it is definitely not good
>> design.
>
> I appreciate you including MPAM in this PoC. With this one line change I was
> able to boot an MPAM system and mount resctrl which appears to behave correctly.
> We can consider how best to do the code design later.
>
> --- a/drivers/resctrl/mpam_resctrl.c
> +++ b/drivers/resctrl/mpam_resctrl.c
> @@ -1697,6 +1697,8 @@ int mpam_resctrl_setup(void)
>
> /* Initialise the resctrl structures from the classes */
> for_each_mpam_resctrl_control(res, rid) {
> + INIT_LIST_HEAD(&res->resctrl_res.controls); // list_empty needs to work
> +
> if (!res->class)
> continue; // dummy resource
>

Thank you very much. I picked this up but ended up moving it earlier, next to
the mon_domains list initialization, which made the initialization easier for me
to follow. Is that ok?

> I plumbed in support for the MB_MIN resource schema which also works under light
> testing. The only fs resctrl code change I needed was:
>
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -483,6 +483,9 @@ static inline u32 resctrl_get_default_ctrlval(struct resctrl_ctrl *ctrl)
> case RESCTRL_CTRL_BITMAP:
> return BIT_MASK(ctrl->cache.cbm_len) - 1;
> case RESCTRL_CTRL_SCALAR:
> + if (ctrl->name == RESCTRL_CTRL_NAME_MIN)
> + return ctrl->membw.min_bw;
> +
> return ctrl->membw.max_bw;
> }
>
>
> At least on MPAM systems, we use a default of 0 for minimum bandwidth controls
> as the maximum bandwidth controls only take effect if their value is higher than
> the minimum bandwidth value. I have specialised this on the ctrl->name which
> breaks your ctrl->type based classification but that's fixable by just adding a
> default field to membw.

This I am not sure about. In my understanding a typical "default" value means
"no throttling" and, at least on Intel, this default hardware state has been
summarized as "min" == "max" == "optimal".

Are you saying that on MPAM systems if "min" == "max" then max bandwidth controls
do not take effect? Could you please elaborate what happens if "min" == "max"?
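
To make the "default field" alternative concrete while we sort out the semantics,
below is a minimal user-space sketch. The types are trimmed stand-ins for the PoC
structures and "default_bw" is a hypothetical member that does not exist in the
PoC. It only shows the default being selected by control type instead of by
control name.

```c
#include <assert.h>
#include <stdint.h>

/* Trimmed stand-ins for the PoC types; only the members needed here. */
enum resctrl_ctrl_type { RESCTRL_CTRL_BITMAP, RESCTRL_CTRL_SCALAR };

struct resctrl_cache {
	unsigned int cbm_len;
};

struct resctrl_membw {
	uint32_t min_bw;
	uint32_t max_bw;
	uint32_t default_bw;	/* hypothetical: set by the arch per control */
};

struct resctrl_ctrl {
	enum resctrl_ctrl_type type;
	union {
		struct resctrl_cache cache;
		struct resctrl_membw membw;
	};
};

/* Default is chosen from the control type only, never from the name. */
static uint32_t get_default_ctrlval(const struct resctrl_ctrl *ctrl)
{
	assert(ctrl);

	switch (ctrl->type) {
	case RESCTRL_CTRL_BITMAP:
		/* All cbm_len bits set; 64-bit shift keeps cbm_len == 32 safe. */
		return (uint32_t)((UINT64_C(1) << ctrl->cache.cbm_len) - 1);
	case RESCTRL_CTRL_SCALAR:
		return ctrl->membw.default_bw;
	}
	return 0;
}
```

In this sketch an architecture that wants a minimum control to default to 0
would set default_bw = 0 for that control, while a maximum control would set
default_bw = max_bw.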

>> - No support for emulated controls (yet). The PoC is quite large already
>> but I think it can be used as a base for emulated controls for which the software
>> controller could be a potential first customer. In this PoC mounting with
>> software controller will still display the original controller's properties.
>> - One open that needs to be addressed as part of support for emulated controls is
>> how best to display emulation relationship via resctrl hierarchy.
>
> What does emulated controls mean here? Is there some previous discussion you
> could point me at?

For emulated controls in context of MPAM I think the best reference is
https://lore.kernel.org/lkml/aPJP52jXJvRYAjjV@xxxxxxxxxxxxxxx/

Above is the email discussion that I attempted to visualize in the middle example in slide 6
("resctrl controls vs. hardware controls") of
https://lpc.events/event/19/contributions/2093/attachments/1958/4172/resctrl%20Microconference%20LPC%202025%20Tokyo.pdf

When comparing the slide to Dave's text, please replace "MB_HW" from Dave's example
with "MB_OPT" in the slide. I changed the name since I found the "HW" in an
emulated control to be potentially confusing.


>> - No support for "read-modify-write" usage of schemata file. This is where we
>> discussed (without agreement) on possibly introducing the "#" prefix to schemata
>> file entries. This PoC does not support this prefix and the current assumption/expectation
>> is that when user space changes a configuration only the new control values are
>> written to schemata file. I thus do not have a plan to support this so please
>> share opinions in this regard if you have some.
>
> There is now less motivation from the MPAM side for this than when this was
> initially discussed. In pre-upstream versions of the MPAM patches a change in
> the MB resource control value would change both the mpam h/w mbw_min and mbw_max
> values but now (on non-broken h/w) we just change the mbw_max. (mbw_min kept at 0).

Ah, thanks for the correction. The email I linked above indeed refers to changing
both min and max.

>
> However, it would be useful not to be limited by percentages. In my quick

Indeed. Not being limited by percentages while still needing to have a backward
compatible user interface is how we ended up with "emulated controls".

> experimentation with your patches I used a percentage value for MB_MIN but it
> would be best to move away from this. For new controls I think we can mandate
> that user space has to discover the resolution from the info directly but how
> can we retrofit this. For MPAM, MB and MB_MAX, would control the same things.
> Could we just add MB_MAX with a h/w friendly scale and then reflect changes in
> MB_MAX in MB and vice versa with MB taking precedence if both are set? Old
> software can continue setting MB can move to using MB_MAX and take advantage of
> the improved control. (I don't think we should expose the MPAM hardware value
> directly as it has confusion over whether all 1s is 100% or not and we'd like to
> have something generic and friendly to the user.)

Sounds to me as though you are describing emulated controls. Exposing two
controls in the schemata file that essentially control the same thing is what
emulated controls aim to solve, and the resctrl hierarchies presented in slide #6
of that presentation (and discussed in the email thread) are how we contemplated
representing the relationship among these controls to user space. So, considering
your example, resctrl may display something like:

info/
└── MB/
    └── resource_schemata/
        └── MB/
            └── MB_MAX/

The above hierarchy tells user space that changing MB will impact MB_MAX and
vice versa.

The one open I am aware of surrounding emulated controls is how to present some
semblance of consistency to user space when considering all the possibilities
the different architectures (and even within architectures) may have.


>> - Controls are independent for now. This means that, for example, if a resource
>> supports a "MIN" and "MAX" control then this implementation would allow user to
>> set the "maximum" control values to be less than the "minimum" control values.
>
> I think this is ok as long as adding support for new controls in resctrl doesn't
> change the existing behaviour. In MPAM we dodged this by introducing MB as only
> affecting the h/w mbw_max and not mbw_min (as mentioned above).

I understand this to be a requirement for Intel where the spec contains "The Maximum Cap
should be programmed to be greater than or equal to the Minimum and Optimal caps.
Undesirable and undefined performance effects may result if cap programming guidelines
are not followed."

I am currently thinking that resctrl should not try to be too smart here and if user
space wants to make dramatic changes to min and max values then it should just ensure
the ordering is appropriate. For example, attempting to set a new min to be larger than
the old max would fail and user space should first increase the old max and then set
a new min.
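
A rough sketch of that ordering rule follows. This is not what the PoC does today
(controls are independent there); "struct bw_limits", "set_min()" and "set_max()"
are hypothetical names used purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-domain state holding both control values. */
struct bw_limits {
	uint32_t min;
	uint32_t max;
};

/* Reject a new minimum that would exceed the current maximum. */
static int set_min(struct bw_limits *l, uint32_t new_min)
{
	assert(l);

	if (new_min > l->max)
		return -EINVAL;
	l->min = new_min;
	return 0;
}

/* Reject a new maximum that would drop below the current minimum. */
static int set_max(struct bw_limits *l, uint32_t new_max)
{
	assert(l);

	if (new_max < l->min)
		return -EINVAL;
	l->max = new_max;
	return 0;
}
```

Starting from min=10 and max=50, set_min(70) fails with -EINVAL and leaves the
state untouched. Calling set_max(80) first and then set_min(70) succeeds, which
is the ordering user space would be expected to follow.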

>
>> - PoC supports the "bitmap" control but does not (yet) expose properties of a bitmap
>> control to the new info/<resource>/resource_schemata directory.
>>
>> Accessing PoC
>> =============
>>
>> Please consider the PoC as a rough draft. It has only been compile tested for Arm
>> and known to be incomplete in Arm support. To help with experimenting I only
>> fully adapted the Intel MBA resource to demo two dummy additional MBA controls.
>> All architectures should immediately benefit from the new schema descriptions
>> and new info/MB/resource_schemata hierarchy.
>>
>> I considered the patches self too many for email. Instead, the PoC can be found at:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/reinette/linux.git branch resctrl/controls_rfc_v1
>>
>> The work is based on v7.1-rc2 plus the following series (two of which have
>> since been queued):
>>
>> "selftests/resctrl: Fixes and improvements focused on Intel platforms"
>> https://lore.kernel.org/lkml/cover.1775266384.git.reinette.chatre@xxxxxxxxx/
>>
>> "x86,fs/resctrl: Improve resctrl quality and consistency"
>> https://lore.kernel.org/lkml/cover.1777419024.git.reinette.chatre@xxxxxxxxx/
>>
>> "x86,fs/resctrl: Pave the way for MPAM counter assignment"
>> https://lore.kernel.org/lkml/20260506082855.3694761-1-ben.horgan@xxxxxxx/
>>
>>
>> Primary resctrl fs data structure changes
>> =========================================
>>
>> Introduces a control represented by struct resctrl_ctrl that looks as below. To make
>> the changes easier to follow I kept some of the original names to help communicate
>> where familiar data structures land.
>>
>> What to notice about a control is that it has some common properties required
>> from all controls (scope, type, etc.) and then depending on the type of control
>> (RESCTRL_CTRL_BITMAP or RESCTRL_CTRL_SCALAR) there are type specific properties.
>>
>> /**
>> * struct resctrl_ctrl - A resource control
>> * @entry: List entry of rdt_resource::controls
>> * @scope: Scope of the resource that this control allocates
>> * @domains: RCU list of all control domains
>> * @type: The control type that determines the properties of the control,
>> * format string for displaying control values to user space, and
>> * parser of control values provided by user space.
>> * @name: Name of the control. Appended to final resource name
>> * (rdt_resource_final::name) to create final schema entry.
>> * Specifically, "rdt_resource_final::name"_"resctrl_ctrl::name".
>> * For example, with resource name "MB" and control name "MAX" the
>> * schema entry will be "MB_MAX".
>> * @cache: Cache allocation control properties.
>> * @membw: Bandwidth control properties.
>> */
>> struct resctrl_ctrl {
>> struct list_head entry;
>> enum resctrl_scope scope;
>> struct list_head domains;
>> enum resctrl_ctrl_type type;
>> enum resctrl_ctrl_name name;
>> union {
>> struct resctrl_cache cache;
>> struct resctrl_membw membw;
>> };
>> };
>>
>> Two members summarize how this new structure fits into the rest of resctrl:
>> a) resctrl_ctrl::entry
>> Since a resource can support multiple controls there is a new list
>> in struct rdt_resource named "controls" that contains the list of all
>> controls supported by the resource.
>> b) resctrl_ctrl::domains
>> Instead of the list of control domains belonging to a resource they
>> now belong to the control self. By doing so resctrl can support resource
>> controls at different scope for the same resource. This is intended to
>> support some upcoming MPAM and RISC-V usages.
>
> Please can you expand a bit on part b).
>
> In an MPAM system we consider 3 resctrl resources, RDT_RESOURCE_L3,
> RDT_RESOURCE_L2 and RDT_RESOURCE_MBA which correspond to the L3 caches, L2
> caches and memory bandwidth on egress from the L3 caches. The domain for each of
> these corresponds to the instance of the resource. That is, for RDT_RESOURCE_L2
> there is a resource for each L2 instance, similarly for L3, and for

(I'm assuming above is typo and it is "there is a domain for each L2 instance"?)

> RDT_RESOURCE_MBA there is a domain for each L3 cache. If we were to add support
> for controls on a new cache level, say the L4, then I'd expect to add a new
> resource. For memory bandwidth, we'd like to be able to control b/w on the L2
> egress (e.g. in a DSU). Wouldn't this too be a separate resource or would this
> be a new set of controls on the same resource?
>
> New controls on the same resource
> MB_MIN2
> MB_MAX2
> MB_PROP2
> ...
>
> or
> MB2_MIN
> MB2_MAX
> MB2_PROP


The way I currently see it is that controlling bandwidth at a different scope would
be a new set of controls associated with the MB resource. There are more scenarios
coming this way with AMD's "Global MBA", which is memory bandwidth allocation at
NUMA node scope. If I understand correctly, the "CPU-less Memory Node" that Nvidia
shared at plumbers would also need this and would control memory bandwidth
allocation at NUMA node scope. A related technology is Intel's region-aware MBA,
which is still at L3 scope.

I fully agree that we need to figure out how to represent all of this to user space
without turning the interface into something unintelligible. In the end this is
required for user space to know what a domain ID represents.

Would it help to make the scope part of the control name? The ship has sailed for
MB being associated with L3 scope but this could mean the "default" scope of MB
resource is L3 (which user space can still confirm by looking at the control's
"scope" file) and the others include scope in the name? Consider for example:
https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/


>
> AFAIK, the DSU h/w just supports proportional bandwidth controls at the moment
> but we should consider what to do about the potential naming.

ack.

>
> In the MPAM driver, we collect MSC into components (based on instances) and
> those into classes (components of the same type). Currently, a resource is
> mapped to a single class. (Two resources may map to the same class.)
>
> I expect it is useful in the memory region and sub numa cases but I'd still
> expect the common case to be that the domains are the same within a control. Or
> am I missing something?

Domains of a control should all be at the same scope. Since the schemata file
exposes the control with the different IDs representing the instances of the
resource needing to be controlled it has to be clear to user space what the
domain ID represents.

>
>>
>> Example architectural data structure changes
>> ============================================
>>
>> An architecture can use the new control by following a similar pattern to
>> resource and domain use by architectures. Consider the following for x86
>> where a new architecture specific struct resctrl_hw_ctrl includes
>> struct resctrl_ctrl and any architecture private data needed to support
>> the control:
>>
>> /*
>> * struct resctrl_hw_ctrl - Arch private properties of a resource control
>> * @r_ctrl: Control properties exposed to resctrl file system
>> * @msr_base: Base MSR address where control values should be programmed
>> * @msr_update: Function pointer to update control values
>> */
>> struct resctrl_hw_ctrl {
>> struct resctrl_ctrl r_ctrl;
>> unsigned int msr_base;
>> void (*msr_update)(struct msr_param *m);
>> };
>>
>> Structure of patch series
>> =========================
>>
>> As a PoC the series is not perfectly structured but to help navigate this work
>> on a high level the changes can be categorized as follows:
>>
>> Patch 1 to 11:
>> With a vision of what a "control" is, remove unused/unnecessary
>> members, make clear what is a *resource* property vs a *control*
>> property, do some renaming to help with the PoC.
>
> A few of the changes are generic cleanup and could hopefully be dealt with
> before decisions on the larger PoC are made. I see:
> fs/resctrl: Remove unused resctrl_membw::mb_map
> x86,fs/resctrl: Remove "arch_needs_linear"
> Perhaps a few more.

ack.

>>
>> Patch 12:
>> Introduce struct resctrl_ctrl and re-arrange existing struct rdt_resource
>> members to form part of new rdt_resource::ctrl
>>
>> Patch 13 to 44:
>> A lot of wrangling to introduce struct resctrl_ctrl to all code that needs
>> to work with a control and/or domain without assuming that the control is
>> the one and only control embedded in the resource it belongs to. Essentially,
>> a lot of changes passing the control around in addition to the resource/domain.
>
> You mention a few times in the commit message that you expect the cache
> resources to only have one control. On MPAM we have CMAX (and there looks to be
> a RISC-V equivalent) where the total number of bytes in the cache for a given
> closid is limited. The allocation must still respect the CPBM bitmap though.
> Looking at the code though I don't see much problem in adding this as an
> additional control. The assumption that these patches is making is not that
> there is only one control for cache resources but rather that cache portions are
> managed by the default cache resource control. Am I missing something or does
> that assessment make sense to you?

Your assessment is correct. There are still a few assumptions built into resctrl
about there only being a single cache control and it being a bitmap control.
Since I am not familiar with the other possible cache controls I instead focused
on isolating the existing cache control. When I/we have a better understanding of
how additional cache controls behave this implementation can be adapted to support
them.

>
> I have been looking at adding CMAX control to resctrl and will have a go at
> basing what I have so far on top of this series.

Thank you!


...

>>> Any feedback is appreciated.
>
> Overall, this looks to be a big step in the right direction.

Glad to hear this.

Thank you very much.

Reinette