Re: [PATCH V2 6/8] perf/x86/intel/uncore: Introduce PMU flags and broken state
From: Chen, Zide
Date: Wed Jun 03 2026 - 11:53:40 EST
On 6/2/2026 9:13 PM, Mi, Dapeng wrote:
>
> On 6/2/2026 1:01 AM, Zide Chen wrote:
>> Replace the boolean 'registered' field in intel_uncore_pmu with an
>> unsigned long 'flags' field, and add a PMU_BROKEN flag to track box
>> setup failures. Broken flag is sticky, means it won't be cleared
>> unless module reload or system reboot.
>>
>> Broken PMUs are skipped in the CPU hotplug and box allocation paths.
>>
>> When any box fails to initialize, the PMU is marked broken. Broken
>> PMUs reject new event assignments and skip future box setup attempts.
>> If the PMU was already registered, it remains so to avoid disrupting
>> in-flight events on other boxes.
>>
>> Signed-off-by: Zide Chen <zide.chen@xxxxxxxxx>
>> ---
>> v2:
>> - Make the broken flag sticky by clear_bit() in uncore_pmu_unregister()
>> other than zeroing out pmu->flag.
>> - In uncore_change_type_ctx(), don't stop CPU migration on broken PMU
>> for in-flight events. (Sashiko).
>> - Use box->cpu == -1 to identify inactive boxes that don't need
>> migration, no need to check uncore_die_has_box(), which is incomplete.
>> ---
>> arch/x86/events/intel/uncore.c | 44 ++++++++++++++++++++++--------
>> arch/x86/events/intel/uncore.h | 13 ++++++++-
>> arch/x86/events/intel/uncore_snb.c | 2 +-
>> 3 files changed, 46 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
>> index 283e41933ba7..f2cb3fde2dda 100644
>> --- a/arch/x86/events/intel/uncore.c
>> +++ b/arch/x86/events/intel/uncore.c
>> @@ -757,7 +757,7 @@ static int uncore_pmu_event_init(struct perf_event *event)
>>
>> pmu = uncore_event_to_pmu(event);
>> /* no device found for this pmu */
>> - if (!pmu->registered)
>> + if (!uncore_pmu_available(pmu))
>> return -ENOENT;
>>
>> /* Sampling not supported yet */
>> @@ -953,16 +953,18 @@ static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
>>
>> ret = perf_pmu_register(&pmu->pmu, pmu->name, -1);
>> if (!ret)
>> - pmu->registered = true;
>> + uncore_pmu_set_registered(pmu);
>> return ret;
>> }
>>
>> static void uncore_pmu_unregister(struct intel_uncore_pmu *pmu)
>> {
>> - if (!pmu->registered)
>> + if (!uncore_pmu_registered(pmu))
>> return;
>> perf_pmu_unregister(&pmu->pmu);
>> - pmu->registered = false;
>> +
>> + /* Keep PMU_BROKEN_BIT stick. */
>> + uncore_pmu_clear_registered(pmu);
>
> Why to set PMU_BROKEN_BIT to a stick bit and doesn't clear it when
> unregistering the PMU? Per my understanding, there is no active boxes when
> uncore_pmu_unregister() is called, seem all flags can be cleared. When the
> new CPU or PCI device are hot plug-in again, it may be a new device or the
> old issue which leads to the PMU can't be initialized has been fixed, the
> uncore PMU can work correctly, the PMU_BROKEN_BIT should be set any more.
Woops, a typo here: stick->sticky.
Let's examine a theoretical flow for a particular PMU:
PCI dev on die0 online: activeboxes=1, PMU registered
PCI dev on die1 online: init_box failed, pmu->activeboxes=1, set broken
PCI dev on die0 offline: activeboxes=0, PMU unregistered
If the broken bit is cleared in uncore_pmu_unregister():
PCI dev on die0 or die2 online: activeboxes=1, PMU registered
This is wrong, because the box for die1 is not working.
In reality, box_allocate() or init_box() failures may not happen at all,
and when they do, the system may be experiencing severe out-of-memory or
other catastrophic conditions. Keeping the broken flag sticky should be
fine.
One alternative to avoid making it sticky is to use pmu->activeboxes as
pmu->die_refcnt to track online dies, not necessarily active boxes, as
v1 does.
The problem with this approach is that it could cause a pmu->die_refcnt
leak in the uncore_pci_remove() path: if init_box() fails but
pmu->die_refcnt has already been incremented, there is no chance to call
uncore_pci_pmu_unregister() because pci_get_drvdata() returns NULL.
>> }
>>
>> static void uncore_free_boxes(struct intel_uncore_pmu *pmu)
>> @@ -1153,7 +1155,12 @@ static int uncore_box_setup(struct intel_uncore_pmu *pmu,
>> {
>> int ret;
>>
>> - uncore_box_init(box);
>> + if (uncore_pmu_broken(pmu))
>> + return -ENODEV;
>> +
>> + ret = uncore_box_init(box);
>> + if (ret)
>> + goto err;
>>
>> /* First active box registers the pmu. */
>> if (atomic_inc_return(&pmu->activeboxes) > 1)
>> @@ -1167,6 +1174,16 @@ static int uncore_box_setup(struct intel_uncore_pmu *pmu,
>>
>> return 0;
>> err:
>> + /*
>> + * On failure on any box, mark the per-package PMU as broken regardless
>> + * of whether it was registered or not.
>> + *
>> + * Don't decrement refcnt to avoid other in-die CPUs from trying to set
>> + * up the PMU box again.
>> + *
>> + * Don't kfree box; MSR and MMIO boxes are freed at module exit only.
>> + */
>> + uncore_pmu_set_broken(pmu);
>> uncore_box_exit(box);
>> return ret;
>> }
>> @@ -1190,8 +1207,10 @@ static int uncore_pci_pmu_register(struct pci_dev *pdev,
>> return -EINVAL;
>>
>> box = uncore_alloc_box(type, NUMA_NO_NODE);
>> - if (!box)
>> + if (!box) {
>> + uncore_pmu_set_broken(pmu);
>> return -ENOMEM;
>> + }
>>
>> box->dieid = die;
>> box->pci_dev = pdev;
>> @@ -1504,7 +1523,8 @@ static void uncore_change_type_ctx(struct intel_uncore_type *type, int old_cpu,
>>
>> if (old_cpu < 0) {
>> WARN_ON_ONCE(box->cpu != -1);
>> - if (uncore_die_has_box(type, die, pmu->pmu_idx)) {
>> + if (uncore_die_has_box(type, die, pmu->pmu_idx) &&
>> + !uncore_pmu_broken(pmu)) {
>> box->cpu = new_cpu;
>> cpumask_set_cpu(new_cpu, &pmu->cpu_mask);
>> }
>> @@ -1512,12 +1532,14 @@ static void uncore_change_type_ctx(struct intel_uncore_type *type, int old_cpu,
>> }
>>
>> WARN_ON_ONCE(box->cpu != -1 && box->cpu != old_cpu);
>> - box->cpu = -1;
>> cpumask_clear_cpu(old_cpu, &pmu->cpu_mask);
>> - if (new_cpu < 0)
>> + if (new_cpu < 0) {
>> + box->cpu = -1;
>> continue;
>> + }
>>
>> - if (!uncore_die_has_box(type, die, pmu->pmu_idx))
>> + /* non-active box doesn't need migration. */
>> + if (box->cpu == -1)
>> continue;
>> uncore_pmu_cancel_hrtimer(box);
>> perf_pmu_migrate_context(&pmu->pmu, old_cpu, new_cpu);
>> @@ -1593,7 +1615,7 @@ static int allocate_boxes(struct intel_uncore_type **types,
>> type = *types;
>> pmu = type->pmus;
>> for (i = 0; i < type->num_boxes; i++, pmu++) {
>
> IIUC, the "type->num_boxes" actually indicates the number of PMU or the
> number of box classes. Since currently there could be multiple boxes for
> each kind of PMUs on the multiple dies device, the name "num_boxes" become
> inaccurate and much misleading. could we change it to a more accurate name,
> like "num_pmus" or "num_box_classes"?
Yes, agree. Will do more clean up in the future.
>
>> - if (pmu->boxes[die])
>> + if (pmu->boxes[die] || uncore_pmu_broken(pmu))
>> continue;
>> box = uncore_alloc_box(type, cpu_to_node(cpu));
>> if (!box)
>> diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
>> index d732b87be0a9..0adb477d9708 100644
>> --- a/arch/x86/events/intel/uncore.h
>> +++ b/arch/x86/events/intel/uncore.h
>> @@ -146,13 +146,24 @@ struct intel_uncore_pmu {
>> struct pmu pmu;
>> char name[UNCORE_PMU_NAME_LEN];
>> int pmu_idx;
>> - bool registered;
>> + unsigned long flags;
>> atomic_t activeboxes;
>> cpumask_t cpu_mask;
>> struct intel_uncore_type *type;
>> struct intel_uncore_box **boxes;
>> };
>>
>> +#define PMU_REGISTERED_BIT 0
>> +#define PMU_BROKEN_BIT 1
>> +
>> +#define uncore_pmu_registered(pmu) test_bit(PMU_REGISTERED_BIT, &(pmu)->flags)
>> +#define uncore_pmu_broken(pmu) test_bit(PMU_BROKEN_BIT, &(pmu)->flags)
>> +#define uncore_pmu_available(pmu) (uncore_pmu_registered(pmu) && \
>> + !uncore_pmu_broken(pmu))
>> +#define uncore_pmu_set_registered(pmu) set_bit(PMU_REGISTERED_BIT, &(pmu)->flags)
>> +#define uncore_pmu_set_broken(pmu) set_bit(PMU_BROKEN_BIT, &(pmu)->flags)
>> +#define uncore_pmu_clear_registered(pmu) clear_bit(PMU_REGISTERED_BIT, &(pmu)->flags)
>> +
>> struct intel_uncore_extra_reg {
>> raw_spinlock_t lock;
>> u64 config, config1, config2;
>> diff --git a/arch/x86/events/intel/uncore_snb.c b/arch/x86/events/intel/uncore_snb.c
>> index c5347920541c..055131c508ff 100644
>> --- a/arch/x86/events/intel/uncore_snb.c
>> +++ b/arch/x86/events/intel/uncore_snb.c
>> @@ -940,7 +940,7 @@ static int snb_uncore_imc_event_init(struct perf_event *event)
>>
>> pmu = uncore_event_to_pmu(event);
>> /* no device found for this pmu */
>> - if (!pmu->registered)
>> + if (!uncore_pmu_available(pmu))
>> return -ENOENT;
>>
>> /* Sampling not supported yet */