Re: [PATCH] cpuidle: Deny idle entry when CPU already has IPI interrupt pending

From: Maulik Shah (mkshah)

Date: Wed Mar 25 2026 - 11:51:39 EST




On 3/24/2026 9:16 PM, Ulf Hansson wrote:
> On Mon, 16 Mar 2026 at 08:38, Maulik Shah <maulik.shah@xxxxxxxxxxxxxxxx> wrote:
>>
>> A CPU can get an IPI from another CPU while it is executing
>> cpuidle_select(), or just before it does. The selection does not account
>> for pending interrupts, so the CPU may enter the selected idle state only
>> to exit immediately.
>>
>> Example trace collected when there is cross CPU IPI.
>>
>> [000] 154.892148: sched_waking: comm=sugov:4 pid=491 prio=-1 target_cpu=007
>> [000] 154.892148: ipi_raise: target_mask=00000000,00000080 (Function call interrupts)
>> [007] 154.892162: cpu_idle: state=2 cpu_id=7
>> [007] 154.892208: cpu_idle: state=4294967295 cpu_id=7
>> [007] 154.892211: irq_handler_entry: irq=2 name=IPI
>> [007] 154.892211: ipi_entry: (Function call interrupts)
>> [007] 154.892213: sched_wakeup: comm=sugov:4 pid=491 prio=-1 target_cpu=007
>> [007] 154.892214: ipi_exit: (Function call interrupts)
>>
>> This impacts performance, and the above count keeps incrementing.
>>
>> commit ccde6525183c ("smp: Introduce a helper function to check for pending
>> IPIs") already introduced a helper function to check the pending IPIs and
>> it is used in the pmdomain governor to deny the cluster-level idle state
>> when there is a pending IPI on any of the cluster's CPUs.
>>
>> This, however, does not prevent a CPU from entering a CPU-level idle
>> state. Use the same helper in cpuidle to deny idle entry when an IPI is
>> already pending.
>>
>> With this change, glmark2 [1] offscreen scores improve in the range of
>> 25% to 30% on the Qualcomm lemans-evk board, an arm64 platform with two
>> clusters of 4 CPUs each.
>>
>> [1] https://github.com/glmark2/glmark2
>>
>> Signed-off-by: Maulik Shah <maulik.shah@xxxxxxxxxxxxxxxx>
>> ---
>> drivers/cpuidle/cpuidle.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index c7876e9e024f9076663063ad21cfc69343fdbbe7..c88c0cbf910d6c2c09697e6a3ac78c081868c2ad 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -224,6 +224,9 @@ noinstr int cpuidle_enter_state(struct cpuidle_device *dev,
>> bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
>> ktime_t time_start, time_end;
>>
>> + if (cpus_peek_for_pending_ipi(drv->cpumask))
>> + return -EBUSY;
>
> As other reviews already pointed out, this must be called only for the
> current CPU.

Yes, addressing in v2.
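
i.e. something along these lines for v2 (an untested sketch, assuming
cpus_peek_for_pending_ipi() accepts an arbitrary cpumask such as
cpumask_of(dev->cpu)):

```diff
-	if (cpus_peek_for_pending_ipi(drv->cpumask))
+	if (cpus_peek_for_pending_ipi(cpumask_of(dev->cpu)))
 		return -EBUSY;
```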

>
> That said, did you play with bailing out just before the call to the
> target_state->enter()? It would be interesting to know if that changes
> the "stats" somehow.

Yes, I did try moving this, and the "stats" do change, with differences in the next idle state selection.

Below is the high-level scenario when the IPI deny check is placed inside target_state->enter(),
or anywhere else that updates the rejected count.

Using the menu governor:

entered_state = target_state->enter(dev, drv, index);

- entered_state will be a negative error code when an IPI is pending.

- This increments the rejected count (which is okay), but it also sets dev->last_residency_ns = 0;

- On the next idle entry, the menu governor will log this last interval as failed,
via menu_select() -> menu_update_intervals(data, UINT_MAX);
this is because menu_reflect() is not invoked for a failed entry.

- This makes 1 of the last 8 intervals invalid. Invalid intervals are not accounted by get_typical_interval(),
so the divisor there never reaches 8 and the "goto again" loop inside get_typical_interval()
aborts "faster". This logged interval influences the next 8 idle entries, until
it gets replaced with a meaningful value.

- In the ideal case, get_typical_interval() walks the divisor down from 8 to 7 to 6, and once it is
down to 6 it aborts if it has not predicted yet, so at most 3 trials. With any interval invalid,
it finishes faster, often without a prediction.

- Sometimes IPI bailouts happen frequently; in such cases 3 to 4 intervals in the history are invalid,
effectively preventing get_typical_interval() from predicting at all, since the divisor in the first loop is already below 6.

- Not predicting via get_typical_interval() makes deeper idle state selection happen more often, since
selection then relies only on the adjusted sleep lengths.
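
The divisor behavior above can be illustrated with a simplified userspace model of
get_typical_interval() (a sketch only, not the kernel code: the helper name
typical_interval() and the exact variance thresholds are my own stand-ins; the point
is the "skip invalid entries" and "give up below 3/4 of the samples" structure):

```c
#include <stdint.h>
#include <limits.h>

#define INTERVALS 8

/*
 * Simplified userspace model of the menu governor's
 * get_typical_interval() (a sketch, not the kernel code):
 * entries equal to UINT_MAX are invalid and never counted,
 * outliers are dropped one per pass via max_thresh, and the
 * search gives up (returns UINT_MAX) once fewer than 3/4 of
 * the samples remain, i.e. divisor * 4 <= INTERVALS * 3.
 */
static unsigned int typical_interval(const unsigned int intervals[INTERVALS])
{
	int64_t max_thresh = UINT_MAX;	/* excludes invalid entries */

	for (;;) {
		uint64_t avg = 0, variance = 0;
		unsigned int max = 0;
		int i, divisor = 0;

		for (i = 0; i < INTERVALS; i++) {
			int64_t value = intervals[i];

			if (value >= max_thresh)
				continue;
			divisor++;
			avg += (uint64_t)value;
			if (value > max)
				max = (unsigned int)value;
		}

		/* Too few usable samples left: no prediction possible. */
		if (!max || divisor * 4 <= INTERVALS * 3)
			return UINT_MAX;

		avg /= (unsigned int)divisor;

		for (i = 0; i < INTERVALS; i++) {
			int64_t value = intervals[i];

			if (value >= max_thresh)
				continue;
			int64_t diff = value - (int64_t)avg;
			variance += (uint64_t)(diff * diff);
		}
		variance /= (unsigned int)divisor;

		/*
		 * "Stable enough" check, a stand-in for the kernel's
		 * stddev-based conditions: predict when the spread is
		 * small in absolute terms or relative to the average.
		 */
		if (variance <= 400 || variance * 36 < avg * avg)
			return (unsigned int)avg;

		/* Drop the current maximum and retry with fewer samples. */
		max_thresh = max;
	}
}
```

With a clean 8-entry history of similar intervals the model predicts; replace 3 of
the 8 entries with UINT_MAX and the divisor starts below 6, so it returns UINT_MAX
without predicting, which is the effect described above.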

In summary,

The benefit of aborting the idle entry with an IPI pending is nullified when the abort is logged
as "rejected", because subsequent idle entries can then go deeper.

If we skip setting dev->last_residency_ns = 0 for rejected cases, or make menu_select() ignore the
last rejected iteration when updating the interval history, that would improve the chances of
get_typical_interval() predicting a meaningful sleep length (as a separate change).

For this change, IMO it is better to bail out early without logging, since the CPU has not entered the idle driver yet.

Thanks,
Maulik

>
>> +
>> instrumentation_begin();
>>
>> /*
>>
>> ---
>> base-commit: b84a0ebe421ca56995ff78b66307667b62b3a900
>> change-id: 20260316-cpuidle_ipi-4c64036f9a48
>>
>> Best regards,
>> --
>> Maulik Shah <maulik.shah@xxxxxxxxxxxxxxxx>
>>
>
> Kind regards
> Uffe