Re: [PATCH] net/mlx5: Update mlx5_irq.mask when IRQ affinity changes

From: Yi Li

Date: Wed May 20 2026 - 03:23:07 EST

Hi Tariq,

Thanks for the feedback.

On 5/17/2026 4:58 PM, Tariq Toukan wrote:

>
> Hi,
>
> Thanks for the patch. Looking at the proposal, I want to discuss two distinct aspects:
>
> NAPI Execution Location: We already track effective affinity closely through irq_get_effective_affinity_mask(); NAPI processing is dynamically moved accordingly, including a forced NAPI cycle break if needed.
>

Right. I also see that the driver flushes the page_pool alloc cache via
page_pool_nid_changed() in mlx5e_post_rx_wqes(), so the RX data buffers
follow the NAPI CPU as well.

> Memory Allocation Location: This is the core focus of your patch.
>

If I understand correctly, the current design keeps the page_pool cache
near the NAPI CPU, while the channel queues are placed near the NIC based
on numa-distance. Please correct me if that's wrong.

> I have serious comments on the proposed implementation, but first I want to discuss the idea.
>
> We investigated a similar approach a few years ago but ultimately decided against upstreaming it due to stability concerns:
>
> High Volatility: The "current affinity" value can be extremely dynamic, potentially shifting multiple times per second depending on system load and tuning.
>
> Performance Risk: Sampling a highly volatile "current" value to allocate relatively permanent resources (like channel queues) risks severe worst-case performance regressions if the affinity shifts immediately after allocation.
>

I agree with the concern. However, page_pool already checks for a node ID
change on the hot path. Would sampling the affinity only once, in
mlx5e_open_channel(), avoid most of that risk?

> Because of this, we have historically relied on numa-distance logic for channel allocations to ensure a predictable baseline.
>
> Do you have any benchmark data or specific use cases showing a clear net benefit over the existing numa-distance logic?
>

Honestly, my Nginx test showed no measurable throughput change.

It's more of a functional issue I ran into. My setup:
- 6 NUMA nodes, 32 SMT cores each
- ConnectX-6 on node0
- node1 and node2 equidistant from node0
- 63 combined queues
- default spread: 32 IRQs on node0, 16 on node1, 16 on node2

I moved the 16 IRQs from node2 to node1 via smp_affinity. The IRQs
followed, but the queues stayed on node2, and I observed a lot of
cross-node traffic to/from node2. Nginx wasn't affected -- the
bottleneck is elsewhere in my workload.
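
A minimal userspace sketch for checking which NUMA node an IRQ's effective
affinity currently lands on, which is one way to confirm the IRQ/queue
mismatch described above. It is not part of the patch. It assumes that
/proc/irq/<N>/effective_affinity_list and
/sys/devices/system/node/node<M>/cpulist are available on the system; the
helper names and the MAX_NODES bound are hypothetical and chosen only for
illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_NODES 1024	/* illustrative upper bound on node ids, an assumption */

/* Return the first CPU in a cpulist string such as "0-3,8", or -1 if the
 * string is empty or malformed. The kernel prints these lists in ascending
 * order, so the first number is also the lowest CPU. */
static int cpulist_first(const char *list)
{
	char *end;
	long cpu;

	assert(list != NULL);
	cpu = strtol(list, &end, 10);
	if (end == list || cpu < 0)
		return -1;
	return (int)cpu;
}

/* Return 1 if cpu falls inside any range of the cpulist string, else 0. */
static int cpulist_contains(const char *list, int cpu)
{
	const char *p = list;
	char *end;

	assert(list != NULL);
	while (*p) {
		long lo = strtol(p, &end, 10);
		long hi = lo;

		if (end == p)
			break;		/* no further numbers */
		p = end;
		if (*p == '-') {	/* range "lo-hi" */
			p++;
			hi = strtol(p, &end, 10);
			if (end == p)
				break;
			p = end;
		}
		if (cpu >= lo && cpu <= hi)
			return 1;
		if (*p != ',')
			break;
		p++;
	}
	return 0;
}

/* Read the first line of a file into buf; return 0 on success, -1 on error. */
static int read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f))
		buf[0] = '\0';
	fclose(f);
	return 0;
}

/* Return the NUMA node of the first CPU in the IRQ's effective affinity,
 * or -1 if it cannot be determined. */
int irq_effective_node(int irq)
{
	char path[128], list[4096];
	int cpu, node;

	snprintf(path, sizeof(path), "/proc/irq/%d/effective_affinity_list", irq);
	if (read_line(path, list, sizeof(list)) < 0)
		return -1;
	cpu = cpulist_first(list);
	if (cpu < 0)
		return -1;

	for (node = 0; node < MAX_NODES; node++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/cpulist", node);
		if (read_line(path, list, sizeof(list)) < 0)
			continue;	/* node id not present */
		if (cpulist_contains(list, cpu))
			return node;
	}
	return -1;
}
```

Calling irq_effective_node() for each completion IRQ and comparing the
result with the node the queues were allocated on would show the kind of
mismatch described above.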

Do you think it would be OK to add an option that lets users change the
queue location? I'm not sure how common this need is in production, so I'd
appreciate your thoughts.

Also, I agree irq_get_effective_affinity_mask() is better than the IRQ notifier:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b6c12460b54a..073239082144 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2769,6 +2769,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 {
 	struct net_device *netdev = priv->netdev;
 	struct mlx5e_channel_param *cparam;
+	const struct cpumask *eff_mask;
 	struct mlx5_core_dev *mdev;
 	struct mlx5e_xsk_param xsk;
 	bool async_icosq_needed;
@@ -2786,6 +2787,10 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	if (err)
 		return err;

+	eff_mask = irq_get_effective_affinity_mask(irq);
+	if (eff_mask)
+		cpu = cpumask_first(eff_mask);
+
 	err = mlx5e_channel_stats_alloc(priv, ix, cpu);
 	if (err)
 		return err;

> Best regards,
> Tariq
>
>

Thanks,
-Yi