Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead

From: Jesper Dangaard Brouer
Date: Fri May 16 2025 - 10:44:34 EST




On 16/05/2025 15.47, Tariq Toukan wrote:


On 15/05/2025 3:26, Alexei Starovoitov wrote:
On Wed, May 14, 2025 at 1:04 PM Tariq Toukan <tariqt@xxxxxxxxxx> wrote:

From: Carolina Jubran <cjubran@xxxxxxxxxx>

CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
zero-initializing all stack variables on function entry. The mlx5 XDP
RX path previously allocated a struct mlx5e_xdp_buff on the stack per
received CQE, resulting in measurable performance degradation under
this config.

This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
avoiding per-CQE stack allocations and repeated zeroing.

With this change, XDP_DROP and XDP_TX performance matches that of
kernels built without CONFIG_INIT_STACK_ALL_ZERO.

Performance was measured on a ConnectX-6Dx using a single RX channel
(1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
net-next-6.15.

Stack zeroing disabled:
- XDP_DROP:
     * baseline:                     31.47 Mpps
     * baseline + per-RQ allocation: 32.31 Mpps (+2.68%)


31.47 Mpps = 31.77 nanosec per packet
32.31 Mpps = 30.95 nanosec per packet
Improvement: 0.82 nanosec faster

- XDP_TX:
     * baseline:                     12.41 Mpps
     * baseline + per-RQ allocation: 12.95 Mpps (+4.30%)


The XDP_TX numbers are actually lower than I expected.
Hmm... I wonder if we regressed here(?)

12.41 Mpps = 80.58 nanosec per packet
12.95 Mpps = 77.22 nanosec per packet
Improvement: 3.36 nanosec faster

Looks good, but where are these gains coming from?
The patch just moves mxbuf from stack to rq.
The number of operations should really be the same.


I guess it's cache related. Hot/cold areas, alignments, movement of other fields in the mlx5e_rq structure...

The improvement for XDP_DROP (see calc above) is so small in nanosec
(0.82 ns per packet) that it is hard to measure accurately and stably
on any system.

The improvement for XDP_TX is above 2 nanosec, which looks like an actual improvement...


Stack zeroing enabled:
- XDP_DROP:
     * baseline:                     24.32 Mpps
     * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)

This part makes sense.


Yes, this makes sense as it is a measurable improvement.

24.32 Mpps = 41.12 nanosec per packet
32.27 Mpps = 30.99 nanosec per packet
Improvement: 10.13 nanosec faster
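
For anyone who wants to redo the per-packet numbers in this thread, here is a minimal sketch of the conversion. It only assumes that 1 Mpps means 10^6 packets per second, so one packet takes 1000 / Mpps nanoseconds. The helper names (ns_per_packet, saving_ns) are made up for illustration and are not part of any tool.

```python
def ns_per_packet(mpps: float) -> float:
    """Per-packet time in nanoseconds for a packet rate given in Mpps."""
    # 1 Mpps = 1e6 packets/s, so one packet takes 1e9 / (mpps * 1e6) ns.
    return 1000.0 / mpps


def saving_ns(before_mpps: float, after_mpps: float) -> float:
    """Nanoseconds saved per packet when the rate rises from before to after."""
    return ns_per_packet(before_mpps) - ns_per_packet(after_mpps)


# Rates quoted in this thread, in Mpps: (baseline, baseline + per-RQ allocation)
results = {
    "XDP_DROP, stack zeroing disabled": (31.47, 32.31),
    "XDP_TX, stack zeroing disabled": (12.41, 12.95),
    "XDP_DROP, stack zeroing enabled": (24.32, 32.27),
}

for name, (before, after) in results.items():
    print(f"{name}: {saving_ns(before, after):.2f} ns saved per packet")
```

The printed values may differ in the last digit from the figures above, because those were obtained by subtracting already-rounded per-packet times. Comparing in nanoseconds rather than percent is deliberate: a small percentage at a high packet rate can be below what a system can measure reliably.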

Acked-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

--Jesper