Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
From: Jesper Dangaard Brouer
Date: Fri May 16 2025 - 10:44:34 EST
On 16/05/2025 15.47, Tariq Toukan wrote:
On 15/05/2025 3:26, Alexei Starovoitov wrote:
On Wed, May 14, 2025 at 1:04 PM Tariq Toukan <tariqt@xxxxxxxxxx> wrote:
From: Carolina Jubran <cjubran@xxxxxxxxxx>
CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
zero-initializing all stack variables on function entry. The mlx5 XDP
RX path previously allocated a struct mlx5e_xdp_buff on the stack per
received CQE, resulting in measurable performance degradation under
this config.
This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
avoiding per-CQE stack allocations and repeated zeroing.
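To make the change concrete, here is a rough userspace sketch of the two patterns. It is not the mlx5 driver code: struct demo_xdp_buff, struct demo_rq, both handler functions and the 176-byte scratch area are hypothetical stand-ins chosen only for illustration. The explicit "= {0}" models the zeroing that the compiler inserts for on-stack variables under CONFIG_INIT_STACK_ALL_ZERO.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mlx5e_xdp_buff; the size is made up. */
struct demo_xdp_buff {
	void *data;
	size_t len;
	unsigned char scratch[176];	/* bulk that makes zeroing cost something */
};

/* Hypothetical stand-in for struct mlx5e_rq. */
struct demo_rq {
	unsigned long packets;
	struct demo_xdp_buff mxbuf;	/* one buffer, reused for every completion */
};

/*
 * Before: a fresh on-stack buffer per completion. The "= {0}" initializer
 * models the automatic stack zeroing, which clears the object on every call
 * even though only two fields are used.
 */
static size_t handle_cqe_stack(struct demo_rq *rq, void *data, size_t len)
{
	struct demo_xdp_buff mxbuf = {0};

	assert(rq != NULL);
	mxbuf.data = data;
	mxbuf.len = len;
	rq->packets++;
	return mxbuf.len;
}

/*
 * After: reuse the buffer embedded in the RQ. Nothing is zeroed per call;
 * only the fields that are consumed get written.
 */
static size_t handle_cqe_per_rq(struct demo_rq *rq, void *data, size_t len)
{
	struct demo_xdp_buff *mxbuf;

	assert(rq != NULL);
	mxbuf = &rq->mxbuf;
	mxbuf->data = data;
	mxbuf->len = len;
	rq->packets++;
	return mxbuf->len;
}
```

One consequence of the reuse pattern, at least in this sketch: fields that are not rewritten keep whatever the previous packet left behind, so every field that is read must be set explicitly on each completion.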
With this change, XDP_DROP and XDP_TX performance matches that of
kernels built without CONFIG_INIT_STACK_ALL_ZERO.
Performance was measured on a ConnectX-6Dx using a single RX channel
(1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
net-next-6.15.
Stack zeroing disabled:
- XDP_DROP:
* baseline: 31.47 Mpps
* baseline + per-RQ allocation: 32.31 Mpps (+2.68%)
31.47 Mpps = 31.77 nanosec per packet
32.31 Mpps = 30.95 nanosec per packet
Improvement: 0.82 nanosec faster
- XDP_TX:
* baseline: 12.41 Mpps
* baseline + per-RQ allocation: 12.95 Mpps (+4.30%)
The XDP_TX numbers are actually lower than I expected.
Hmm... I wonder if we regressed here(?)
12.41 Mpps = 80.58 nanosec per packet
12.95 Mpps = 77.22 nanosec per packet
Improvement: 3.36 nanosec faster
Looks good, but where are these gains coming from?
The patch just moves mxbuf from stack to rq.
The number of operations should really be the same.
I guess it's cache related. Hot/cold areas, alignments, movement of
other fields in the mlx5e_rq structure...
The improvement for XDP_DROP (see calc above) is so small in nanoseconds
that it is hard to measure accurately and stably on any system.
The improvement for XDP_TX is above 2 nanosec, which looks like an
actual improvement...
Stack zeroing enabled:
- XDP_DROP:
* baseline: 24.32 Mpps
* baseline + per-RQ allocation: 32.27 Mpps (+32.7%)
This part makes sense.
Yes, this makes sense as it is a measurable improvement.
24.32 Mpps = 41.12 nanosec per packet
32.27 Mpps = 30.99 nanosec per packet
Improvement: 10.13 nanosec faster
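For anyone who wants to redo the per-packet calculations used in this mail, the conversion is simply 1000 / Mpps, since 1 Mpps is 10^6 packets per second and one second is 10^9 nanoseconds. A minimal sketch follows; the helper names mpps_to_ns() and ns_saved() are made up for this example and are not part of any kernel or tool API.

```c
#include <assert.h>

/* Per-packet time in nanoseconds for a packet rate given in Mpps. */
static double mpps_to_ns(double mpps)
{
	assert(mpps > 0.0);
	/* 10^9 ns per second divided by (mpps * 10^6) packets per second */
	return 1000.0 / mpps;
}

/* Nanoseconds saved per packet when the rate goes from base to improved. */
static double ns_saved(double base_mpps, double improved_mpps)
{
	return mpps_to_ns(base_mpps) - mpps_to_ns(improved_mpps);
}
```

Calling ns_saved(24.32, 32.27) gives roughly 10.13 nanosec, ns_saved(12.41, 12.95) roughly 3.36, and ns_saved(31.47, 32.31) roughly 0.83, in line with the figures quoted in this thread.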
Acked-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
--Jesper