Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead

From: Jesper Dangaard Brouer
Date: Fri May 16 2025 - 10:44:34 EST

Next message: Sean Christopherson: "Re: [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()"
Previous message: Joshua Hahn: "Re: [PATCH v8] mm/mempolicy: Weighted Interleave Auto-tuning"
In reply to: Tariq Toukan: "Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead"
Next in thread: patchwork-bot+netdevbpf: "Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead"

On 16/05/2025 15.47, Tariq Toukan wrote:

On 15/05/2025 3:26, Alexei Starovoitov wrote:

On Wed, May 14, 2025 at 1:04 PM Tariq Toukan <tariqt@xxxxxxxxxx> wrote:

From: Carolina Jubran <cjubran@xxxxxxxxxx>

CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
zero-initializing all stack variables on function entry. The mlx5 XDP
RX path previously allocated a struct mlx5e_xdp_buff on the stack per
received CQE, resulting in measurable performance degradation under
this config.

This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
avoiding per-CQE stack allocations and repeated zeroing.

With this change, XDP_DROP and XDP_TX performance matches that of
kernels built without CONFIG_INIT_STACK_ALL_ZERO.

Performance was measured on a ConnectX-6Dx using a single RX channel
(1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
net-next-6.15.

Stack zeroing disabled:
- XDP_DROP:
* baseline: 31.47 Mpps
* baseline + per-RQ allocation: 32.31 Mpps (+2.68%)

31.47 Mpps = 31.77 nanosec per packet
32.31 Mpps = 30.95 nanosec per packet
Improvement: 0.82 nanosec faster

- XDP_TX:
* baseline: 12.41 Mpps
* baseline + per-RQ allocation: 12.95 Mpps (+4.30%)

The XDP_TX numbers are actually lower than I expected.
Hmm... I wonder if we regressed here(?)

12.41 Mpps = 80.58 nanosec per packet
12.95 Mpps = 77.22 nanosec per packet
Improvement: 3.36 nanosec faster

Looks good, but where are these gains coming from?
The patch just moves mxbuf from stack to rq.
The number of operations should really be the same.

I guess it's cache related. Hot/cold areas, alignments, movement of other fields in the mlx5e_rq structure...

The improvement for XDP_DROP (see calc above) is so small in nanosec
that it is hard to measure accurately and stably on any system.

The improvement for XDP_TX is above 2 nanosec, which looks like an actual improvement...

Stack zeroing enabled:
- XDP_DROP:
* baseline: 24.32 Mpps
* baseline + per-RQ allocation: 32.27 Mpps (+32.7%)

This part makes sense.

Yes, this makes sense as it is a measurable improvement.

24.32 Mpps = 41.12 nanosec per packet
32.27 Mpps = 30.99 nanosec per packet
Improvement: 10.13 nanosec faster
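
For anyone who wants to redo the calc: the per-packet figures in this thread follow from ns = 1000 / Mpps. Below is a minimal sketch in Python that reproduces them from the Mpps numbers reported above. The helper names (ns_per_packet, saved_ns) are made up for illustration and are not part of any tool used for the measurements. The last digit can differ from the hand-written figures because of rounding.

```python
# Convert the packet rates reported in this thread (Mpps) into
# per-packet time (nanoseconds) and the time saved per packet.
# Helper names are illustrative assumptions, not an existing tool.

def ns_per_packet(mpps: float) -> float:
    """1 Mpps is 1e6 packets/sec, so one packet takes 1e9 / (mpps * 1e6) ns."""
    return 1000.0 / mpps

def saved_ns(baseline_mpps: float, patched_mpps: float) -> float:
    """Nanoseconds saved per packet when going from baseline to patched rate."""
    return ns_per_packet(baseline_mpps) - ns_per_packet(patched_mpps)

# (baseline Mpps, baseline + per-RQ allocation Mpps), as reported above
results = {
    "XDP_DROP, stack zeroing disabled": (31.47, 32.31),
    "XDP_TX,   stack zeroing disabled": (12.41, 12.95),
    "XDP_DROP, stack zeroing enabled":  (24.32, 32.27),
}

for name, (base, patched) in results.items():
    print(f"{name}: {ns_per_packet(base):.2f} ns -> "
          f"{ns_per_packet(patched):.2f} ns "
          f"(saved {saved_ns(base, patched):.2f} ns)")
```

Looking at nanosec per packet instead of percentages makes it easier to judge whether a difference is above the measurement noise.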

Acked-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

--Jesper
