[RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW
From: Siddharth Vadapalli
Date: Wed Mar 25 2026 - 08:43:01 EST
Hello,
NOTE for MAINTAINERS:
The patches in this series span three subsystems. I have posted them as a
single RFC series so that reviewers can see the complete implementation.
I will eventually split the series and post the patches sequentially to
the respective subsystems' mailing lists:
1. SoC
2. DMAEngine
3. Netdev
The series is based on commit
d1e59a469737 tcp: add cwnd_event_tx_start to tcp_congestion_ops
of the main branch of the net-next tree. When I split the series in the
future, I shall base the SoC and DMAEngine patches on linux-next and the
Netdev patches on net-next.
This series enables batch processing in the am65-cpsw-nuss.c driver on
the transmit path (ndo_start_xmit and ndo_xdp_xmit) and the transmit
completion path. It also recycles descriptors instead of releasing them
to the pool and reallocating them. The resulting difference in memory
footprint is hardly noticeable (under 1 MB).
Feedback on the implementation with respect to correctness, ease of use /
maintenance, and configurability (the sysfs-based option for changing the
batch size) is appreciated.
The series has been tested in the following configurations to cover edge
cases:
1. Single-Port (CPSW2G on J784S4-EVM)
2. Multi-Port (CPSW3G on AM625-SK)
3. Bidirectional TCP Iperf followed by interfaces being brought down
with traffic in flight (and TX / RX DMA Channel Teardown) followed
by interfaces being brought up and ensuring that Iperf traffic
resumes.
The primary motivation for this series is to lower the CPU load and
achieve higher throughput for gigabit and multi-gigabit operation.
The features I plan to implement next are:
1. Batch processing on the RX path.
2. Batch processing on ICSSG, similar to CPSW (since batch processing
   increases latency, it might not be desirable there and may be
   skipped).
The following sections capture the improvements brought about by this
series.
[1] AM625-SK with CPSW3G (multi-port / two netdevs) and single A53
processor (remaining CPUs are disabled) with each MAC Port operating
at 1 Gbps Full-Duplex.
===========================================================================
Baseline for [1]
===========================================================================
Dual TX Iperf UDP traffic at 100% CPU Load averaged over 30 seconds:
403 Mbps + 408 Mbps = 811 Mbps
Dual RX Iperf TCP traffic at 100% CPU Load averaged over 30 seconds:
336 Mbps + 331 Mbps = 667 Mbps
===========================================================================
With this series for [1]
===========================================================================
Dual TX Iperf UDP traffic at 100% CPU Load averaged over 30 seconds:
428 Mbps + 437 Mbps = 865 Mbps
Dual RX Iperf TCP traffic at 100% CPU Load averaged over 30 seconds:
332 Mbps + 337 Mbps = 669 Mbps
[2] J784S4-EVM with CPSW2G (single-port) and single A72 processor
(remaining CPUs are disabled) with the MAC Port operating at 1 Gbps
Full-Duplex.
===========================================================================
Baseline for [2]
===========================================================================
TX Iperf UDP traffic at 84% CPU Load averaged over 30 seconds:
956 Mbps
RX Iperf TCP traffic at 100% CPU Load averaged over 30 seconds:
941 Mbps
===========================================================================
With this series for [2]
===========================================================================
TX Iperf UDP traffic at 80% CPU Load averaged over 30 seconds:
956 Mbps
RX Iperf TCP traffic at 100% CPU Load averaged over 30 seconds:
941 Mbps
[3] J784S4-EVM with CPSW9G (multi-port) and single A72 processor
(remaining CPUs are disabled) with one MAC Port operating at 5 Gbps
Full-Duplex.
===========================================================================
Baseline for [3]
===========================================================================
TX Iperf UDP traffic at 100% CPU Load averaged over 30 seconds:
1.26 Gbps
RX Iperf TCP traffic at 75% CPU Load averaged over 30 seconds:
1.73 Gbps
===========================================================================
With this series for [3]
===========================================================================
TX Iperf UDP traffic at 100% CPU Load averaged over 30 seconds:
1.28 Gbps
RX Iperf TCP traffic at 75% CPU Load averaged over 30 seconds:
1.75 Gbps
Regards,
Siddharth.
Siddharth Vadapalli (6):
soc: ti: k3-ringacc: Add helper to get realtime count of free elements
soc: ti: k3-ringacc: Add helpers for batch push and pop operations
dmaengine: ti: k3-udma-glue: Add helpers for batch operations on TX/RX
DMA
net: ethernet: ti: am65-cpsw-nuss: Do not set buf_type for SKB
fragments
net: ethernet: ti: am65-cpsw-nuss: Recycle TX and RX CPPI Descriptors
net: ethernet: ti: am65-cpsw-nuss: Enable batch processing for TX / TX
CMPL
drivers/dma/ti/k3-udma-glue.c | 55 +++
drivers/net/ethernet/ti/am65-cpsw-nuss.c | 441 +++++++++++++++++++----
drivers/net/ethernet/ti/am65-cpsw-nuss.h | 31 ++
drivers/soc/ti/k3-ringacc.c | 99 +++++
include/linux/dma/k3-udma-glue.h | 12 +
include/linux/soc/ti/k3-ringacc.h | 35 ++
6 files changed, 612 insertions(+), 61 deletions(-)
--
2.51.1