Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Simon Schippers

Date: Fri May 22 2026 - 04:50:53 EST

On 5/22/26 09:14, Jonas Köppeler wrote:
> On 5/19/26 10:51 PM, Simon Schippers wrote:
>> On 5/12/26 23:55, Simon Schippers wrote:
>>> On 5/12/26 15:54, Jesper Dangaard Brouer wrote:
>>>>>> Nope, I'm using a bpftrace program to keep track of the inflight/limit
>>>>>> in a BPF hashmap. Reading from /sys will not be accurate.
>>>>> Ah nice.
>>>> Add the option --hist to have both NAPI and BQL histograms printed when
>>>> script ends. This will give you an accurate pattern of how inflight and
>>>> limit evolves.
>>>>
>>>>>> I moved the selftests into a github repo [1] to allow us to collaborate
>>>>>> and evaluate the changes more easily. I explicitly kept the new BPF
>>>>>> based BQL tracking as a commit[2] for your benefit.
>>>>>>
>>>>>> [1]https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
>>>>>>
>>>>>> [2]https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
>>>>> Thanks for sharing. After minor issues I was able to set it up
>>>>> (currently I am just using plain v5, will look at the coalescing patch
>>>>> when I find the time):
>>>>>
>>>>> Can confirm the latency reduction with the default settings, in my case
>>>>> 4.888ms to 0.241ms.
>>>>>
>>>>> With the same script I was also able to see a performance slow down:
>>>>> veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
>>>>> --> ~510 Kpps
>>>>> Same with --bql-disable
>>>>> --> ~570 Kpps
>>>>> --> 12% faster
>>>>>
>>>> Thanks for running these benchmarks.
>>>>
>>>> Notice that --nrules 0 can easily result in no-queuing (on average),
>>>> because the veth NAPI consumer is faster than the producer. You will
>>>> likely see BQL inflight=1 and sink reported avg latency very low
>>>> (remember it is okay that the sink gets a high latency penalty as long
>>>> as ping latency remains low, as that shows AQM is working).
>>> I ran the benchmarks with --hist and I see what you mean.
>>> I have very similar results.
>>>
>>> Is Jonas' way [1] of modifying pktgen maybe the best option to ensure
>>> that the producer is faster than the consumer?
>>>
>>> [1] Link:https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/
>>>
>>>> Hi, so what I found is that pktgen does not respect
>>>> __QUEUE_STATE_STACK_XOFF. So the test data above is invalid, since it
>>>> just sent packets even if the BQL "stopped" the queue. So I patched
>>>> pktgen with the following:
>>>>
>>>> - if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
>>>> + if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
>>>
>>> After thinking more about the implementation I see possible issues:
>>>
>>> 1. netdev_tx_completed_queue() never reports more than burst=64 packets:
>>>
>>> BQL only increments the limit if the queue was starved. That means:
>>> "The queue was over-limit in the last interval (the last time completion
>>> processing ran), and there is no more data in the queue (i.e. it’s
>>> empty)" [2]
>>> But as only 64 packets are reported at max, the queue can only grow when
>>> it is <= 64 packets. And then it can only stay at a limit >64 until the
>>> next decrease of the limit.
>>>
>>>
>>> 2. netdev_tx_completed_queue() is called in irregular intervals:
>>>
>>> If the consumer is slow it is called approx each tx_coal_usecs.
>>> But if the consumer is fast it is called much more frequently, probably
>>> at irregular intervals depending on the scheduling.
>>> However, "BQL depends on periodic completion interrupts" [2].
>>>
>>> --> How about adding something like an interrupt that triggers every
>>> 10us and calls netdev_tx_completed_queue() with n_bql collected from
>>> (multiple) veth_xdp_rcv runs? That could solve 1. and 2.
>> Hi,
>>
>> I worked on a new version (see attachment) that addresses both issues.
>>
>> The major change is that instead of tracking the timestamp and packet
>> count as local variables in veth_xdp_rcv(), they are now stored
>> persistently in veth_rq as struct veth_bql_state. This allows completions
>> to accumulate across multiple NAPI poll calls, so
>> netdev_tx_completed_queue() can report more than 64 packets at once
>> (see point 1). To get the time I am using (the fast) sched_clock() with
>> a trick to avoid issues when switching between CPUs.
>>
>> For point 2, the coalescing deadline is now checked both before the
>> receive loop (to flush completions that timed out since the previous
>> poll) and after each consumed packet, making completion intervals more
>> regular. Still the intervals can be smaller than
>> VETH_BQL_COAL_TX_USECS, but I guess this is fine.
>>
>> I also found out that the BQL limit correlates closely with
>> VETH_BQL_COAL_TX_USECS. It essentially reflects the latency we are
>> targeting. I raised the default to 100 µs to allow DQL to converge to a
>> higher limit (for reaching 255 in the testing below).
>>
>> With the patched pktgen (respecting __QUEUE_STATE_STACK_XOFF), testing
>> shows:
>> - --nrules 0: DQL limit reaches (up to) ~255
>> - --nrules 10000: DQL limit converges to ~0 (with --gro-disable)
>>
>> These results are what I would expect from a BQL algorithm, but more
>> testing is needed of course.
>>
>> What do you think?
>
> Hi,
>
> This is exactly what I had in mind for implementing the BQL algorithm
> in this case. I did some testing with pktgen of this patch and also
> compared it to the v5 version.
>
> You can find an extension of the benchmark script with pktgen here [1],
> as well as a wrapper script (veth_bql_bench.sh) to run the test script
> with and without --bql-disable to report the difference. I also
> configured pktgen to use the qdisc as suggested by Jesper.

Great, I will use your pktgen solution from now on.
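
For reference, a minimal userspace sketch of why the one-line pktgen
change quoted above matters. It only models the txq state bits: the
constant and function names below are hypothetical stand-ins for the
kernel helpers, and the bit positions are illustrative.

```c
#include <stdbool.h>

/* Model of the three txq state bits (illustrative values, not kernel code). */
enum {
	STATE_DRV_XOFF   = 1 << 0, /* driver stopped the queue */
	STATE_STACK_XOFF = 1 << 1, /* stack (BQL) stopped the queue */
	STATE_FROZEN     = 1 << 2, /* queue is frozen */
};

/* Models the old pktgen check: ignores a queue stopped by the stack. */
static bool frozen_or_drv_stopped(unsigned long state)
{
	return (state & (STATE_DRV_XOFF | STATE_FROZEN)) != 0;
}

/* Models the patched pktgen check: also honours the stack-level stop. */
static bool frozen_or_stopped(unsigned long state)
{
	return (state & (STATE_DRV_XOFF | STATE_STACK_XOFF | STATE_FROZEN)) != 0;
}
```

With only the stack-level bit set, the first check reports "not
stopped", so a sender using it keeps transmitting past the BQL limit.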

I didn't know about the qdisc option. Is there a performance difference
with/without it? Or is it there to have ping working next to pktgen?

Consider opening a pull request :)

>
> Note: bpftrace needs to be disabled, otherwise it becomes the
> bottleneck (at least on my machine) and pktgen throughput is halved
> when enabled.

Good to know.

>
> Here are the results:
>
> v5 (not time-based):
> --nrules 0 --pktgen --no-bpftrace
> ========================================
> Results (average over 10 runs):
> ========================================
>                      BQL on     BQL off
> ---                  ------     -------
> Throughput (pps)     1980871    2169898
> Ping RTT avg (ms)    0.065      0.162
> Throughput diff      -8.7%      // BQL 8.7% lower throughput
> RTT diff             -59.9%     // BQL 60% lower latency
> ========================================
>
> Simon's time-based version:
>
> Test args: --nrules 0 --pktgen --no-bpftrace
> ========================================
> Results (average over 10 runs):
> ========================================
>                      BQL on     BQL off
> ---                  ------     -------
> Throughput (pps)     2166335    2153398
> Ping RTT avg (ms)    0.165      0.165
> Throughput diff      0.6%
> RTT diff             0.0%
>
> --pktgen --no-bpftrace --nrules 3500
> ========================================
> Results (average over 10 runs):
> ========================================
>                      BQL on     BQL off
> ---                  ------     -------
> Throughput (pps)     28569      28696
> Ping RTT avg (ms)    1.327      8.409
> Throughput diff      -0.4%
> RTT diff             -84.2%
>

I think we should run benchmarks against the stock net-next to
be safe.
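
When adding a stock net-next baseline, the same diff metric can be
reused. The helper below is a small sketch that assumes the "diff" rows
in the tables are the relative change of the "BQL on" value against the
"BQL off" value. That assumption matches the reported numbers, but the
benchmark script itself is not shown here, and rel_diff_pct() is a
hypothetical name.

```c
#include <stdio.h>

/* Relative change of `on` versus the baseline `off`, in percent. */
static double rel_diff_pct(double on, double off)
{
	return (on - off) / off * 100.0;
}

/* Prints the two diff rows for one benchmark configuration. */
static void print_diffs(double pps_on, double pps_off,
			double rtt_on, double rtt_off)
{
	printf("Throughput diff %.1f%%\n", rel_diff_pct(pps_on, pps_off));
	printf("RTT diff        %.1f%%\n", rel_diff_pct(rtt_on, rtt_off));
}
```

A negative throughput diff means BQL costs throughput; a negative RTT
diff means BQL lowers latency.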

> Seems to work now as expected.

Yes, but I think we have to keep these points in mind:

1. Limit/inflight can be bigger than VETH_RING_SIZE, because
packets can be enqueued at the same time as they are read out,
so netdev_tx_completed_queue() can theoretically be called with
a very large number of packets.
I do not think this is a deal-breaker though.
I could see such high limits/inflights when looking at the /sys
BQL statistics.

2. sched_clock() is only valid on the same CPU. When a different
CPU starts executing, its sched_clock() can be in the past compared
to the sched_clock() value saved by the previous CPU.
My trick...
min(s->time, sched_clock())
... avoids potentially extremely long intervals between
netdev_tx_completed_queue() calls but is not perfect of course.
I think CPU hopping happens rarely enough for this not to matter.
We also have to keep this in mind [1]:
"An architecture may or may not provide an implementation of
sched_clock() on its own. If a local implementation is not provided,
the system jiffy counter will be used as sched_clock()."

3. Inflight can be stuck at a value > 0 for a long time when packet
enqueueing stops. Only when packets are enqueued again (on the next
veth_xdp_rcv() call) is netdev_tx_completed_queue() executed and
inflight set to 0 again.
This can also be seen when looking at the /sys BQL statistics.
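
To make points 2 and 3 concrete, here is a minimal userspace sketch of
the coalescing idea. It is not the code from the attachment:
struct bql_coal_state, bql_coal_poll() and COAL_NS are hypothetical
names, the caller passes the clock value in instead of reading
sched_clock(), and the return value stands in for the count that would
be handed to netdev_tx_completed_queue().

```c
#include <assert.h>
#include <stdint.h>

#define COAL_NS (100 * 1000ULL) /* 100 us coalescing interval (assumed) */

/* Persistent per-queue state, kept across poll calls. */
struct bql_coal_state {
	uint64_t time;     /* start of the current coalescing interval, ns */
	unsigned int pkts; /* completions accumulated but not yet reported */
};

/*
 * Account `consumed` packets at time `now`. Returns the number of
 * completions to report now, or 0 to keep accumulating.
 */
static unsigned int bql_coal_poll(struct bql_coal_state *s, uint64_t now,
				  unsigned int consumed)
{
	unsigned int n;

	assert(s);
	s->pkts += consumed;

	/* Clock went backwards (e.g. CPU hop): restart the interval at `now`. */
	if (now < s->time)
		s->time = now;

	if (now - s->time < COAL_NS)
		return 0;

	/* Deadline reached: flush everything accumulated so far. */
	n = s->pkts;
	s->pkts = 0;
	s->time = now;
	return n;
}
```

Three polls of 64 packets inside one interval are reported as a single
completion of 192. After a backwards clock jump the wait is bounded by
one more COAL_NS instead of lasting until the new clock catches up. The
sketch also shows point 3: whatever is left in pkts stays unreported
until the function is called again.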

BTW: Yesterday I refactored the code into its own .h file as a
library, and it also works fine for TUN/TAP (+vhost-net) for me :)

Thanks for your work!
Simon

[1] Link: https://docs.kernel.org/timers/timekeeping.html

>
> [1]https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark
>
> Thanks,
> Jonas
>
>> Thanks!
>>
>> BTW: I think that this implementation could also work for other
>> software interfaces.
>>
>>> [2] Link:https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
>>>
>>>> There is an important gotcha. We actually have micro-burst of queuing
>>>> (likely due to scheduling noise). Reading BQL stats from /sys will show
>>>> BQL inflight=1, but when using the option --hist it is visible that
>>>> @inflight has a long tail (see below signature). The "qdisc" output
>>>> line also shows this happening via requeues increasing (approx 17/sec in
>>>> a test with 567Kpps). (this was with the time-based BQL impl).
>>> I understand..
>>>