Re: [RFC net] tls: TLS_SW sendfile() stalls at large MSS
From: Jiayuan Chen
Date: Thu Jun 04 2026 - 07:34:48 EST
On 6/4/26 2:53 PM, WindowsForum.com wrote:
Thanks for testing. The non-reproduction may now be the key data point. My reproducer omitted a precondition that my hosts happened to meet: a low net.ipv4.tcp_notsent_lowat. To reproduce, run this first:
sysctl -w net.ipv4.tcp_notsent_lowat=16384
I see.
Root cause
----------
The stalling hosts have tcp_notsent_lowat=16384 (local web tuning); with the stock default the watermark is effectively disabled. A TLS 1.3 record is 16406 bytes (TLS_MAX_PAYLOAD_SIZE 16384 + 22), just above that watermark -- so once tls_sw queues a single completed record, notsent (16406) exceeds the lowat, tcp_stream_memory_free() returns false, and tls_sw parks in sk_stream_wait_memory() holding exactly one corked record (the notsent:16406 + persist state from the original dump). With the default lowat, tls_sw keeps queuing, the MSG_MORE cork flushes at each sendfile() boundary, packets_out stays non-zero, and the persist timer never arms -- which is why stock kernels don't show it.
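For anyone following the arithmetic, here is a minimal userspace sketch of the watermark comparison described above. It is a model, not kernel code: model_stream_memory_free() is a hypothetical stand-in for the notsent_lowat test, and the 5 + 1 + 16 split of the 22-byte overhead (record header, inner content type, AEAD tag) is my assumption about how that figure is composed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sizes for one full TLS 1.3 record. The 22-byte overhead is taken from
 * the analysis above; its split into header + content type + tag is an
 * assumption of this sketch. */
enum {
	TLS_MAX_PAYLOAD   = 16384,
	TLS13_OVERHEAD    = 5 + 1 + 16,
	TLS13_FULL_RECORD = TLS_MAX_PAYLOAD + TLS13_OVERHEAD,
};

/* Hypothetical model of the notsent_lowat test: the stream counts as
 * writable only while the unsent byte count is below the watermark. */
static bool model_stream_memory_free(uint32_t notsent_bytes,
				     uint32_t notsent_lowat)
{
	return notsent_bytes < notsent_lowat;
}
```

With one full record queued, the model reports "not free" for lowat=16384 and "free" for a watermark at UINT32_MAX, which is the difference between the stalling and the healthy hosts.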
Three conditions must coincide:
(a) MSG_MORE forwarded on a completed record -> the sub-MSS record is corked [the bug];
(b) tcp_notsent_lowat < one TLS record (16406) -> tls_sw blocks after that one record instead of streaming past it [the trigger I'd omitted];
(c) large MSS -> the record is sub-MSS, so the cork engages [the amplifier].
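The three conditions above can be written as a single predicate. This is only a sketch for reasoning about which hosts should stall: struct stall_inputs and stall_expected() are hypothetical names, and the comparisons follow (a)-(c) as stated in this mail rather than the exact kernel checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* One full TLS 1.3 record, per the analysis above. */
#define FULL_RECORD_BYTES 16406u

/* Hypothetical bundle of the three inputs named in (a)-(c). */
struct stall_inputs {
	bool     msg_more_on_full_record; /* (a) MSG_MORE forwarded on a completed record */
	uint32_t notsent_lowat;           /* (b) effective tcp_notsent_lowat */
	uint32_t mss;                     /* (c) current MSS in bytes */
};

/* True only when all three conditions coincide. */
static bool stall_expected(const struct stall_inputs *in)
{
	/* (a) + (c): the record is sub-MSS and MSG_MORE keeps it corked. */
	bool corked = in->msg_more_on_full_record &&
		      FULL_RECORD_BYTES < in->mss;
	/* (b): the watermark sits below one record, so the sender blocks. */
	bool blocked = in->notsent_lowat < FULL_RECORD_BYTES;

	return corked && blocked;
}
```

Flipping any one input (default lowat, an MSS smaller than a record, or MSG_MORE cleared) makes the predicate false, matching the flip tests reported here.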
Confirmed by flipping only that knob: on a stalling host, restoring the default lowat -> 2.89 GiB/s; on a healthy host, setting lowat=16384 -> stalls (~0.0001 GiB/s). Everything that merely correlated (kernel build, congestion control/qdisc, wmem/rmem, tcp_mem, tcp_limit_output_bytes, CPU count, AES-GCM impl) was flip-tested and ruled out.
This doesn't change the proposed fix: clearing MSG_MORE for a full record sends it immediately, so the deadlock can't form regardless of tcp_notsent_lowat.
IMO, force-clearing the MSG_MORE flag for each record is not a good idea,
since we want multiple "APPLICATION DATA" frames in one TCP payload.
Maybe we can skip the sk_stream_memory_free check if MSG_MORE is present. The lower tcp_sendmsg_locked will check it again.
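To make the idea concrete, a toy decision model of that suggestion (not a patch, and not the tls_sw code): MODEL_MSG_MORE and model_must_wait() are hypothetical names, and the real change would have to live in the tls_sw send path.

```c
#include <stdbool.h>

/* Stand-in for the MSG_MORE flag; the bit value is only illustrative. */
#define MODEL_MSG_MORE 0x8000

/* Toy model: wait for memory only if the stream is not free AND the
 * caller did not signal more data. With MSG_MORE set, the record is
 * handed down and the lower TCP layer applies its own check. */
static bool model_must_wait(int flags, bool stream_memory_free)
{
	if (flags & MODEL_MSG_MORE)
		return false;
	return !stream_memory_free;
}
```

Under this model the record that trips the watermark is still passed down while MSG_MORE is set, so several records can share one TCP payload and the sender does not park with a single corked record.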
If you had not submitted your reply I don't think I would have kept testing it - hope this information is useful to the group.