Re: [RFC net] tls: TLS_SW sendfile() stalls at large MSS

From: Jiayuan Chen

Date: Thu Jun 04 2026 - 07:34:48 EST



On 6/4/26 2:53 PM, WindowsForum.com wrote:
Thanks for testing. The non-reproduction may now be the key data point. My reproducer omitted a precondition that my hosts happened to meet: a low net.ipv4.tcp_notsent_lowat. To reproduce, run this first:

sysctl -w net.ipv4.tcp_notsent_lowat=16384


I see.


Root cause
----------
The stalling hosts have tcp_notsent_lowat=16384 (local web tuning); the stock default is effectively disabled. A TLS 1.3 record is 16406 bytes (TLS_MAX_PAYLOAD_SIZE 16384 + 22), just above that watermark -- so once tls_sw queues a single completed record, notsent (16406) exceeds the lowat, tcp_stream_memory_free() returns false, and tls_sw parks in sk_stream_wait_memory() holding exactly one corked record (the notsent:16406 + persist state from the original dump). With the default lowat, tls_sw keeps queuing, the MSG_MORE cork flushes at each sendfile() boundary, packets_out stays non-zero, and the persist timer never arms -- which is why stock kernels don't show it.
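
To make the watermark arithmetic concrete, here is a minimal userspace sketch, not kernel code. It assumes the check reduces to "notsent bytes < tcp_notsent_lowat" (ignoring the send-buffer test and the wakeup shift), that the stock default watermark is UINT_MAX, and that the "+ 22" breaks down as a 5-byte header, a 1-byte content type and a 16-byte AEAD tag. model_stream_memory_free() and the TLS13_* macros are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLS_MAX_PAYLOAD_SIZE 16384u
/* Assumed breakdown of the "+ 22": header (5) + content type (1) + AEAD tag (16). */
#define TLS13_OVERHEAD (5u + 1u + 16u)
#define TLS13_RECORD_LEN (TLS_MAX_PAYLOAD_SIZE + TLS13_OVERHEAD)

/* Compile-time check that the arithmetic matches the 16406 quoted in the thread. */
static_assert(TLS13_RECORD_LEN == 16406u, "full TLS 1.3 record size");

/*
 * Hypothetical model of the not-sent watermark test.
 * Returns true when the writer may queue more data, false when it must wait.
 */
static inline bool model_stream_memory_free(uint32_t notsent_bytes,
					    uint32_t notsent_lowat)
{
	return notsent_bytes < notsent_lowat;
}
```

Under these assumptions, model_stream_memory_free(TLS13_RECORD_LEN, 16384) is false, so the writer parks after a single queued record, while passing UINT32_MAX as the watermark yields true and the writer keeps streaming.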

Three conditions must coincide:
  (a) MSG_MORE forwarded on a completed record -> the sub-MSS record is corked [the bug];
  (b) tcp_notsent_lowat < one TLS record (16406) -> tls_sw blocks after that one record instead of streaming past it [the trigger I'd omitted];
  (c) large MSS -> the record is sub-MSS, so the cork engages [the amplifier].

Confirmed by flipping only that knob: on a stalling host, restoring the default lowat -> 2.89 GiB/s; on a healthy host, setting lowat=16384 -> stalls (~0.0001 GiB/s). Everything that merely correlated (kernel build, congestion control/qdisc, wmem/rmem, tcp_mem, tcp_limit_output_bytes, CPU count, AES-GCM impl) was flip-tested and ruled out.

This doesn't change the proposed fix: clearing MSG_MORE for a full record sends it immediately, so the deadlock can't form regardless of tcp_notsent_lowat.


IMO, force-clearing the MSG_MORE flag for each record is not a good idea, since we want multiple "APPLICATION DATA" frames in one TCP payload.


If you had not replied, I don't think I would have kept testing it. I hope this information is useful to the group.


Maybe we can skip the sk_stream_memory_free() check if MSG_MORE is present. The lower-level tcp_sendmsg_locked() will check it again.
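
A minimal sketch of that idea, written as a userspace decision model rather than a patch. The enum, the function name tls_tx_decide() and the skip_on_more switch are all hypothetical and do not come from the tls_sw source; only MSG_MORE is the real flag from <sys/socket.h> (Linux). The model assumes the TCP layer performs its own memory check afterwards, as stated above.

```c
#include <stdbool.h>
#include <sys/socket.h>	/* MSG_MORE (Linux) */

/* Hypothetical outcome of the send path's memory check for one record. */
enum tx_action { TX_QUEUE_RECORD, TX_WAIT_FOR_MEMORY };

/*
 * Hypothetical model, not kernel code.
 * skip_on_more == false: always honor the memory-free check.
 * skip_on_more == true:  the suggested variant; when MSG_MORE is set,
 *                        bypass the check here and rely on the TCP layer
 *                        to check again.
 */
static inline enum tx_action tls_tx_decide(bool memory_free, int flags,
					   bool skip_on_more)
{
	if (memory_free)
		return TX_QUEUE_RECORD;
	if (skip_on_more && (flags & MSG_MORE))
		return TX_QUEUE_RECORD;
	return TX_WAIT_FOR_MEMORY;
}
```

In this model the stalling case (memory not free, MSG_MORE set) waits when skip_on_more is false and keeps queuing when it is true, while a send without MSG_MORE still waits in both variants.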