Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

From: Christian Brauner

Date: Wed Jun 03 2026 - 09:52:59 EST

On Wed, Jun 03, 2026 at 08:45:18AM +0200, Christian Brauner wrote:
> On Tue, Jun 02, 2026 at 09:20:13PM -0700, Linus Torvalds wrote:
> > On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > >
> > > Am I understanding correctly that this will completely break zerocopy
> > > sendfile?
> >
> > Very much, yes.
> >
> > And it's worth making it very very clear that ABSOLUTELY NONE of the
> > recent big security bugs were in splice.
> >
> > They were all in the networking and crypto code that just didn't deal
> > with shared data correctly.
> >
> > So in that sense, it's a bit sad to discuss castrating splice.
>
> Well, we're completely ignoring the fact that splice()'s locking and
> interactions with pipe_lock() are complete insanity. So unless someone
> sits down and really thinks about how to rework the locking I think
> degrading splice() is just fine.
>
> > But it's probably still the right thing to at least try.
>
> Yes.
>
> > I just suspect we'll never get real answers without going the "let's
> > just see what happens" route...
>
> Yes.

Reading this thread again I'm really amazed how willingly people argue
to remain locked into a really broken API even when they're given a risky
but worthwhile chance to kill it for good. Anyway, odd-userspace behavior
time:

David reported vmsplice01 failing in the LTP testsuite after the change:

11297 20:41:02.548383 <LAVA_SIGNAL_STARTTC vmsplice01>
11298 20:41:02.548518 tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsZ13ZQj as tmpdir (tmpfs filesystem)
11299 20:41:02.548656 tst_test.c:2047: TINFO: LTP version: 20260130
11300 20:41:02.548793 tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260602 #1 SMP PREEMPT Tue Jun 2 18:13:29 UTC 2026 aarch64
11301 20:41:02.548932 tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
11302 20:41:02.549069 tst_test.c:1875: TINFO: Overall timeout per run is 0h 01m 30s
11303 20:41:02.549205 tst_test.c:1632: TINFO: tmpfs is supported by the test
11304 20:41:02.549340 Test timeouted, sending SIGKILL!
11305 20:41:02.549477 tst_test.c:1947: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
11306 20:41:02.549614 tst_test.c:1949: TBROK: Test killed! (timeout?)
11307 20:41:02.549751
11308 20:41:02.549887 Summary:
11309 20:41:02.550021 passed 0
11310 20:41:02.550155 failed 0
11311 20:41:02.550290 broken 1
11312 20:41:02.550450 skipped 0
11313 20:41:02.550582 warnings 0
11314 20:41:02.550710
11315 20:41:02.550838 <LAVA_SIGNAL_ENDTC vmsplice01>

So I looked at the test:

	while (v.iov_len) {
		/*
		 * in a real app you'd be more clever with poll of course,
		 * here we are basically just blocking on output room and
		 * not using the free time for anything interesting.
		 */
		if (poll(&pfd, 1, -1) < 0)
			tst_brk(TBROK | TERRNO, "poll() failed");

		written = vmsplice(pipes[1], &v, 1, 0);
		if (written < 0) {
			tst_brk(TBROK | TERRNO, "vmsplice() failed");
		} else {
			if (written == 0) {
				break;
			} else {
				v.iov_base += written;
				v.iov_len -= written;
			}
		}

		SAFE_SPLICE(pipes[0], NULL, fd_out, &offset, written, 0);
		//printf("offset = %lld\n", (long long)offset);
	}

Prior to the change, add_to_pipe() returned -EAGAIN the moment the pipe
was full, so iter_to_pipe() stopped and returned a partial count capped
at pipe capacity. For a 128K buffer over a 64K pipe the first call
returns 64K, the test drains it, and the second call returns the
remaining 64K. Done.

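After this change, do_writev(... flags & SPLICE_F_NONBLOCK ? RWF_NOWAIT :
0) calls pipe_write(), which does not stop when the pipe fills. It
blocks until the entire iovec is consumed. The test only drains the pipe
after vmsplice() returns, so nothing ever makes room and it hangs until
the timeout.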
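
To make the fill-then-drain pattern concrete, here is a minimal
userspace sketch. It is not the LTP test and it does not call
vmsplice(): it uses a plain write() on an O_NONBLOCK pipe as a stand-in
for "return a partial count once the pipe is full". The names
run_fill_drain() and struct fill_drain_result are made up for
illustration, and F_GETPIPE_SZ makes it Linux-only.

```c
#define _GNU_SOURCE		/* for F_GETPIPE_SZ (Linux-specific) */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical result record, for illustration only. */
struct fill_drain_result {
	long capacity;	/* pipe capacity reported by F_GETPIPE_SZ */
	ssize_t first;	/* bytes accepted by the first write() */
	size_t total;	/* bytes pushed through the pipe overall */
	int calls;	/* number of write() calls that were needed */
};

/*
 * Push a buffer twice the pipe's capacity through the pipe from a
 * single thread: write whatever fits, drain exactly that much, repeat.
 * Returns 0 on success, -1 on any error.
 */
static int run_fill_drain(struct fill_drain_result *r)
{
	int p[2], ret = -1;
	char *buf = NULL, *sink = NULL;
	size_t len, cap, off = 0;

	r->capacity = 0;
	r->first = 0;
	r->total = 0;
	r->calls = 0;

	if (pipe(p) < 0)
		return -1;
	/* Only the write end is non-blocking; reads always find data. */
	if (fcntl(p[1], F_SETFL, O_NONBLOCK) < 0)
		goto out;
	r->capacity = fcntl(p[1], F_GETPIPE_SZ);
	if (r->capacity <= 0)
		goto out;

	cap = (size_t)r->capacity;
	len = 2 * cap;
	buf = calloc(1, len);
	sink = malloc(cap);
	if (!buf || !sink)
		goto out;

	while (off < len) {
		/* Partial write: comes back once the pipe is full. */
		ssize_t n = write(p[1], buf + off, len - off);
		size_t left;

		if (n <= 0)
			goto out;
		if (r->calls++ == 0)
			r->first = n;
		off += (size_t)n;

		/* Drain what was just written, like the splice() step. */
		for (left = (size_t)n; left > 0; ) {
			ssize_t m = read(p[0], sink, left < cap ? left : cap);

			if (m <= 0)
				goto out;
			left -= (size_t)m;
		}
	}
	r->total = off;
	ret = 0;
out:
	free(buf);
	free(sink);
	close(p[0]);
	close(p[1]);
	return ret;
}
```

Drop the O_NONBLOCK and the same single-threaded loop never gets past
the first write(), because nothing drains the pipe while the writer
sleeps. That is the shape of the vmsplice01 hang.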

I kinda think we need to preserve similar semantics.