Re: [PATCH] fuse: when copying a folio delay the mark dirty until the end
From: Joanne Koong
Date: Fri Mar 20 2026 - 13:29:56 EST
On Wed, Mar 18, 2026 at 9:27 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Wed, Mar 18, 2026 at 06:32:25PM -0700, Joanne Koong wrote:
> > On Wed, Mar 18, 2026 at 2:52 PM Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
> > >
> > > Hi Joanne,
> > >
> > > On 3/18/26 22:19, Joanne Koong wrote:
> > > > On Wed, Mar 18, 2026 at 7:03 AM Horst Birthelmer <horst@xxxxxxxxxxxxx> wrote:
> > > >>
> > > >> Hi Joanne,
> > > >>
> > > >> I wonder, would something like this help for large folios?
> > > >
> > > > Hi Horst,
> > > >
> > > > I don't think it's likely that the pages backing the userspace buffer
> > > > are large folios, so I think this may actually add extra overhead with
> > > > the extra folio_test_dirty() check.
> > > >
> > > > From what I've seen, the main cost that dwarfs everything else for
> > > > writes/reads is the actual IO, the context switches, and the memcpys.
> > > > I think compared to these things, the set_page_dirty_lock() cost is
> > > > negligible and pretty much undetectable.
> > >
> > >
> > > A little bit of background here. We see in CPU flame graphs that the
> > > spin lock taken in lock_request() and unlock_request() takes about the
> > > same amount of CPU time as the memcpy. Interestingly, this is only on
> > > Intel CPUs, not AMD. Note that we are running with our custom page
> > > pinning, which just takes the pages from an array, so
> > > iov_iter_get_pages2() is not used.
> > >
> > > The reason for that unlock/lock is documented at the end of
> > > Documentation/filesystems/fuse/fuse.rst as the "Kamikaze" file system.
> > > Well, we don't have that, so for now these checks are modified in our
> > > branches to avoid the lock, although that is not upstreamable. The
> > > right solution here is to extract an array of pages and do that
> > > unlock/lock once per pagevec.
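Just to make sure I'm reading the pagevec idea right, is it roughly
something like the below? (Completely untested sketch -- copy_one_page()
and the batch size are made up, with lock_request()/unlock_request()
being the existing helpers in fs/fuse/dev.c.)

#define COPY_BATCH 32	/* made-up batch size */

static int fuse_copy_pages_batched(struct fuse_copy_state *cs,
				   struct page **pages, unsigned int npages)
{
	unsigned int i, j, n;
	int err;

	for (i = 0; i < npages; i += n) {
		n = min(npages - i, (unsigned int)COPY_BATCH);

		/* take the request lock once per batch ... */
		err = lock_request(cs->req);
		if (err)
			return err;

		for (j = 0; j < n; j++)
			copy_one_page(cs, pages[i + j]);	/* placeholder */

		/* ... and drop it once per batch */
		unlock_request(cs->req);
	}
	return 0;
}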
> > >
> > > Next in the flame graph is set_page_dirty_lock(), which also takes as
> > > much CPU time as the memcpy. Again, on Intel CPUs only. In combination
> > > with the above pagevec method, I think the right solution is to
> > > iterate over the pages, keep track of the last folio, and then set
> > > dirty only once per folio.
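And for the dirty-once-per-folio part, something like the below is what
I'm picturing (again just a rough sketch, assuming pages belonging to
the same folio show up consecutively in the array):

static void fuse_dirty_pages(struct page **pages, unsigned int npages)
{
	struct folio *last = NULL;
	unsigned int i;

	for (i = 0; i < npages; i++) {
		struct folio *folio = page_folio(pages[i]);

		/* dirty each folio once instead of once per page */
		if (folio == last)
			continue;
		last = folio;
		set_page_dirty_lock(pages[i]);
	}
}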
> >
> > Thanks for the background context. The Intel vs AMD difference is
> > interesting. The approaches you mention sound reasonable. Are you able
> > to share the flame graph, or is this easily reproducible using fio on
> > the passthrough_hp server?
> >
> >
> > > Also, I disagree that the userspace buffers are not likely to be large
> > > folios, see commit 59ba47b6be9cd0146ef9a55c6e32e337e11e7625 ("fuse:
> > > Check for large folio with SPLICE_F_MOVE"). Horst in particular
> > > persistently runs into it when doing xfstests with recent kernels. I
> > > think the issue first came up
> >
> > I think that's because xfstests uses /tmp for scratch space, so the
> >
> > "This is easily reproducible (on 6.19) with
> > CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_ALWAYS=y
> > CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_ALWAYS=y"
> >
> > triggers it, but on production workloads I don't think it's likely that
> > those source pages are backed by shmem/tmpfs or already exist in the
> > page cache as a large folio, since the server has no control over that.
>
> /me stumbles in-thread to note that xfs gets large folios for its files'
> pagecache fairly frequently now, especially as readahead ramps up.
Oh nice, I didn't realize that. Though I wonder, if the pages are
backed by xfs/ext4/etc, wouldn't any high-performance server just use
passthrough and skip splice altogether?
Thanks,
Joanne
>
> Ok back to the hell that is deploying ClownStrike through a Java program
> while Firefox repeatedly drives my laptop to OOM.
>
> --D
>
> > I also don't think most applications use splice, though maybe I'm
> > wrong here.
> >
> > For non-splice, even if the user sets
> > "/sys/kernel/mm/transparent_hugepage/enabled" to 'always', or libfuse
> > does an madvise on the buffer allocation for huge pages, that has a
> > 2 MB granularity requirement. It also depends on the user's system
> > having explicitly raised the max pages limit through the sysctl, since
> > the kernel fuse max pages limit is 256 pages (1 MB) by default. I
> > don't think that is common on most servers.
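For concreteness, the kind of thing I mean by doing the madvise on the
buffer allocation is roughly the below -- just a sketch, the helper name
and the rounding are made up, and it only helps if the granularity and
max pages limits above are actually met:

#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_ALIGN	(2UL * 1024 * 1024)	/* THP granularity on x86-64 */

static void *fuse_buf_alloc_huge(size_t size)
{
	void *buf = NULL;

	/* round up so the whole buffer can be backed by huge pages */
	size = (size + HUGE_ALIGN - 1) & ~(HUGE_ALIGN - 1);

	if (posix_memalign(&buf, HUGE_ALIGN, size))
		return NULL;

	/* hint the kernel to back this range with transparent huge pages */
	madvise(buf, size, MADV_HUGEPAGE);

	return buf;
}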
> >
> > Thanks,
> > Joanne
> >
> > > around 3.18.
> > >
> > > One can further enforce that by setting
> > > "/sys/kernel/mm/transparent_hugepage/enabled" to 'always', which is
> > > what I did when I tested the above commit. And actually, that points
> > > out that libfuse allocations should do the madvise. I'm going to do
> > > that in the next few days, maybe tomorrow.
> > >
> > >
> > > Thanks,
> > > Bernd
> >