Re: Re: [PATCH] fuse: when copying a folio delay the mark dirty until the end

From: Joanne Koong

Date: Mon Mar 16 2026 - 18:06:30 EST

On Mon, Mar 16, 2026 at 1:02 PM Horst Birthelmer <horst@xxxxxxxxxxxxx> wrote:
>
> On Mon, Mar 16, 2026 at 10:29:52AM -0700, Joanne Koong wrote:
> > On Mon, Mar 16, 2026 at 8:16 AM Horst Birthelmer <horst@xxxxxxxxxxxxxx> wrote:
> > >
> > > From: Horst Birthelmer <hbirthelmer@xxxxxxx>
> > >
> > > Doing set_page_dirty_lock() for every page is inefficient
> > > for large folios.
> > > When copying a folio (and with large folios enabled,
> > > this can be many pages) we can delay the marking dirty
> > > and flush_dcache_page() until the whole folio is handled
> > > and do it once per folio instead of once per page.
> > >
> > > Signed-off-by: Horst Birthelmer <hbirthelmer@xxxxxxx>
> > > ---
> > > Currently when doing a folio copy
> > > flush_dcache_page(cs->pg) and set_page_dirty_lock(cs->pg)
> > > are called for every page.
> > >
> > > We can do this at the end for the whole folio.
> >
> > Hi Horst,
> >
> > I think these are two different entities. cs->pg is the page that
> > corresponds to the userspace buffer / pipe while the (large) folio
> > corresponds to the pages in the page cache. flush_dcache_folio(folio)
> > and flush_dcache_page(cs->pg) are not interchangeable (I don't think
> > it's likely either that the pages backing the userspace buffer/pipe
> > are large folios).
> >
> > Thanks,
> > Joanne
>
> Hi Joanne,
>
> I feel a bit embarrassed ... but you are completely right.
> I was interested in solving this case:
>
> fuse_uring_args_to_ring() or fuse_uring_args_to_ring_pages()
>   fuse_copy_init(&cs, true, &iter)    ← cs->write = TRUE
>   fuse_copy_args(&cs, num_args, args->in_pages, ...)
>     if (args->in_pages)
>       fuse_copy_folios(cs, arg->size, 0)
>         fuse_copy_folio(cs, &ap->folios[i], ...)
>
> when we have large folios

No worries, the naming doesn't make the distinction obvious at all.
For copying out large folios right now, the copy is still page by page
because we extract one userspace buffer page at a time (e.g. the
iov_iter_get_pages2(... PAGE_SIZE, 1, ...) call in fuse_copy_fill()).
If we pass in a pages array, iov_iter_get_pages2() is able to extract
multiple pages at a time, which saves the repeated GUP setup /
irq save+restore / pagetable walk and the extra req->waitq
locking/unlocking calls, but when I benchmarked it last year I didn't
see any noticeable performance improvements from doing this. The extra
complexity didn't seem worth it. For optimized copying, I think in the
future high-performance servers will mostly just use fuse-over-iouring
zero-copy.
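
To put a rough number on how many extraction calls batching would save,
here is a back-of-the-envelope userspace model. It is not kernel code:
sim_extract_calls() and SIM_PAGE_SIZE are hypothetical names made up for
this illustration, and it assumes 4 KiB pages and a page-aligned buffer.
It only counts calls and says nothing about what each call costs.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the illustration (not taken from the kernel). */
#define SIM_PAGE_SIZE 4096u

/*
 * Hypothetical model: count how many "extract" calls (standing in for
 * iov_iter_get_pages2()) are needed to cover len bytes of a page-aligned
 * buffer when each call returns at most max_pages pages.
 */
static size_t sim_extract_calls(size_t len, size_t max_pages)
{
	size_t npages;

	assert(max_pages > 0);

	/* Round up to whole pages, then to whole batches. */
	npages = (len + SIM_PAGE_SIZE - 1) / SIM_PAGE_SIZE;
	return (npages + max_pages - 1) / max_pages;
}
```

For a 2 MiB folio this gives 512 calls when extracting one page at a
time and 32 calls with a 16-entry pages array. As said above, that
reduction didn't show up as a measurable win when I benchmarked it.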

Thanks,
Joanne

>
> But those are not the same.
>
> Thanks,
> Horst