Re: revisiting alloc_pages_bulks semantics?

From: Chuck Lever

Date: Thu May 28 2026 - 09:18:08 EST


On 5/28/26 5:00 AM, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 09:58:11AM -0400, Chuck Lever wrote:
>> On 5/27/26 8:19 AM, Christoph Hellwig wrote:
>>> On Wed, May 27, 2026 at 12:06:08PM +0200, Vlastimil Babka (SUSE) wrote:
>>>>> alloc_pages_bulks can do partial allocations for some reasons, and
>>>>> users usually have a fallback by either looping and calling it again
>>>>> or falling back to single page allocations. This sucks! Why can't
>>>>> we get our usual try as hard as you can semantics, requiring
>>>>> GFP_NORETRY or similar to relax it?
>>>>
>>>> If we do that, do we keep the possibility of partial success, i.e. return
>>>> how many were allocated? Seems wasteful to suceed N-1 and then throw all
>>>> away, if the caller can use a fallback only for the last one.
>>>> Do some callers need all-or-nothing semantics? Should a flag indicate which
>>>> one to use?
>>>
>>> A lot of callers (but not all) need all or nothing semantics. But
>>> freeing already allocated pages is the not a major problem - the caller
>>> just has to add a release_pages call if it didn't already have one
>>> for cleaning up later failures.
>>
>> What the svc/nfsd thread is trying to avoid is sleeping uninterruptibly
>> waiting for memory resources. That stalls server shutdown, among other
>> things.
>
> I'm not fully understanding the sentence. I guess you mean that
> you want svc_thread_should_stop to intercept some memory allocation
> waits?

That's the gist of it.


> I'm curious what you think about willy's comment, or if there is
> indeed a way to always use the pages from the beginning or end in
> sunrpc.

It is a ready supply of fresh pages, but it's not used as a simple
queue. Many places in sunrpc point to rq_pages and use it as an array.

struct xdr_buf uses this array as the middle payload section of an RPC
message:

struct xdr_buf {

struct kvec head[1], /* RPC header + non-page data */

tail[1]; /* Appended after page data */



struct bio_vec *bvec;

struct page ** pages; /* Array of pages */

unsigned int page_base, /* Start of page data */

page_len, /* Length of page data */

flags; /* Flags for data disposition */



unsigned int buflen, /* Total length of storage
buffer */
len; /* Length of XDR encoded message
*/
};

There's no actual array in struct xdr_buf: the "pages" fields points to
the rq_pages array that contains the RPC message being processed.

Today, pages are consumed from the beginning of rq_pages.


>> svc_rqst_release_pages() NULLs only the range
>>
>> [rq_respages, rq_next_page)
>>
>> after each RPC, so only that range contains NULL entries. Limit the
>> rq_respages fill in svc_alloc_arg() to that range instead of
>> scanning the full array.
>
> Does it NULL the entire range, or part of it? Because if it is the
> entire above range, you don't really need the check for NULL behavior
> at all but just point the bulk allocation to this range.
It's two phase.

Phase A prepares the thread (svc_rqst) for the next RPC, and it's done
before an RPC is ready to be processed. alloc_bulk_pages refills the
NULL entries.

Phase B is done while the client is waiting for the server to send an
RPC reply, so it has to be low latency. This is where the pages are
"removed" from the array and the array pointers are set to NULL.


Note that as struct xdr_buf is transitioned from "struct page **" to a
bvec array, we indeed have good low-risk opportunities to restructure
this code and how it uses rq_pages.


--
Chuck Lever