[PATCH 04/10] nfsd: dedup nfs4_client_to_reclaim inserts
From: Jeff Layton
Date: Thu May 28 2026 - 18:01:13 EST
From: Chris Mason <clm@xxxxxxxx>
nfs4_client_to_reclaim() unconditionally allocates a new
nfs4_client_reclaim, prepends it to reclaim_str_hashtbl[], and bumps
reclaim_str_hashtbl_size with no check for an existing entry for the
same client name. After a reboot with a populated recovery directory,
that inflates the counter by one for every client that reclaims:

    boot:   load_recdir()
              nfs4_client_to_reclaim(name)      /* entry #1, size++ */
    grace:  RECLAIM_COMPLETE
              __nfsd4_create_reclaim_record_grace()
                nfs4_client_to_reclaim(name)    /* entry #2, size++ */

inc_reclaim_complete() ends the grace period early only when

    atomic_inc_return(&nn->nr_reclaim_complete) ==
        nn->reclaim_str_hashtbl_size

With reclaim_str_hashtbl_size at 2N and nr_reclaim_complete capped at
N, the equality never holds and the fast end-of-grace path is dead.
The grace period always runs out the full 90-second laundromat timer,
and the shadow entry left in the hash table carries a dangling cr_clp
for any reader that walks it.
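
The counter arithmetic above can be checked with a small userspace
model. This is only a sketch: struct grace_model and the model_*()
helpers are hypothetical names invented for illustration, not nfsd
code, and the model assumes every client loaded at boot later sends
exactly one RECLAIM_COMPLETE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the nfsd structures. */
struct grace_model {
	size_t table_size;       /* stands in for reclaim_str_hashtbl_size */
	size_t reclaim_complete; /* stands in for nr_reclaim_complete */
};

/* Every insert bumps the size, as the unpatched function does. */
static void model_insert(struct grace_model *m)
{
	m->table_size++;
}

/*
 * One RECLAIM_COMPLETE. Without dedup the client name is inserted a
 * second time before the equality test. Returns true if the fast
 * end-of-grace condition holds on this call.
 */
static bool model_reclaim_complete(struct grace_model *m, bool dedup)
{
	if (!dedup)
		model_insert(m);
	m->reclaim_complete++;
	assert(m->reclaim_complete <= m->table_size);
	return m->reclaim_complete == m->table_size;
}

/* Load n clients at boot, let all n reclaim, report if the test ever held. */
static bool model_fast_path_fires(size_t n, bool dedup)
{
	struct grace_model m = { 0, 0 };
	bool fired = false;

	for (size_t i = 0; i < n; i++)
		model_insert(&m);
	for (size_t i = 0; i < n; i++)
		fired = model_reclaim_complete(&m, dedup) || fired;
	return fired;
}
```

In this model model_fast_path_fires(n, false) is false for every n,
because the size stays ahead of the completion count, while
model_fast_path_fires(n, true) is true for any n >= 1.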
Fix nfs4_client_to_reclaim() to compute strhashval first, look the
name up with nfsd4_find_reclaim_client(), and on a hit fold the new
princhash into the existing record (if it lacks one) and return that
record without allocating or touching reclaim_str_hashtbl_size. On
kmemdup() failure during the fold-in, return NULL so
__cld_pipe_inprogress_downcall() surfaces -EFAULT to nfsdcld, matching
the miss-path contract.
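
The lookup-before-insert shape described above might look roughly like
the following userspace sketch. struct rec, struct table, and the
table_*() and dup_bytes() helpers are hypothetical stand-ins (a single
list instead of the real hash table, malloc() instead of kmemdup()),
and the sketch ignores locking and concurrency entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record and table; not the nfsd data structures. */
struct rec {
	struct rec *next;
	char *name;
	unsigned char *hash;
	size_t hash_len;
};

struct table {
	struct rec *head;
	size_t size; /* must count distinct names only */
};

static struct rec *table_find(struct table *t, const char *name)
{
	for (struct rec *r = t->head; r; r = r->next)
		if (strcmp(r->name, name) == 0)
			return r;
	return NULL;
}

/* Userspace stand-in for kmemdup(). */
static void *dup_bytes(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}

/* Insert name, or fold hash into an existing record. NULL on failure. */
static struct rec *table_insert(struct table *t, const char *name,
				const unsigned char *hash, size_t hash_len)
{
	struct rec *r;

	assert(name != NULL);
	r = table_find(t, name);
	if (r) {
		/* Hit: fold in the hash only if the record lacks one. */
		if (hash_len && r->hash_len == 0) {
			unsigned char *copy = dup_bytes(hash, hash_len);

			if (!copy)
				return NULL; /* same contract as the miss path */
			r->hash = copy;
			r->hash_len = hash_len;
		}
		return r; /* no allocation of a record, no size++ */
	}

	/* Miss: allocate, copy, link, and count the new name. */
	r = calloc(1, sizeof(*r));
	if (!r)
		return NULL;
	r->name = dup_bytes(name, strlen(name) + 1);
	if (!r->name) {
		free(r);
		return NULL;
	}
	if (hash_len) {
		r->hash = dup_bytes(hash, hash_len);
		if (!r->hash) {
			free(r->name);
			free(r);
			return NULL;
		}
		r->hash_len = hash_len;
	}
	r->next = t->head;
	t->head = r;
	t->size++;
	return r;
}
```

Inserting the same name twice returns the same record and leaves size
unchanged, which is the property the grace-period equality test needs.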
Because the fold-in writes cr_princhash.data and cr_princhash.len on
a record that is already linked into reclaim_str_hashtbl[], pair the
two stores with smp_store_release() on .len after WRITE_ONCE() on
.data, and have nfsd4_cld_check_v2() read .len with smp_load_acquire()
before READ_ONCE() on .data, so a concurrent principal-hash check
cannot observe a torn (data, len) pair.
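
For readers less familiar with the pairing, here is a userspace
analogue using C11 atomics. It is a sketch under assumptions, not the
kernel code: struct hash_slot and the slot_*() helpers are hypothetical,
relaxed atomic accesses stand in for WRITE_ONCE()/READ_ONCE(), and
memory_order_release/memory_order_acquire stand in for
smp_store_release()/smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical (data, len) pair published to concurrent readers. */
struct hash_slot {
	_Atomic(unsigned char *) data;
	atomic_size_t len;
};

static void slot_init(struct hash_slot *s)
{
	atomic_init(&s->data, (unsigned char *)NULL);
	atomic_init(&s->len, 0);
}

/* Writer: store the pointer first, then release-store the length. */
static void slot_publish(struct hash_slot *s, unsigned char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	atomic_store_explicit(&s->data, buf, memory_order_relaxed);
	atomic_store_explicit(&s->len, len, memory_order_release);
}

/*
 * Reader: acquire-load the length first. A non-zero length guarantees
 * the later pointer load sees the buffer stored before the release.
 * Returns 1 on match, 0 on mismatch, -1 if no hash is published yet.
 */
static int slot_matches(struct hash_slot *s, const unsigned char *digest)
{
	size_t len = atomic_load_explicit(&s->len, memory_order_acquire);
	unsigned char *data;

	if (len == 0)
		return -1;
	data = atomic_load_explicit(&s->data, memory_order_relaxed);
	return memcmp(data, digest, len) == 0;
}
```

The reader uses the length it loaded once for the comparison rather
than re-reading it, mirroring the local princhashlen in the patch.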
Fixes: 362063a595be ("nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld")
Assisted-by: kres:claude-opus-4-7
Signed-off-by: Chris Mason <clm@xxxxxxxx>
---
fs/nfsd/nfs4recover.c | 16 +++++++++++++---
fs/nfsd/nfs4state.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+), 3 deletions(-)
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 6ea25a52d2f4..f7905aa9fdce 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -1215,6 +1215,7 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
struct cld_net *cn = nn->cld_net;
#endif
struct nfs4_client_reclaim *crp;
+ unsigned int princhashlen;
char *principal = NULL;
/* did we already find that this client is stable? */
@@ -1249,8 +1250,17 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
#endif
return -ENOENT;
found:
- if (crp->cr_princhash.len) {
+ /*
+ * nfs4_client_to_reclaim() may fold a princhash into an
+ * already-listed reclaim record concurrently with this read.
+ * Pair with the smp_store_release() on cr_princhash.len there:
+ * if we observe a non-zero len we must also observe the
+ * matching .data pointer.
+ */
+ princhashlen = smp_load_acquire(&crp->cr_princhash.len);
+ if (princhashlen) {
u8 digest[SHA256_DIGEST_SIZE];
+ u8 *pdata;
if (clp->cl_cred.cr_raw_principal)
principal = clp->cl_cred.cr_raw_principal;
@@ -1259,8 +1269,8 @@ nfsd4_cld_check_v2(struct nfs4_client *clp)
if (principal == NULL)
return -ENOENT;
sha256(principal, strlen(principal), digest);
- if (memcmp(crp->cr_princhash.data, digest,
- crp->cr_princhash.len))
+ pdata = READ_ONCE(crp->cr_princhash.data);
+ if (memcmp(pdata, digest, princhashlen))
return -ENOENT;
}
crp->cr_clp = clp;
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index dc4ac541436f..3709d0ebcd99 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -9289,6 +9289,41 @@ nfs4_client_to_reclaim(struct xdr_netobj name, struct xdr_netobj princhash,
unsigned int strhashval;
struct nfs4_client_reclaim *crp;
+ /*
+ * A reclaim record for this client name may already exist (for
+ * example, populated at boot from the recovery directory before
+ * an in-grace RECLAIM_COMPLETE or an nfsdcld downcall delivers
+ * the same name). Dedup here so reclaim_str_hashtbl_size stays
+ * equal to the number of distinct client names; inc_reclaim_complete
+ * relies on that equality to end the grace period via the fast path.
+ */
+ crp = nfsd4_find_reclaim_client(name, nn);
+ if (crp) {
+ if (princhash.len && crp->cr_princhash.len == 0) {
+ void *pdata = kmemdup(princhash.data, princhash.len,
+ GFP_KERNEL);
+ if (pdata) {
+ /*
+ * crp is already linked into reclaim_str_hashtbl[]
+ * and may be examined concurrently by
+ * nfsd4_cld_check_v2(). Publish .data before .len
+ * with release semantics so any reader that
+ * observes a non-zero len via the paired
+ * smp_load_acquire() also observes the new
+ * data pointer.
+ */
+ WRITE_ONCE(crp->cr_princhash.data, pdata);
+ smp_store_release(&crp->cr_princhash.len,
+ princhash.len);
+ } else {
+ dprintk("%s: failed to allocate memory for princhash.data!\n",
+ __func__);
+ return NULL;
+ }
+ }
+ return crp;
+ }
+
name.data = kmemdup(name.data, name.len, GFP_KERNEL);
if (!name.data) {
dprintk("%s: failed to allocate memory for name.data!\n",
--
2.54.0