Re: [PATCH] lockd: pin next file across nlm_inspect_file lock-drop

From: Jeff Layton

Date: Sat May 23 2026 - 07:05:47 EST

On Fri, 2026-05-22 at 21:42 -0400, Michael Bommarito wrote:
> nlm_traverse_files() walks each nlm_files[] hash bucket with
> hlist_for_each_entry_safe(file, next, ...). For each matching file
> it bumps f_count, drops nlm_file_mutex to run nlm_inspect_file()
> (which may sleep walking blocks, shares, and the inode lock list),
> then reacquires the mutex and decrements f_count before continuing
> to the saved next.
>
> The f_count bump pins the current file across the lock-drop, but
> nothing pins next. Any nlmsvc thread that holds the last reference
> on the file at next will, during that window, call
> nlm_release_file() -> nlm_delete_file() under nlm_file_mutex,
> hlist_del() it from the bucket, and kfree() it. When
> nlm_traverse_files() reacquires the mutex and the macro reads the
> next entry's f_list.next on the following iteration, the read lands
> in the freed slab.
>
> A naive restart-on-action variant would deadlock-spin against an
> nlm_release_file holder: nlm_inspect_file() does not always drain
> the file (it can return 1 with an RPC still holding f_count above
> the cleanup threshold), and the outer predicate is_failover_file()
> matches static attributes of the file, so a restart can keep
> re-finding the same un-cleanable file until the external RPC ref
> drops.
>
> Pin the neighbour explicitly instead. Walk the bucket with two
> locally-pinned cursors at a time: file (current, pinned by the
> prior iteration's next bump) and next (one ahead). Drop file's pin
> at the end of each iteration, then advance to next, which is still
> alive because we hold its f_count above zero across the unlock.
> This bounds the walk at O(N) per bucket and never observes a freed
> neighbour. Factor the f_count/list/share/lock cleanup into a
> helper so the no-match path also drops a stale empty file rather
> than leaving it in the table.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: 01df9c5e918a ("LOCKD: Fix a deadlock in nlm_traverse_files()")
> Assisted-by: Claude:claude-opus-4-7
> Signed-off-by: Michael Bommarito <michael.bommarito@xxxxxxxxx>
> ---
> fs/lockd/svcsubs.c | 61 +++++++++++++++++++++++++++++++---------------
> 1 file changed, 42 insertions(+), 19 deletions(-)
>
> Reproduced under UML + KASAN with a loopback NFSv3 mount, 768
> concurrent POSIX fcntl(F_SETLKW) holders, and parallel writes to
> /proc/fs/nfsd/unlock_filesystem forcing nlmsvc_unlock_all_by_sb()
> to walk the table while clients churn locks.
>
> Stock kernel:
>
> BUG: KASAN: slab-use-after-free in nlm_traverse_files+0x71d/0x9d0
> Read of size 8 at addr 0000000070314800 by task nlm-init-...
>
> Allocated by: nlm_lookup_file via nlm4svc_proc_lock
> Freed by: another nlm_traverse_files instance freeing a
> file whose f_count dropped to zero during the
> nlm_inspect_file() unlock window
>
> Patched UML kernel ran the same harness silently.
>
> Pin-next was chosen over restart-on-action because the latter can
> livelock when nlm_inspect_file() returns 1 with an RPC reference
> still holding the file above the cleanup threshold and the outer
> is_failover_file() predicate matching static attributes.
>
>
> diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
> index dd0214dcb6950..2bfa32207f10c 100644
> --- a/fs/lockd/svcsubs.c
> +++ b/fs/lockd/svcsubs.c
> @@ -295,36 +295,59 @@ static void nlm_close_files(struct nlm_file *file)
> /*
> * Loop over all files in the file table.
> */
> +static void nlm_file_release(struct nlm_file *file)
> +{
> + if (list_empty(&file->f_blocks) && !file->f_locks
> + && !file->f_shares && !file->f_count) {
> + hlist_del(&file->f_list);
> + nlm_close_files(file);
> + kfree(file);
> + }
> +}
> +
> static int
> nlm_traverse_files(void *data, nlm_host_match_fn_t match,
> int (*is_failover_file)(void *data, struct nlm_file *file))
> {
> - struct hlist_node *next;
> - struct nlm_file *file;
> + struct nlm_file *file, *next;
> int i, ret = 0;
>
> mutex_lock(&nlm_file_mutex);
> for (i = 0; i < FILE_NRHASH; i++) {
> - hlist_for_each_entry_safe(file, next, &nlm_files[i], f_list) {
> - if (is_failover_file && !is_failover_file(data, file))
> - continue;
> + file = hlist_entry_safe(nlm_files[i].first,
> + struct nlm_file, f_list);
> + if (file)
> file->f_count++;
> - mutex_unlock(&nlm_file_mutex);
> -
> - /* Traverse locks, blocks and shares of this file
> - * and update file->f_locks count */
> - if (nlm_inspect_file(data, file, match))
> - ret = 1;
> + while (file) {
> + /*
> + * Pin the next neighbour before we drop the mutex
> + * for nlm_inspect_file(); a concurrent
> + * nlm_release_file() under the same mutex would
> + * otherwise be free to unlink and kfree it during
> + * the unlock window, leaving us to dereference a
> + * freed slab when we walked to next afterwards.
> + */
> + next = hlist_entry_safe(file->f_list.next,
> + struct nlm_file, f_list);
> + if (next)
> + next->f_count++;
> +
> + if (!is_failover_file || is_failover_file(data, file)) {
> + mutex_unlock(&nlm_file_mutex);
> +
> + /*
> + * Traverse locks, blocks and shares of this
> + * file and update file->f_locks count.
> + */
> + if (nlm_inspect_file(data, file, match))
> + ret = 1;
> +
> + mutex_lock(&nlm_file_mutex);
> + }
>
> - mutex_lock(&nlm_file_mutex);
> file->f_count--;
> - /* No more references to this file. Let go of it. */
> - if (list_empty(&file->f_blocks) && !file->f_locks
> - && !file->f_shares && !file->f_count) {
> - hlist_del(&file->f_list);
> - nlm_close_files(file);
> - kfree(file);
> - }
> + nlm_file_release(file);
> + file = next;
> }
> }
> mutex_unlock(&nlm_file_mutex);

Sashiko seems to think there is a regression here. See:

https://sashiko.dev/#/patchset/20260523014203.2462827-1-michael.bommarito@xxxxxxxxx?part=1
--
Jeff Layton <jlayton@xxxxxxxxxx>