Re: [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper

From: Mateusz Guzik

Date: Sun May 17 2026 - 05:18:39 EST


On Sat, May 16, 2026 at 04:52:54PM +0200, Horst Birthelmer wrote:
> From: Horst Birthelmer <hbirthelmer@xxxxxxx>
>
> The dcache only shrinks under memory pressure, which is rarely reached
> on machines with ample RAM, so cached negative dentries can accumulate
> without bound. Give administrators a soft cap they can set,
> and a background worker that prefers negative dentries when reclaiming.
>
> Two new sysctls under /proc/sys/fs/:
>
>   dentry-limit              -- soft cap on nr_dentry. 0 (default)
>                                disables the feature; behaviour is then
>                                identical to before.
>   dentry-limit-interval-ms  -- pacing for the worker while still over
>                                the cap. Default 1000, minimum 1.
>
> When the cap is exceeded, a delayed_work runs in two phases:
>
>  1. iterate_supers() draining only negative dentries from every LRU.
>     Positive entries are rotated past so the walk makes progress.
>     DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed
>     cap should evict even hot negatives before any positive entry.
>  2. If still over the cap, iterate_supers() again with the same
>     isolate callback the memory-pressure shrinker uses.
>
> Signed-off-by: Horst Birthelmer <hbirthelmer@xxxxxxx>
> ---
> There was a discussion at LSFMM about servers with too many cached
> negative dentries.
> That gave me the idea to limit the number of dentries in general
> if the system administrator needs that.
>

I wrote about the negative entries problem here:

https://lore.kernel.org/linux-fsdevel/f7bp3ggliqbb7adyysonxgvo6zn76mo4unroagfcuu3bfghynu@7wkgqkfb5c43/#t

The mechanism as suggested here will end up evicting *useful* negative
entries. Granted, they will be recreated soon enough so it's not a
tragedy but it still is an avoidable perf loss.

What is needed in the long run is a mechanism which aggressively
recycles stale negative entries and recognizes which ones should be
saved for the time being.

Below some magic threshold you just allocate a new negative entry.

All new entries would get a grace period where they need to get hits and
prove useful OR get whacked. If you are at or above the threshold and
are allocating a new entry, you can whack the oldest negative one which
did not make it.
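
To make the above concrete, here is a minimal userspace sketch of the
threshold-plus-grace-period idea. It is a toy model, not kernel code:
every name and number in it (neg_cache, NEG_THRESHOLD, GRACE_TICKS and
so on) is made up for illustration, and a real implementation would
have to deal with locking, the LRU and per-sb accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names and numbers below are hypothetical, chosen for illustration. */
#define NEG_SLOTS      8	/* capacity of the toy cache */
#define NEG_THRESHOLD  4	/* the "magic threshold" */
#define GRACE_TICKS   10	/* how long a new entry has to prove itself */

struct neg_entry {
	bool in_use;
	unsigned long born;	/* tick at which the entry was created */
	unsigned int hits;	/* lookups that found this entry */
};

struct neg_cache {
	struct neg_entry slots[NEG_SLOTS];
	size_t count;
};

/* Oldest entry that outlived its grace period without a single hit. */
static struct neg_entry *neg_oldest_failed(struct neg_cache *c,
					   unsigned long now)
{
	struct neg_entry *victim = NULL;

	for (size_t i = 0; i < NEG_SLOTS; i++) {
		struct neg_entry *e = &c->slots[i];

		if (!e->in_use || e->hits > 0)
			continue;	/* free slot, or proved useful */
		if (now - e->born < GRACE_TICKS)
			continue;	/* still in its grace period */
		if (!victim || e->born < victim->born)
			victim = e;
	}
	return victim;
}

/* Allocate a negative entry; returns NULL if the toy cache is full. */
static struct neg_entry *neg_alloc(struct neg_cache *c, unsigned long now)
{
	/* At or above the threshold: whack the oldest failed entry, if any. */
	if (c->count >= NEG_THRESHOLD) {
		struct neg_entry *victim = neg_oldest_failed(c, now);

		if (victim) {
			victim->in_use = false;
			c->count--;
		}
	}

	/* Below the threshold (or nothing to whack): just take a free slot. */
	for (size_t i = 0; i < NEG_SLOTS; i++) {
		struct neg_entry *e = &c->slots[i];

		if (e->in_use)
			continue;
		e->in_use = true;
		e->born = now;
		e->hits = 0;
		c->count++;
		assert(c->count <= NEG_SLOTS);
		return e;
	}
	return NULL;
}
```

Entries with hits are never picked as victims here, which is the whole
point: the hot minority survives while the hitless majority gets
recycled one allocation at a time.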

This is just one idea. What is not up for debate is the discrepancy
between the small subset of negative entries with tons of hits and the
ones which get virtually no traffic at all.

Whatever the mechanism, it will have to take advantage of that
discrepancy.

> This is somewhat related to [1] where it would address the same
> symptoms but in a more unobtrusive way, by just garbage collecting
> the negative and then the unused cache entries.
>
> The other effect I have seen regarding this is that FUSE
> will not forget inodes (no FORGET call to the FUSE server)
> even after the last reference has been closed until much later.
>
> In a FUSE server that mirrors the kernel cached inodes in user space
> because it has to keep a lot of private data for every node
> this puts an unnecessary memory strain on that userspace entity
> especially if the memory is limited for its cgroup.

I don't know anything about how FUSE works. In this context I presume
you have a mount point backed by FUSE and the problematic memory usage
stems from inodes created against such a mount point.

This would suggest you would be better served with a mechanism which
allows userspace to cull some number of dentries for a given mount
point, maybe even with an optional preference for negative entries if
that's considered better for a given fs.

Or to put it differently, I would look into exposing sb shrinkers to
root instead of rolling with a global scan.

> +static enum lru_status dentry_lru_isolate_negative(struct list_head *item,
> +		struct list_lru_one *lru, void *arg)
> +{
> +	struct list_head *freeable = arg;
> +	struct dentry *dentry = container_of(item, struct dentry, d_lru);
> +
> +	if (!spin_trylock(&dentry->d_lock))
> +		return LRU_SKIP;

If anything of the sort is to land, you definitely want to pre-check
d_count and d_is_negative without the lock.
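
A rough userspace model of the ordering I mean, using C11 atomics in
place of the real primitives. The struct and helpers (model_dentry,
model_isolate_negative and friends) are invented for this sketch and
are not the dentry API; in the kernel the unlocked peek would
presumably be a READ_ONCE-style read of the refcount plus
d_is_negative(), but that mapping is my assumption and is not taken
from the patch. The unlocked values are only a hint, so they are
rechecked once the lock is held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LRU walk results. */
enum model_status { MODEL_FREED, MODEL_PASS_OVER, MODEL_SKIP };

struct model_dentry {
	atomic_int count;	/* stands in for the dentry refcount */
	atomic_bool negative;	/* stands in for "is negative" */
	atomic_flag lock;	/* stands in for d_lock */
	int lock_attempts;	/* instrumentation for the sketch only */
};

static void model_dentry_init(struct model_dentry *d, int count, bool negative)
{
	assert(count >= 0);
	atomic_init(&d->count, count);
	atomic_init(&d->negative, negative);
	atomic_flag_clear(&d->lock);
	d->lock_attempts = 0;
}

static bool model_trylock(struct model_dentry *d)
{
	d->lock_attempts++;
	return !atomic_flag_test_and_set_explicit(&d->lock,
						  memory_order_acquire);
}

static void model_unlock(struct model_dentry *d)
{
	atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

static enum model_status model_isolate_negative(struct model_dentry *d)
{
	/* Cheap unlocked peek: racy, so treat the result as a hint only. */
	if (atomic_load_explicit(&d->count, memory_order_relaxed) != 0 ||
	    !atomic_load_explicit(&d->negative, memory_order_relaxed))
		return MODEL_PASS_OVER;	/* never touched the lock */

	if (!model_trylock(d))
		return MODEL_SKIP;

	/* The state may have changed since the peek; recheck under the lock. */
	if (atomic_load(&d->count) != 0 || !atomic_load(&d->negative)) {
		model_unlock(d);
		return MODEL_PASS_OVER;
	}

	/* A real callback would move the entry to the freeable list here. */
	model_unlock(d);
	return MODEL_FREED;
}
```

With this ordering the walk only bounces the lock cacheline for entries
it has a realistic chance of reclaiming; positive or in-use entries are
passed over without ever touching the lock.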