Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper

From: NeilBrown

Date: Fri May 22 2026 - 00:17:08 EST

On Wed, 20 May 2026, Amir Goldstein wrote:
> On Wed, May 20, 2026 at 9:16 AM Ian Kent <raven@xxxxxxxxxx> wrote:
> >
> > On 19/5/26 17:12, Jan Kara wrote:
> > > On Mon 18-05-26 21:39:13, Ian Kent wrote:
> > >> On 18/5/26 16:19, Jan Kara wrote:
> > >>> Hi Ian,
> > >>>
> > >>> On Mon 18-05-26 10:55:43, Ian Kent wrote:
> > >>>> On 18/5/26 07:55, NeilBrown wrote:
> > >>>>> On Fri, 15 May 2026, Horst Birthelmer wrote:
> > >>>>> According to the email you linked, a problem arises when a directory has
> > >>>>> a great many negative children. Code which walks the list of children
> > >>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays
> > >>>>> and result in long lock-hold times. So maybe a limit on negative
> > >>>>> dentries for any parent is what we really want. That would be clumsy to
> > >>>>> implement I imagine.
> > >>>> But the notion of dropping the dentry in ->d_delete() on last dput() is
> > >>>> simple enough but did see regressions (the only other place in the VFS
> > >>>> besides dentry_kill() that the inode is unlinked from the dentry on
> > >>>> dput()). I wonder if the regression was related to the test itself
> > >>>> deliberately recreating deleted files and if that really is normal
> > >>>> behaviour. By itself that should prevent almost all negative dentries
> > >>>> being retained. Although file systems could do this as well (think XFS
> > >>>> inode recycling) it should be reasonable to require it be left to the
> > >>>> VFS.
> > >>>>
> > >>>> But even that's not enough given that, in my case, there would still be
> > >>>> around 4 million dentries in the LRU cache and in fsnotify there are
> > >>>> directory child traversals holding the parent i_lock "spinlock" that are
> > >>>> going to cause problems.
> > >>> Do you mean there are very many positive children of a directory?
> > >> Didn't quantify that.
> > >>
> > >> The symptom is the "Spinlock held for more than ... seconds" occurring in
> > >> the log. So there are certainly a lot of children in the list, but it's
> > >> an assumption the ratio of positive to negative entries is roughly the
> > >> same as the overall ratio in the dcache.
> > > OK, but that's not necessarily true. I have seen these complaints from the
> > > kernel but in all the cases I remember it was due to negative dentries
> > > accumultating in a particular directory. There are certain apps such as
> > > ElasticSearch which really do like creating huge amounts of negative
> > > dentries in one directory - they use hashes as filenames and use directory
> > > lookup instead of a DB table lookup and lookup lots of non-existent keys...
> >
> > Umm ... that's a good point, I hadn't paid much attention to ENOENT result
> >
> > lookups, I'll need to check on the like cycle of those, I think they do get
> >
> > hashed. That has to be the other source of negative dentries that I've
> >
> > neglected ...
> >
>
> Yes, it has been claimed that some real life workloads create a lot of those.
>
> If we can keep those at the tail of the children list, it will be best
> for the fsnotify
> iteration, which only cares about positive dentries.
>
> > >
> > >>>> so why is this traversal even retained in fsnotify?
> > >>> Not sure which traversal you mean but if you set watch on a parent, you
> > >>> have to walk all children to set PARENT_WATCHED flag so that you don't miss
> > >>> events on children...
> > >> Yes, that traversal is what I'm questioning ... again thanks.
> > >>
> > >> I think the function name is still fsnotify_set_children_dentry_flags()
> > >> in recent kernels, the subject of commit 172e422ffea2 I mentioned above.
> > > OK, thanks.
> > >
> > >> When you say miss events are you saying that accessing the parent dentry to
> > >> work out if the child needs to respond to an event is quite expensive in the
> > >> overall event processing context, that might make more sense to me ... or do
> > >> I completely not yet understand the reasoning behind the need for the flag?
> > > Close but not quite. The cost is the overhead of dget_parent() in
> > > fsnotify_parent() which is often a couple of cache cold loads and atomic
> > > instructions to find out we don't need to send any event for the current
> > > write(2) or read(2) call. It gets worse if there are many IOs happening to
> > > dentries in the same directory from multiple CPUs because instead of
> > > cache-cold loads you get a cacheline contention on the parent.
> > >
> > >>>>> But what if we move dentries to the end of the list when they become
> > >>>>> negative, and to the start of the list when they become positive? Then
> > >>>>> code which walks the child list could simply abort on the first
> > >>>>> negative.
> > >>>>>
> > >>>>> I doubt that would be quite as easy as it sounds, but it would at least
> > >>>>> be more focused on the observed symptom rather than some whole-system
> > >>>>> number which only vaguely correlates with the observed symptom.
> > >>>>>
> > >>>>> Maybe a completely different approach: change children-walking code to
> > >>>>> drop and retake the lock (with appropriate validation) periodically.
> > >>>>> What too would address the specific symptom.
> > >>>> Another good question.
> > >>>>
> > >>>> I have assumed that dropping and re-taking the lock cannot be done but
> > >>>> this is a question I would like answered as well. Dropping and re-taking
> > >>>> lock would require, as Miklos pointed out to me off-list, recording the
> > >>>> list position with say a cursor, introducing unwanted complexity when it
> > >>>> would be better to accept the cost of a single extra access to the parent
> > >>>> flags (which I assume is one reason to set the flag in the child).
> > >>> The parent access is actually more expensive than you might think. Based on
> > >>> experience with past fsnotify related performance regression I expect some
> > >>> 20% performance hit for small tmpfs writes if you add unconditional parent
> > >>> access to the write path.
> > >> That sounds like a lot for what should be a memory access of an already in
> > >> memory structure since the parent must be accessed to traverse the list of
> > >> child entries. I clearly don't fully understand the implications of what
> > >> I'm saying but there has been mention of another context ...
> > > Parent dentry is of course in memory but often cache cold - you don't need
> > > the parent to do e.g. write(2) to an already open file. You seem to be
> > > somewhat confused about the child dentry list traversal (or maybe I'm
> > > misunderstanding) - that happens only when placing the notification mark
> > > but definitely not for each IO operation.
> >
> > LOL, confusion is a pretty common state of mind for me!
> >
> >
> > I do get your point though and I am confusing the traversal with other
> >
> > operations. I think this answers the question I've been asking (maybe
> >
> > that wasn't obvious) about the reason for the traversal (ie. the reason
> >
> > to maintain a flag in the child).
> >
> >
> > While I have looked at the code here I haven't absorbed it and I
> >
> > definitely don't understand it, your continued patience is appreciated
> >
> > and will be beneficial when I get time to look at it a bit closer. I
> >
> > do still need to use a notifications mechanism to match up with Miklos's
> >
> > statmount implementation to get the full benefit of that in user space,
> >
> > if I ever get a chance to work on that again.
> >
> >
> > So it sounds like it would be worth while considering a traversal that's
> >
> > based on taking a reference on each dentry rather than a spinlock for
> >
> > the duration. It would be tricky though, for obvious reasons, like
> >
> > children added during the traversal, added overhead of getting the next
> >
> > entry reference, etc.
>
> Didn't look closely, but it feels like RCU traversal should be
> possible if entries are added to the tail, or to the END_OF_POSITIVE
> location.
>
> When we discussed the "negavites at tail" at LSFMM
> it was said that managing the transitions positive<->negative
> would be challenging, but I don't know that anyone tried to look closer at this.

I had a quick look. Most users of d_sib walk from the parent->d_children
with the parent ->d_lock held, so they shouldn't notice a movement in
the list.
The two exceptions I could find are d_walk() and the readdir code in
libfs.c.
I think the main problem case would be if they were holding a dentry as
a cursor which transitioned when the parent d_lock is dropped and retaken.

d_walk already needs to cope with a concurrent rename messing with
its cursor so possibly something similar could be used to trigger a
restart.

libfs readdir walks from a DCACHE_DENTRY_CURSOR which will never
transition and so won't move spontaneously. That is exactly as safe as
walking from the parent.

So I think d_walk() might need some help to avoid getting lost. It could
probably simply check if its cursor changed ->d_inode between dropping
->d_lock and retaking it. If it did, then restart.

It isn't clear to me that we can track a "end of positive" location. I
think we would need to move negatives to the end and positives to the
start. Can you see a down-side with doing that?

Thanks,
NeilBrown

>
> At least for fsnotify, positive->negative transition is not a problem
> w.r.t skipping entry and observing entry twice during positive iteration.
>
> If negative->positive transitions inserts at END_OF_POSITIVE
> location, then should be fine as well?
>
> Iterators that need to iterate all children can do this under lock.
>
> Does that make sense?
>
> Thanks,
> Amir.
>