Re: [PATCH 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count

From: Andrew Morton
Date: Sun May 04 2025 - 02:55:03 EST


On Sun, 4 May 2025 08:36:23 +0200 Max Kellermann <max.kellermann@xxxxxxxxx> wrote:

> On Sun, May 4, 2025 at 4:47 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > Documentation/, please?
>
> Do you mean Documentation/ABI/testing/ ? (like
> Documentation/ABI/testing/sysfs-kernel-oops_count)
> I'll add that; I was confused by the directory name "testing" and
> didn't expect to find actual documentation there.

I find it helpful to grep around for similar things:

hp2:/usr/src/25> egrep -rl "hung_task_detect_count|warn_count|oops_count" Documentation
Documentation/ABI/testing/sysfs-kernel-warn_count
Documentation/ABI/testing/sysfs-kernel-oops_count
Documentation/admin-guide/sysctl/kernel.rst

I'm not sure that we've been very complete/consistent in these things.
If you have time, please check that we've covered these things
appropriately.
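
For a repeatable version of that check, something like the following could
do the same job as the egrep above. This is only a rough sketch: the function
name docs_mentioning is made up for illustration, and it assumes it is pointed
at the Documentation directory of a kernel tree.

```python
import os


def docs_mentioning(doc_root, names):
    """Map each name to the sorted list of files under doc_root that mention it.

    Roughly equivalent to running `egrep -rl` once per name; paths are
    returned relative to doc_root.
    """
    hits = {name: [] for name in names}
    for dirpath, _dirs, files in os.walk(doc_root):
        for fname in files:
            path = os.path.join(dirpath, fname)
            try:
                # Tolerate odd encodings; we only care about plain substrings.
                with open(path, encoding="utf-8", errors="replace") as f:
                    text = f.read()
            except OSError:
                continue  # unreadable file, skip it
            for name in names:
                if name in text:
                    hits[name].append(os.path.relpath(path, doc_root))
    return {name: sorted(paths) for name, paths in hits.items()}


# Example call (run from the top of a kernel tree):
#   docs_mentioning("Documentation",
#                   ["hung_task_detect_count", "warn_count", "oops_count"])
```

A name that maps to an empty list has no documentation mentioning it at all,
which is the case worth looking at first.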


> > > Having this is useful for monitoring tools.
> >
> > Useful how? Use cases? Examples?
>
> To detect whether the machine is healthy. If the kernel has
> experienced a soft lockup, it's probably due to a kernel bug, and I'd
> like to detect that quickly and easily. There is currently no way to
> detect that other than parsing dmesg or observing indirect effects,
> such as certain tasks not responding, but then I need to observe all
> tasks. I'd rather be able to detect the primary cause easily - just
> like some people decided that they want to observe an oops and a
> warning counter.
>
> We always run the latest stable kernel on our production servers, and
> this has brought great sorrow for the last year (I think the big netfs
> drama began in 6.9 or so when the pgpriv2 refactoring began). There
> have been numerous netfs/NFS/Ceph regressions, we had just as many
> production outages, and the maintainers wouldn't respond to my bug
> reports, so I had to figure it all out myself.
> The latest regression that quickly took down our servers was a
> "stable" backport of a performance optimization for epoll in 6.14.4,
> leading to soft lockups in ep_poll(), see
> https://lore.kernel.org/lkml/20250429185827.3564438-1-max.kellermann@xxxxxxxxx/
> - but we observed it only after everything had already fallen apart.
> Since our main process has switched from epoll to io_uring, only
> second-order processes were falling apart. Had we had a soft lockup
> counter, we could have noticed it earlier.

That's all great stuff, thanks. Please include it in the [0/N]?
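
To make the use case concrete in the cover letter, a monitoring tool might
poll the counters roughly as below. This is a sketch under assumptions, not
the patch's interface: it assumes the files end up at
/sys/kernel/softlockup_count and /sys/kernel/hardlockup_count as the subject
line proposes, and that each holds a single decimal integer the way
oops_count does. The helper names read_counters and increased are
hypothetical.

```python
from pathlib import Path

# Assumed file names, taken from the patch subject; not confirmed ABI.
COUNTERS = ("softlockup_count", "hardlockup_count")


def read_counters(base="/sys/kernel"):
    """Return {name: int or None}; None means missing or unparsable."""
    values = {}
    for name in COUNTERS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # e.g. kernel without the patch
    return values


def increased(previous, current):
    """List the counters whose value grew between two samples."""
    grown = []
    for name in COUNTERS:
        before, after = previous.get(name), current.get(name)
        if before is None or after is None:
            continue  # cannot compare without both samples
        if after > before:
            grown.append(name)
    return grown
```

A monitor would call read_counters() periodically and raise an alert whenever
increased() returns a non-empty list, instead of parsing dmesg.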

> > A proposal to permanently extend Linux's userspace API requires better
> > justification than an unsubstantiated assertion, surely?
>
> The commits that added warn_count/oops_count literally only said "is a
> fairly interesting signal". See commits 9db89b411170 ("exit: Expose
> "oops_count" to sysfs") and 8b05aa263361 ("panic: Expose "warn_count"
> to sysfs"). That's quite an unsubstantiated assertion, too, isn't it?
>
> I agree with you, but I thought the point for a soft lockup counter
> was trivial enough to see, and I didn't think you needed more
> justification than the other counters.

um, well, Kees, sorry, that wasn't a world class effort.