Re: [syzbot] [mm?] kernel BUG in collapse_scan_file

From: Lance Yang

Date: Thu Mar 19 2026 - 05:16:32 EST




On 2026/3/19 17:00, David Hildenbrand (Arm) wrote:
On 3/19/26 09:53, Lorenzo Stoakes (Oracle) wrote:
On Thu, Mar 19, 2026 at 04:05:38PM +0800, Lance Yang wrote:
Ccing Willy

IIUC, this is a dup of the earlier report[1], which I looked into back
in January. The root cause is the same: collapse_file() calls
xas_lock_irq() without resetting the xas state first, tripping the
XAS_INVALID() assertion:

#define xas_lock_irq(xas) xa_lock_irq(XAS_INVALID(xas)->xa)

static inline struct xa_state *XAS_INVALID(struct xa_state *xas)
{
	XA_NODE_BUG_ON(xas->xa_node, xas_valid(xas));
	return xas;
}

Added by commit:

commit 43b00759f21b10142094d1ae5ff65cbb368953a3
Author: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Date: Sun Dec 14 10:53:31 2025 -0500

XArray: Add extra debugging check to xas_lock and friends

While tracking down a recent bug, we discovered somewhere that had
forgotten to call xas_reset() before calling xas_lock(). Add a debug
check to be sure that doesn't happen in future and fix all the places in
the test suite which were carelessly doing just this.

Suggested-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>

I posted a HACK fix at the time[2], but David pointed out that Willy
had mentioned it likely needs more thought[3].
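
For readers following along, the check quoted above can be modelled in
userspace roughly as below. This is only a sketch, not kernel code: every
mock_* name is hypothetical, the low-bit tagging of xa_node is an assumption
about how xas_valid() tells a walked state from a reset one, and the lock
helper returns false where the real debug check would BUG.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a tree node; only its address matters here. */
struct mock_node {
	int slot;
};

/* Assumption: a reset state is a tagged non-pointer value (low bits set). */
#define MOCK_XAS_RESTART ((struct mock_node *)(uintptr_t)3)

/* Hypothetical stand-in for the operation state carried across a walk. */
struct mock_xa_state {
	struct mock_node *xa_node;
	bool locked;
};

/* "Valid" means the state still refers to a position from an earlier walk. */
static bool mock_xas_valid(const struct mock_xa_state *xas)
{
	return ((uintptr_t)xas->xa_node & 3) == 0;
}

/* Forget the walked position so the next operation restarts from the top. */
static void mock_xas_reset(struct mock_xa_state *xas)
{
	xas->xa_node = MOCK_XAS_RESTART;
}

/*
 * Model of the debug check: refuse to take the lock while the state is
 * still valid. Returns false where the real check would BUG.
 */
static bool mock_xas_lock(struct mock_xa_state *xas)
{
	if (mock_xas_valid(xas))
		return false;
	xas->locked = true;
	return true;
}
```

In this model, locking with a state that still points at a node is refused,
while calling mock_xas_reset() first lets the lock go through.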

Hmm, we shouldn't leave this bug in place while working on a fancier fix?

Can we get _something_ going as an upstream fix? We can improve whatever we do
later right?

David, thoughts?

I recall Willy mentioning that the issue is likely a false positive.

IIUC, that commit is not upstream? So it only triggers in linux-next.

Right. That does not appear to be in upstream, I only see it in linux-next :)

Which means:

1) If it's a false positive, upstream is not affected (no XA_NODE_BUG_ON)

2) If it's not a false positive, upstream is affected but does not
trigger the XA_NODE_BUG_ON

Yep. So this particular BUG_ON is not affecting upstream directly.

That said, syzbot will likely keep hitting it in linux-next and
generating noise for us until it is addressed there ...