Re: [PATCH v2] device property: Make modifications of fwnode "flags" thread safe
From: Doug Anderson
Date: Fri Mar 20 2026 - 23:19:01 EST
Hi,
On Thu, Mar 19, 2026 at 10:52 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Thu, Mar 19, 2026 at 10:25 AM Saravana Kannan <saravanak@xxxxxxxxxx> wrote:
> >
> > On Tue, Mar 17, 2026 at 9:04 AM Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
> > >
> > > In various places in the kernel, we modify the fwnode "flags" member
> > > by doing either:
> > > fwnode->flags |= SOME_FLAG;
> > > fwnode->flags &= ~SOME_FLAG;
> > >
> > > This type of modification is not thread-safe. If two threads are both
> > > mucking with the flags at the same time then one can clobber the
> > > other.
> > >
> > > While flags are often modified while under the "fwnode_link_lock",
> > > this is not universally true.
> > >
> > > Create some accessor functions for setting, clearing, and testing the
> > > FWNODE flags and move all users to these accessor functions. New
> > > accessor functions use set_bit() and clear_bit(), which are
> > > thread-safe.
> > >
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Fixes: c2c724c868c4 ("driver core: Add fw_devlink_parse_fwtree()")
> > > Reviewed-by: Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx>
> > > Acked-by: Mark Brown <broonie@xxxxxxxxxx>
> > > Reviewed-by: Wolfram Sang <wsa+renesas@xxxxxxxxxxxxxxxxxxxx>
> > > Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> > > ---
> > > While this patch is not known for sure to fix any specific issues, it
> > > seems possible that it could fix some rare problems. I'm currently
> > > trying to track down a hard-to-reproduce heisenbug and one (currently
> > > unproven) theory I had was that the fwnode flags could be getting
> > > messed up like this. Even if turns out not to fix my heisenbug,
> > > though, this seems like a worthwhile change to take.
> >
> > Reviewed-by: Saravana Kannan <saravanak@xxxxxxxxxx>
>
> Thanks for the review!
>
>
> > Thanks Doug. Hope this isn't the cause of the hisenbug. If you report
> > it here, I might be able to take a look at it too (no promises).
>
> I don't _think_ it fixes my bug, but I'm still not 100% sure because
> the bug can take a day or so to reproduce and it appears to only
> reproduce on official kernels built by the builder. :( This makes it
> hard to say anything for certain and also hard for me to inject extra
> debug logic.
Just in case anyone out there was wracking their brains based on my
description of the bug...
I've made progress in getting the issue to reproduce even with debug
information added. With that, I've found that
device_links_driver_bound() is getting called where `dev->fwnode->dev`
is NULL. That prevents it from running the ever-important
__fw_devlink_pickup_dangling_consumers().
I can see that device_add() has started, but it just hasn't made it to
the `dev->fwnode->dev = dev;` line yet. My printout next to that line
shows up _after_ my printout in device_links_driver_bound().
So obviously something can happen to cause the device to probe before
the call to bus_probe_device().
OK, I managed to get a stack crawl for when `dev->fwnode->dev ==
NULL`. It looks like this (FWIW, it's a 6.6 kernel but issue also
reproduces on our 6.12 kernel, and I see no reason it wouldn't
reproduce on mainline):
Call trace:
dump_backtrace+0xe8/0x108
show_stack+0x18/0x28
dump_stack_lvl+0x50/0x6c
dump_stack+0x18/0x24
device_links_driver_bound+0xa4/0x4b4
driver_bound+0x48/0x1c4
really_probe+0x244/0x374
__driver_probe_device+0xa0/0x12c
driver_probe_device+0x3c/0x218
__driver_attach+0x110/0x1ec
bus_for_each_dev+0x104/0x160
driver_attach+0x24/0x34
bus_add_driver+0x154/0x270
driver_register+0x68/0x104
__platform_driver_register+0x24/0x34
init_module+0x20/0xfe4 [max77779_pmic_pinctrl
e09198e651272bc5df70245355346d6eb1ba3a8f]
do_one_initcall+0xdc/0x360
do_init_module+0x58/0x23c
load_module+0xffc/0x1130
__arm64_sys_finit_module+0x260/0x300
invoke_syscall+0x58/0x114
It looks like what happens is that immediately after device_add()
calls bus_add_device() there's a possibility of another thread
inserting the module holding the device driver. That means that the
driver can start probing much earlier than we expect.
I've posted an RFC patch to fix this. If folks are interested, please review it:
https://lore.kernel.org/r/20260320200656.RFC.1.Id750b0fbcc94f23ed04b7aecabcead688d0d8c17@changeid
Given the stack crawl I got, I'm fairly certain that this will fix the
problem, but I'll also let reboot tests run over the weekend to
confirm.
-Doug