Re: [RFC PATCH] driver core: Don't link the device to the bus until we're ready to probe
From: Alan Stern
Date: Sat Mar 21 2026 - 11:55:32 EST
On Sat, Mar 21, 2026 at 01:05:48AM -0700, Doug Anderson wrote:
> Hi,
>
> On Sat, Mar 21, 2026 at 12:42 AM Greg Kroah-Hartman
> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Sat, Mar 21, 2026 at 12:35:32AM -0700, Doug Anderson wrote:
> > > Hi,
> > >
> > > On Fri, Mar 20, 2026 at 10:41 PM Greg Kroah-Hartman
> > > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Mar 20, 2026 at 08:06:58PM -0700, Douglas Anderson wrote:
> > > > > The moment we link a "struct device" into the list of devices for the
> > > > > bus, it's possible probe can happen. This is because another thread
> > > > > can load the driver at any time and that can cause the device to
> > > > > probe. This has been seen in practice with a stack crawl that looks
> > > > > like this [1]:
> > > > >
> > > > > really_probe()
> > > > > __driver_probe_device()
> > > > > driver_probe_device()
> > > > > __driver_attach()
> > > > > bus_for_each_dev()
> > > > > driver_attach()
> > > > > bus_add_driver()
> > > > > driver_register()
> > > > > __platform_driver_register()
> > > > > init_module() [some module]
> > > > > do_one_initcall()
> > > > > do_init_module()
> > > > > load_module()
> > > > > __arm64_sys_finit_module()
> > > > > invoke_syscall()
> > > >
> > > > Are you sure this isn't just a platform bus issue? A bus should NOT be
> > > > allowing a driver to be added at the same time a device is being added
> > > > for that bus, ideally there should be a bus-specific lock somewhere for
> > > > this.
> > >
> > > Sure, if the right fix for this is somewhere in the platform bus code
> > > then I'd be happy with a patch there to fix it. ...but from my quick
> > > glance (admittedly, it's Friday night and I'm tired), it seems like
> > > the problem is just with driver_register() being called at the same
> > > time as device_add().
> > >
> > > Certainly adding some sort of locking could be a solution (happy for
> > > someone to tell me where to place them), but we'd have to make sure we
> > > aren't regressing performance for the normal case...
> > >
> > >
> > > > When a device is added to the bus, yes, a probe can happen, and is
> > > > expected to happen, for that device, so this feels odd.
> > > >
> > > > that being said, your patch does seem sane, and I don't see anything
> > > > obviously wrong with it. But it feels odd that this is just now showing
> > > > up for something that has been this way for a few decades...
> > >
> > > I suspect it's a latent bug that was triggered by a new Android
> > > feature. It's showing up on phones that have
> > > "ro.boot.load_modules_parallel" set. I think you can get to the
> > > relevant source code at:
> > >
> > > https://cs.android.com/android/platform/superproject/main/+/main:system/core/libmodprobe/libmodprobe.cpp?q=LoadModulesParallel
> > >
> > > I suspect the bug is never triggered with more normal module loading
> > > schemes. Indeed, one phone that has nearly the same set of drivers but
> > > has parallel module loading turned off has no reports of this
> > > problem...
> >
> > Ah, I think we always assumed that modules can NOT be loaded in
> > parallel, isn't there an internal module lock that prevents this from
> > happening?
> >
> > So yes, that might be the root problem here.
>
> It's late Friday night for me (technically Saturday morning), so I'm
> not going to dig now. ...but I'm fairly certain that Android isn't
> using any downstream kernel patches to accomplish its "parallel module
> loading". It's just userspace jamming modules in as fast as it can.
> Userspace loading modules quickly shouldn't cause the kernel to behave
> badly.
>
> If the right solution is to add more locking to the kernel to slow
> userspace down, that is also something I could try. It will likely end
> up impacting boot speed, but of course correctness comes first. Let me
> know if this is a direction I should dig (or someone is free to post a
> patch and I can test it).
As far as I know, there's no particular reason why modules shouldn't be
loaded in parallel, or at least, in very quick succession. Locking
shouldn't matter either -- that is, the existing locks ought to be
adequate.
> > > I'd also note that the only actual symptom we're seeing is with
> > > fw_devlink misbehaving (because dev->fwnode->dev wasn't set early
> > > enough). fw_devlink is a "new" (ish) feature, is officially optional,
> > > and isn't used on all hardware.
> >
> > That's true too, can we set that earlier?
>
> Yes, I can post a patch that _just_ moves the set of dev->fwnode->dev
> earlier, and that will probably fix my symptoms (I'll need to test).
> This patch already moves it a bit earlier, but if we don't break the
> linking out as a separate step it would need to move even higher up in
> the function.
>
> Originally, I was going to just propose that, but then I realized that
> some of the other code in device_add() probably also ought to run
> before we let the driver probe, and hence I ended up with this patch.
This sounds like a more generic problem. A bunch of things happen after
bus_add_device() that should be completed before probing can start; the
firmware node stuff is just one of them.
Splitting bus_add_device() in two sounds reasonable, although I would
rename the old routine to bus_link_device, since all it does it add some
groups and symlinks. The new routine can be called bus_add_device().
The real question is whether any of the other stuff that happens before
bus_probe_device() needs to come after the device is added to the bus's
list. The bus_notify() and kobject_uevent() calls are good examples; I
don't know what their requirements are. Should they be moved down,
between the new bus_add_device() and bus_probe_device()?
Alan Stern