Re: [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove

From: Jakub Kicinski

Date: Wed Apr 29 2026 - 21:43:52 EST


This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: SD, Fix race condition in secondary device probe/remove

This commit updates secondary devices to acquire the device lock of the
primary auxiliary device during probe, suspend, resume, and remove
operations. It introduces mlx5_sd_get_adev() and mlx5_sd_put_adev() to
manage this lock ordering and add a second devcom readiness check.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> index 7a1787f15320..a43ae482a679 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c

[ ... ]

> @@ -563,5 +567,18 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
> if (dev == primary)
> return adev;
>
> + device_lock(&primary->priv.adev[idx]->adev.dev);

If mlx5_devcom_comp_is_ready() returned true just before this, but the
secondary thread is preempted before reaching device_lock(), could a
concurrent removal of the primary device cause primary->priv.adev[idx]
to become NULL or freed?

It looks like the removal path (such as mlx5_unregister_device) frees the
mlx5_adev memory and explicitly sets primary->priv.adev[idx] to NULL. If
the secondary thread then resumes, it might dereference a NULL or freed
pointer when attempting to acquire the lock.

Since the second readiness check happens after the device_lock() call, does
this leave a window where memory corruption or a panic can still occur?

Additionally, since the driver core already holds device_lock(&adev->dev)
for the secondary device during callbacks like probe() and remove(), will
acquiring the primary device's lock here trigger a lockdep warning for
possible recursive locking?

Both the primary and secondary auxiliary devices use the same generic lock
class assigned in device_initialize(). While the fixed secondary to primary
ordering avoids an actual deadlock, lockdep might not distinguish between
the two mutexes without explicit lockdep subclassing.

> + /* In case primary finish removing its adev */
> + if (!mlx5_devcom_comp_is_ready(sd->devcom)) {
> + device_unlock(&primary->priv.adev[idx]->adev.dev);
> + return NULL;
> + }
> return &primary->priv.adev[idx]->adev;
> }