Re: [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error
From: Jakub Kicinski
Date: Wed Apr 29 2026 - 21:43:11 EST
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: SD, Fix missing cleanup on probe error
This commit adds a missing mlx5_sd_cleanup() call to the probe error path.
The commit message notes a similar gap exists on the resume path leaving
sd == NULL on failure, but defers the fix to a follow-up series.
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 5a46870c4b74..e21affd0ffc4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
[ ... ]
> @@ -6912,9 +6912,16 @@ static int mlx5e_probe(struct auxiliary_device *adev,
> return err;
>
> actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
> - if (actual_adev)
> - return _mlx5e_probe(actual_adev);
> + if (actual_adev) {
> + err = _mlx5e_probe(actual_adev);
> + if (err)
> + goto sd_cleanup;
> + }
> return 0;
> +
> +sd_cleanup:
> + mlx5_sd_cleanup(mdev);
> + return err;
> }
>
> static void _mlx5e_remove(struct auxiliary_device *adev)
Will leaving the resume gap unfixed as mentioned in the commit message cause
a NULL pointer dereference when a secondary device is later removed?
If a secondary device fails resume, sd becomes NULL. Later during removal,
mlx5_sd_get_adev() will check !sd and return the secondary adev.
This causes _mlx5e_remove() to be executed on the secondary adev:
_mlx5e_remove()
{
struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
struct net_device *netdev = mlx5e_dev->netdev;
...
}
Since _mlx5e_probe() is never called on secondary devices,
auxiliary_get_drvdata(adev) returns NULL. Does this mean accessing
mlx5e_dev->netdev will crash?
Would it be safer to add a NULL check for mlx5e_dev in _mlx5e_remove()
until the proper PM lifecycle fixes are implemented?