Re: [PATCH net] net/mlx5e: Fix use-after-free in mlx5e_tx_reporter_timeout_recover

From: Cosmin Ratiu

Date: Tue May 12 2026 - 07:09:52 EST


On Wed, 2026-04-08 at 19:44 +0100, Matt Fleming wrote:
> From: Matt Fleming <mfleming@xxxxxxxxxxxxxx>

First of all, apologies for the delay. I missed this, and it seems
nobody else responded for more than a month.

Next time, you will probably get a quicker response if you directly CC
the people involved in the patch that introduced the bug. This will
also make the patchwork checkers happier.

>
> mlx5e_tx_reporter_timeout_recover() accesses sq->netdev after
> mlx5e_safe_reopen_channels() has torn down and freed the channel (and
> its embedded SQs). Replace the three sq->netdev references with
> priv->netdev which is safe because priv outlives channel teardown.
>
> The netdev_err() call already used priv->netdev for this reason; make
> the trylock/unlock and health_channel_eq_recover calls consistent.
>
> This fixes the following KASAN splat:
>
>   BUG: KASAN: use-after-free in mlx5e_tx_reporter_timeout_recover+0x1dd/0x360 [mlx5_core]
>   Read of size 8 at addr ffff889860ed0b28 by task kworker/u113:2/5277
>
>   Call Trace:
>    mlx5e_tx_reporter_timeout_recover+0x1dd/0x360 [mlx5_core]
>    devlink_health_reporter_recover+0xa2/0x150
>    devlink_health_report+0x254/0x7c0
>    mlx5e_reporter_tx_timeout+0x297/0x380 [mlx5_core]
>    mlx5e_tx_timeout_work+0x109/0x170 [mlx5_core]
>    process_one_work+0x677/0xf20
>    worker_thread+0x51f/0xd90
>    kthread+0x3a5/0x810
>    ret_from_fork+0x208/0x400
>    ret_from_fork_asm+0x1a/0x30
>
> Fixes: 83ac0304a2d7 ("net/mlx5e: Fix deadlocks between devlink and netdev instance locks")
> Signed-off-by: Matt Fleming <mfleming@xxxxxxxxxxxxxx>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
> index afdeb1b3d425..8409ae73768f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
> @@ -160,13 +160,13 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
>   * channels are being closed for other reason and this work is not
>   * relevant anymore.
>   */
> - while (!netdev_trylock(sq->netdev)) {
> + while (!netdev_trylock(priv->netdev)) {
>   if (!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state))
>   return 0;
>   msleep(20);
>   }
>  
> - err = mlx5e_health_channel_eq_recover(sq->netdev, eq, sq->cq.ch_stats);
> + err = mlx5e_health_channel_eq_recover(priv->netdev, eq, sq->cq.ch_stats);
>   if (!err) {
>   to_ctx->status = 0; /* this sq recovered */
>   goto out;
> @@ -186,7 +186,7 @@ static int mlx5e_tx_reporter_timeout_recover(void *ctx)
>      "mlx5e_safe_reopen_channels failed recovering from a tx_timeout, err(%d).\n",
>      err);
>  out:
> - netdev_unlock(sq->netdev);
> + netdev_unlock(priv->netdev);
>   return err;
>  }
>  

Thank you for the fix. This is a real problem: it can happen when direct
SQ recovery fails and all channels need to be reopened, which is
apparently what happened in your KASAN report.
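
To illustrate the failure mode, here is a minimal userspace sketch of the
lifetime issue. It is not driver code: every model_* name is a hypothetical
stand-in, the netdev lock is reduced to a flag, and the reopen step is
assumed to free the old channel (with its embedded SQ) and allocate a new
one. The point it shows is only that after the reopen the saved sq pointer
is stale, so the unlock has to go through priv->netdev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the driver structures. */
struct model_netdev {
	bool locked;			/* models the netdev instance lock */
};

struct model_sq {
	struct model_netdev *netdev;
	bool recoverable;		/* does direct SQ recovery succeed? */
};

struct model_channel {
	struct model_sq sq;		/* the SQ is embedded in the channel */
};

struct model_priv {
	struct model_netdev *netdev;	/* outlives channel teardown */
	struct model_channel *channel;
};

static bool model_trylock(struct model_netdev *nd)
{
	if (nd->locked)
		return false;
	nd->locked = true;
	return true;
}

static void model_unlock(struct model_netdev *nd)
{
	assert(nd->locked);
	nd->locked = false;
}

/* Models a full channel reopen: the old channel is freed and replaced. */
static int model_reopen_channels(struct model_priv *priv)
{
	struct model_channel *new_ch = malloc(sizeof(*new_ch));

	if (!new_ch)
		return -1;
	new_ch->sq.netdev = priv->netdev;
	new_ch->sq.recoverable = true;
	free(priv->channel);		/* pointers into the old channel are now stale */
	priv->channel = new_ch;
	return 0;
}

/* Models the recover path: sq is only valid until the reopen. */
static int model_timeout_recover(struct model_priv *priv, struct model_sq *sq)
{
	int err = 0;

	if (!model_trylock(priv->netdev))
		return 0;

	/* Reading sq here is fine: the channel has not been torn down yet. */
	if (!sq->recoverable)
		err = model_reopen_channels(priv);

	/* sq may be dangling at this point, so use priv->netdev instead. */
	model_unlock(priv->netdev);
	return err;
}
```

In this model, replacing the final model_unlock(priv->netdev) with
model_unlock(sq->netdev) would read freed memory whenever the reopen path
runs, which mirrors the read KASAN flagged in the report.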

Reviewed-by: Cosmin Ratiu <cratiu@xxxxxxxxxx>