Re: [PATCH] tg3: guard napi_disable and pci_disable_device calls

From: Pavan Chebbi

Date: Mon May 18 2026 - 11:44:40 EST


>
> To do cleanup on device-stop or error handling is a common approach, and it is not a problem. The problem is that tg3 doesn't track the device status. If a cleanup was performed (napi_disable and pci_disable_device called), tg3 should know that the device is not in an initialized state. When we subsequently try to disable the device, tg3 should not try to cleanup again. That is the problem I'm trying to fix. Frankly speaking, I'm adding a flag which signals that device cleanup was completed during the handling of an AER error, so when tg3 stops/removes the device, it should not perform cleanup again.

I understand this. But my view is that you are trying to fix a
symptom. The real issue seems to be that you have some devices
in your bus/subtree that don't support AER.
Looking at pcie_do_recovery(), the recovery aborts the moment
NO_AER_DRIVER is received by the port driver. I would like to think
that this is by design.

>
> Maybe PCI_ERS_RESULT_NO_AER_DRIVER is a rare case in the PCIe world, but I think that in any case, the tg3 driver should correctly handle AER recovery failure. Recovery can fail not only because of the PCI_ERS_RESULT_NO_AER_DRIVER return code. The problem is that a double napi_disable call causes a soft lockup, and not just one driver/device stops functioning—the whole system is affected.
>

Like I said, the tg3 driver does handle the AER sequence correctly.
The real problem is that tg3 remove() runs into trouble because the
PCIe device topology is not suitable for AER recovery.
The issue is a consequence of tg3 suffering a stall during AER because
it never received slot_reset() and resume().
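
To make the sequence concrete, here is a rough userspace sketch of the
two paths under discussion. It is not tg3 code and not the posted patch:
struct fake_dev, the cleanup_done flag and every fake_* function are
hypothetical stand-ins, and fake_napi_disable() only counts calls
instead of blocking the way the report says a second napi_disable()
does.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-device driver state (not struct tg3). */
struct fake_dev {
	int napi_disable_calls;	/* times the NAPI stand-in was disabled */
	bool cleanup_done;	/* assumed "cleanup already ran" flag */
};

/* Stand-in for napi_disable(); it only counts calls. Per the report in
 * this thread, a real second call without re-enabling soft-locks. */
static void fake_napi_disable(struct fake_dev *d)
{
	d->napi_disable_calls++;
}

/* Shared cleanup, guarded so that a repeated call is a no-op. */
static void fake_cleanup(struct fake_dev *d)
{
	if (d->cleanup_done)
		return;
	fake_napi_disable(d);
	d->cleanup_done = true;
}

/* AER error_detected() path: the device is quiesced. */
static void fake_error_detected(struct fake_dev *d)
{
	fake_cleanup(d);
}

/* AER resume() path: the device is live again, so a later
 * remove() must clean up once more. */
static void fake_resume(struct fake_dev *d)
{
	d->cleanup_done = false;
}

/* remove() path: runs whether or not recovery completed. */
static void fake_remove(struct fake_dev *d)
{
	fake_cleanup(d);
}
```

If recovery aborts after fake_error_detected() and fake_resume() is
never reached, fake_remove() finds the flag set and skips the second
disable. After a completed recovery, the flag is clear and remove()
cleans up as usual.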

I am not sure if we should allow this fix for what I think is a
situation arising out of an unsupported topology.
I will defer this to Michael or netdev maintainers for their feedback.
