Re: [PATCH] tg3: guard napi_disable and pci_disable_device calls

From: Yury M.

Date: Mon May 18 2026 - 12:43:12 EST


It looks like your opinion is based on the fact that AER recovery failure is a critical issue after which the system should be considered unstable. But I disagree with that. AER error recovery failure can happen not only because of PCI_ERS_RESULT_NO_AER_DRIVER (arguably considered a buggy or unsupported PCIe bus configuration). There are other AER recovery failure reasons. And until the system is locked by tg3, it has a chance to be recovered, and tg3 should not be a blocker here. I would accept the case where, after AER recovery failure, the tg3 device stops working. We can consider it as an unsupported case where manual intervention is required to recover tg3. But in this case, it blocks rtnl_lock.

For me, any AER error recovery failure is a legitimate condition in Linux. Maybe we can even consider that the whole bus is in an unexpected/unstable state. But we still should give the software a chance to recover the system. Especially when we are talking about the support of hot-pluggable devices. Insertion or removal of a new PCIe device (without AER recovery support) onto the same bus as the tg3 device should not cause an rtnl_lock hang.

Sorry, maybe I'm wrong and AER recovery failure is really a critical and unexpected system error.

Theoretically, in this particular case, we can try to fix PCI_ERS_RESULT_NO_AER_DRIVER and avoid changes in tg3. But I'm sure that there are other cases where AER recovery can fail and there are many other hot-pluggable devices that can return PCI_ERS_RESULT_NO_AER_DRIVER or cause another type of AER recovery failure. As such, it looks like it is better to handle AER recovery failure in tg3.

Another long-term fix could be to add another AER error handler callback to signal drivers about AER recovery failure.

On 5/18/26 16:43, Pavan Chebbi wrote:
To do cleanup on device-stop or error handling is a common approach, and it is not a problem. The problem is that tg3 doesn't track the device status. If a cleanup was performed (napi_disable and pci_disable_device called), tg3 should know that the device is not in an initialized state. When we subsequently try to disable the device, tg3 should not try to cleanup again. That is the problem I'm trying to fix. Frankly speaking, I'm adding a flag which signals that device cleanup was completed during the handling of an AER error, so when tg3 stops/removes the device, it should not perform cleanup again.
I understand this. But my view is that you are trying to fix a
symptom. The real issue seems to be that you are having some devices
in your bus/subtree that don't support AER.
Looking at pcie_do_recovery(), the recovery aborts the moment
NO_AER_DRIVER is received by the port driver. I would like to think
that this is by design.

Maybe PCI_ERS_RESULT_NO_AER_DRIVER is a rare case in the PCIe world, but I think that in any case, the tg3 driver should correctly handle AER recovery failure. Recovery can fail not only because of the PCI_ERS_RESULT_NO_AER_DRIVER return code. The problem is that a double napi_disable call causes a soft lockup, and not just one driver/device stops functioning—the whole system is affected.

Like I said, the tg3 driver does handle the AER sequence correctly.
The problem really is the tg3 remove() is facing issues because the
PCIe devices' topology is not suitable for AER recovery.
The issue a consequence of tg3 suffering a stall during AER because it
never recvd slot_reset() and resume().

I am not sure if we should allow this fix for what I think is a
situation arising out of an unsupported topology..
I will defer this to Michael or netdev maintainers for their feedback.