Re: [PATCH net v3] net: tg3: guard napi_disable and pci_disable_device calls

From: Pavan Chebbi

Date: Tue May 26 2026 - 05:13:05 EST


On Tue, May 26, 2026 at 12:39 PM ALOK TIWARI <alok.a.tiwari@xxxxxxxxxx> wrote:
>
>
>
> On 5/26/2026 9:07 AM, Pavan Chebbi wrote:
> > On Mon, May 25, 2026 at 3:19 PM Yury Murashka <yurypm@xxxxxxxxxx> wrote:
> >>
> >> During PCIe hot-plug events, uncorrectable errors can be reported and
> >> AER recovery for the tg3 device is initiated by the AER kernel driver.
> >> The tg3_io_error_detected function is the AER error recovery handler.
> >>
> >> From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
> >> napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
> >> We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume
> >> will be called. But AER error recovery can fail. For example, when one
> >> of PCIe devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER.
> >> As a result, tg3_io_slot_reset and tg3_io_resume are not called, PCIe
> >> device is disabled and NAPI is disabled (pci_disable_device and
> >> napi_disable are called from tg3_io_error_detected). Then we can try to
> >> disable PCIe link and napi_disable will be called again:
> >>
> >> napi_disable+0x1b/0x1b0
> >> tg3_napi_disable+0x89/0xa0 [tg3]
> >> tg3_netif_stop+0x37/0xe3 [tg3]
> >> tg3_stop+0x30/0x160 [tg3]
> >> tg3_close+0x2a/0x60 [tg3]
> >> __dev_close_many+0xad/0x130
> >> dev_close_many+0xb2/0x190
> >> unregister_netdevice_many_notify+0x19d/0xa00
> >> unregister_netdevice_queue+0xf8/0x140
> >> unregister_netdev+0x1c/0x30
> >> tg3_remove_one+0xaa/0x150 [tg3]
> >> pci_device_remove+0x42/0xb0
> >> device_release_driver_internal+0x19c/0x200
> >> pci_stop_bus_device+0x85/0xb0
> >> pci_stop_bus_device+0x2c/0xb0
> >> pci_stop_bus_device+0x2c/0xb0
> >> pci_stop_and_remove_bus_device+0x12/0x20
> >> pciehp_unconfigure_device+0x9f/0x160
> >> pciehp_disable_slot+0x67/0x100
> >> pciehp_handle_presence_or_link_change+0x77/0x350
> >>
> >> This is not expected by napi_disable and a thread can be locked in
> >> napi_disable forever. We have pcierr_recovery to cover a similar issue,
> >> but for fatal errors. We cannot reuse this flag because it is reset in
> >> tg3_io_resume, but it is not called when AER recovery fails.
> >>
> >> Similarly, if an AER error is reported and tg3_io_error_detected calls
> >> pci_disable_device, a subsequent device removal via tg3_remove_one or
> >> tg3_shutdown will call pci_disable_device again for the already-disabled
> >> device.
> >>
> >> Add a napi_enabled flag to struct tg3 to track whether napi_enable has
> >> been called. Guard tg3_napi_disable() so it returns early if NAPI was
> >> not previously enabled. Also guard pci_disable_device() calls in
> >> tg3_remove_one() and tg3_shutdown() with pci_is_enabled() to avoid
> >> disabling an already-disabled device.
> >>
> >> Fixes: b45aa2f6192e ("tg3: Add EEH support")
> >> Signed-off-by: Yury Murashka <yurypm@xxxxxxxxxx>
> >> ---
> >> drivers/net/ethernet/broadcom/tg3.c | 14 ++++++++++++--
> >> drivers/net/ethernet/broadcom/tg3.h | 1 +
> >> 2 files changed, 13 insertions(+), 2 deletions(-)
> >>
> >
> > LGTM.
> > Reviewed-by: Pavan Chebbi <pavan.chebbi@xxxxxxxxxxxx>
> > nit: The revision-change log for the patch is missing..
>
> It seems this patch addresses two separate issues in the AER recovery flow:
>
> NAPI enable/disable state handling
> PCI device enabled/disabled state handling
>
> Would it make sense to split this into two patches?

I am fine with a single patch, both guards together make the solution
complete which has a single root case.

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature