Re: [PATCH net-next 00/10] net: lan966x: add support for PCIe FDMA
From: Herve Codina
Date: Fri Mar 27 2026 - 06:36:19 EST
Hi Daniel,

On Thu, 26 Mar 2026 16:48:33 +0100
Daniel Machon <daniel.machon@xxxxxxxxxxxxx> wrote:
...
>
> As I remembered, doing rmmod on the lan966x_switch followed by modprobe
> lan966x_switch works fine. This is because neither the switch core, nor the FDMA
> engine is reset, so they remain in sync.
>
> When the lan966x_pci module is removed and reloaded (what you did), the DT
> overlay is re-applied, which causes the reset controller
> (reset-microchip-sparx5) to re-probe. During probe, it performs a GCB soft reset
> that resets the switch core, but protects the CPU domain from the reset. The
> FDMA engine is part of the CPU domain, so it is not reset.
>
> This leaves the switch core in a reset state while the FDMA
> retains state from the previous driver instance. When the switch driver
> subsequently probes and activates the FDMA channels, the two are out of
> sync, and the FDMA immediately reports extraction errors.
>
> There's actually an FDMA register called NRESET that resets the FDMA controller
> state. Calling this in the FDMA init path causes traffic to work correctly on
> lan966x_pci reload, but it does not get rid of the FDMA splats you posted above.
> They get queued up between the switch core reset, in the reset controller, and
> the FDMA enabling. I tried different approaches to drain or flush queues, but
> they won't go away entirely.
>
> The only thing that seems to work consistently is to *not* do the soft reset in
> the reset controller for the PCI path. The soft reset is actually the problem:
> it only resets the switch core while protecting the CPU domain (including FDMA),
> causing a desync.
>
> A simple fix could be (in reset-microchip-sparx5.c):
>
> +static bool mchp_reset_is_pci(struct device *dev)
> +{
> +	for (dev = dev->parent; dev; dev = dev->parent) {
> +		if (dev_is_pci(dev))
> +			return true;
> +	}
> +	return false;
> +}
>
> -	/* Issue the reset very early, our actual reset callback is a noop. */
> -	err = sparx5_switch_reset(ctx);
> -	if (err)
> -		return err;
> +	/* Issue the reset very early, our actual reset callback is a noop.
> +	 *
> +	 * On the PCI path, skip the reset. The endpoint is already in
> +	 * power-on reset state on the first probe. On subsequent probes
> +	 * (after driver reload), resetting the switch core while the FDMA
> +	 * retains state (CPU domain is protected from the soft reset)
> +	 * causes the two to go out of sync, leading to FDMA extraction
> +	 * errors.
> +	 */
> +	if (!mchp_reset_is_pci(&pdev->dev)) {
> +		err = sparx5_switch_reset(ctx);
> +		if (err)
> +			return err;
> +	}
>
> Could you test it and see if it helps the problem on your side?
>
I have tested it on my ARM and x86 systems. It fixes the lan966x_pci module
unloading / reloading issue.

However, another regression is present. After a reboot without a power
off/on cycle, the board does not work (tested on both my ARM and x86 systems).
According to your explanation, this makes sense.

IMHO, the problem is that we cannot make the assumption that "The endpoint
is already in power-on reset state on the first probe". That's not true
when you just call the reboot command.
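
To make the point concrete, here is a toy user-space model of the
reasoning. It is not driver code: the struct, the "stale" flags and every
function name are made up for illustration, and the only behaviour it
encodes is what is described above (a power cycle clears everything, the
GCB soft reset clears the switch core but not the FDMA, and skipping the
reset clears nothing).

```c
#include <stdbool.h>

/* Toy model of the endpoint. "stale" means state left over from earlier use. */
struct board {
	bool core_stale;	/* switch core */
	bool fdma_stale;	/* FDMA engine, part of the CPU domain */
};

/* Power off/on: both blocks return to power-on reset state. */
static void power_cycle(struct board *b)
{
	b->core_stale = false;
	b->fdma_stale = false;
}

/* A driver instance has run: both blocks now carry state. */
static void run_traffic(struct board *b)
{
	b->core_stale = true;
	b->fdma_stale = true;
}

/* GCB soft reset: the switch core is reset, the CPU domain is protected. */
static void soft_reset(struct board *b)
{
	b->core_stale = false;
}

/* Reset controller probe; skip_reset models the proposed PCI-path check. */
static void probe(struct board *b, bool skip_reset)
{
	if (!skip_reset)
		soft_reset(b);
}

/* True only if neither block carries leftover state. */
static bool in_power_on_state(const struct board *b)
{
	return !b->core_stale && !b->fdma_stale;
}

/* True if exactly one of the two blocks carries leftover state. */
static bool desynced(const struct board *b)
{
	return b->core_stale != b->fdma_stale;
}
```

In this model, power_cycle() followed by probe(&b, true) leaves the board in
power-on state, and run_traffic() followed by probe(&b, false) ends up
desynced, which matches the module reload case. However, run_traffic()
followed by probe(&b, true) with no power_cycle() in between (the warm
reboot case) is not desynced but is not in power-on state either. The model
only shows that the assumption in the comment does not hold in that case;
it does not explain what exactly goes wrong on the hardware afterwards.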

Best regards,
Hervé