Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure

From: Lukas Wunner

Date: Wed May 20 2026 - 05:01:01 EST


On Tue, May 19, 2026 at 05:05:20PM +0100, Yury M. wrote:
> Root port can detect AER error with source 0000:00:00.0.
>
> In this case, we call find_source_device -> find_device_iter. The
> 'multi-error' flag is not set, and we are looking for the first error (not
> all). This means that for any error with the 0000:00:00.0 source on the root
> port, we will report the error for the first device on the bus.

No, is_error_source() treats a source ID with bus number 0 as bogus,
so every device on the bus is checked rather than only the first one.
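
Roughly, the ID check behaves like the simplified model below. This is
a self-contained sketch for illustration, not the code in
drivers/pci/pcie/aer.c: struct err_info, enum id_result and id_match()
are made-up names, and the PCI_BUS_FLAGS_NO_AERSID quirk is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* The bus number is the upper byte of a 16-bit Requester ID. */
#define BUS_NUM(id) (((id) >> 8) & 0xff)

enum id_result {
	ID_MATCH,        /* this device is the error source */
	ID_MISMATCH,     /* this device can be skipped */
	ID_CHECK_STATUS, /* ID is unusable, inspect the device's AER status */
};

/* Hypothetical stand-in for the relevant fields of struct aer_err_info. */
struct err_info {
	uint16_t id;      /* Error Source ID latched by the Root Port */
	bool multi_error; /* multiple-error flag */
};

static enum id_result id_match(const struct err_info *info, uint16_t dev_id)
{
	if (BUS_NUM(info->id) != 0) {
		if (info->id == dev_id)
			return ID_MATCH;
		if (!info->multi_error)
			return ID_MISMATCH;
	}
	/*
	 * Bus number 0 (or multiple errors): the ID alone is not trusted,
	 * so the caller has to look at each device's AER status registers.
	 */
	return ID_CHECK_STATUS;
}
```

With a source ID of 0000:00:00.0 the function never returns ID_MATCH
or ID_MISMATCH, which is why the walk does not stop at the first
device on the bus.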

> In my case, an AER error reported by 0000:06:08.0 will be logged as an error
> reported by 0000:06:07.0 if AER recovery constantly fails.

The problem is that 0000:06:08.0 reports an Advisory Non-Fatal Error,
i.e. it sets the ANFE bit in the Correctable Error Status Register
and signals (only) a Correctable Error, even though it also sets bits
in the Uncorrectable Error Status Register.

The kernel lacks support for ANFE handling and will only clear the bits
in the Correctable Error Status Register. It neglects to also clear
(and report) the bits in the Uncorrectable Error Status Register.
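
To make the register relationship concrete, here is a rough user-space
sketch, not kernel code. The bit value matches PCI_ERR_COR_ADV_NFAT in
pci_regs.h; anfe_uncor_bits() is a hypothetical helper, and treating
every status bit with non-fatal severity as belonging to the ANFE is
an assumption made for illustration.

```c
#include <stdint.h>

/* Advisory Non-Fatal bit in the Correctable Error Status Register */
#define COR_ADV_NFAT 0x00002000u

/*
 * Hypothetical helper: from snapshots of the AER registers, return the
 * Uncorrectable Error Status bits that would need to be reported and
 * cleared together with the ANFE bit.
 */
static uint32_t anfe_uncor_bits(uint32_t cor_status, uint32_t uncor_status,
				uint32_t uncor_severity)
{
	/* No ANFE signaled: nothing extra to handle. */
	if (!(cor_status & COR_ADV_NFAT))
		return 0;

	/*
	 * A set severity bit marks the error as fatal. Only non-fatal
	 * errors can be downgraded to an advisory correctable error.
	 */
	return uncor_status & ~uncor_severity;
}
```

Clearing only the Correctable Error Status Register leaves whatever
this returns still set in the Uncorrectable Error Status Register.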

There was an effort two years back to bring up ANFE support but it
fizzled out. I talked to the submitter and he's now busy with other
things:

https://lore.kernel.org/r/20240620025857.206647-1-zhenzhong.duan@xxxxxxxxx/

It's on my todo list to respin his series but I can't promise when
I'll get to it.

Thanks,

Lukas