Re: [PATCH v1] PCI/DPC: Fix AER error logging for DPC/EDR triggered events
From: Bjorn Helgaas
Date: Mon Mar 30 2026 - 18:31:12 EST
On Wed, Mar 18, 2026 at 10:48:07AM -0700, Kuppuswamy Sathyanarayanan wrote:
> On 3/18/2026 10:22 AM, Bjorn Helgaas wrote:
> > On Wed, Mar 18, 2026 at 10:04:49AM -0700, Kuppuswamy Sathyanarayanan wrote:
> >> aer_print_error() skips printing if ratelimit_print[i] is not set.
> >> In the native AER path, ratelimit_print is initialized by
> >> add_error_device() during source device discovery, and is set to 1
> >> for fatal errors to bypass rate limiting since fatal errors should
> >> always be logged.
> >>
> >> The DPC/EDR path uses the DPC-capable port as the error source and
> >> reads its AER uncorrectable error status registers directly in
> >> dpc_get_aer_uncorrect_severity(). Since it does not go through
> >> add_error_device(), ratelimit_print[0] is left uninitialized and zero.
> >> As a result, aer_print_error() silently drops all AER error messages
> >> for DPC/EDR triggered events.
> >>
> >> Set ratelimit_print[0] to 1 to bypass rate limiting and always print
> >> AER logs for fatal errors.
To be precise, I think this bypasses rate limiting for all
uncorrectable errors (both fatal and non-fatal) that cause DPC to be
triggered, i.e., uncorrectable errors detected directly by the DPC
port, right?

Uncorrectable errors detected downstream of the DPC port would
generate ERR_NONFATAL or ERR_FATAL messages. When the DPC port
receives those, it triggers DPC but logs only the "containment event
... received from" message. That message isn't ratelimited, and this
patch doesn't change that. I guess there aren't any AER log details
to log in this case because they're in downstream devices that we
can't read while DPC is triggered.
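For reference, the gating described in the quoted commit log can be
modeled in userspace roughly as below. This is only a sketch, not the
kernel code: struct model_err_info, model_print_error(), and
model_dpc_fill() are hypothetical stand-ins for struct aer_err_info,
aer_print_error(), and dpc_get_aer_uncorrect_severity(), and the array
size is arbitrary.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_MAX_ERROR_DEVICES 5	/* arbitrary; not the kernel's value */

/* Hypothetical, heavily reduced stand-in for struct aer_err_info. */
struct model_err_info {
	const char *dev[MODEL_MAX_ERROR_DEVICES];
	int ratelimit_print[MODEL_MAX_ERROR_DEVICES];
	int error_dev_num;
};

/*
 * Models the gate described above: nothing is printed unless
 * ratelimit_print[i] is set.  Returns 1 if printed, 0 if dropped.
 */
int model_print_error(const struct model_err_info *info, int i)
{
	if (!info->ratelimit_print[i])
		return 0;

	printf("%s: AER: uncorrectable error details\n", info->dev[i]);
	return 1;
}

/*
 * Models the DPC path filling in the error source.  set_flag == 0 is
 * the behavior before the patch, set_flag == 1 the behavior after it.
 */
void model_dpc_fill(struct model_err_info *info, const char *port,
		    int set_flag)
{
	info->dev[0] = port;
	info->error_dev_num = 1;
	if (set_flag)
		info->ratelimit_print[0] = 1;
}

/* Runs both cases; the zero-initialized flag drops the message. */
void model_demo(void)
{
	struct model_err_info before = {0}, after = {0};

	model_dpc_fill(&before, "0000:00:01.0", 0);
	model_dpc_fill(&after, "0000:00:01.0", 1);

	assert(model_print_error(&before, 0) == 0);	/* silently dropped */
	assert(model_print_error(&after, 0) == 1);	/* always printed */
}
```

Calling model_demo() from a main() exercises both cases. The model
has no severity check, which matches the point above: once the flag is
set unconditionally, fatal and non-fatal errors are both printed.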
> >> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> >> Co-developed-by: Goudar Manjunath Ramanagouda <manjunath.ramanagouda.goudar@xxxxxxxxx>
> >> Signed-off-by: Goudar Manjunath Ramanagouda <manjunath.ramanagouda.goudar@xxxxxxxxx>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
> >
> > I think this does the same as
> > https://git.kernel.org/cgit/linux/kernel/git/pci/pci.git/commit/?id=d4d1ecff2c2d
> > which is already queued for v7.1.
>
> Thanks for the reference.
>
> Since errors in the DPC path lead to port containment, I think it
> is best to always log them for reference and debug purposes. So I
> think we don't need to export aer_print_init() from the AER driver
> (which can ratelimit non-fatal DPC errors). Instead we can by default
> skip ratelimit for DPC errors by initializing ratelimit_print[0] =
> 1.
I think that makes sense. With Sizhe's patch, the pci_warn() in
dpc_process_error() is not ratelimited but the aer_print_error() part
is, so we always see the "containment event" warning but may not see
the rest.

I guess we only call dpc_get_aer_uncorrect_severity() for the
PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR case; the NFE, FE, and IN_EXT
cases aren't affected by this patch.

I replaced Sizhe's ratelimit patch on pci/dpc with this one, keeping
the patch that holds a reference while calling dpc_process_error().
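
The trigger-reason point above (only the RSN_UNCOR case reaches
dpc_get_aer_uncorrect_severity()) can be sketched as a userspace
model. The MODEL_* names and model_reads_port_aer() are hypothetical,
and the numeric values are my assumption of how the Trigger Reason
field of the DPC Status register is encoded; check pci_regs.h before
relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding of the DPC Status Trigger Reason field. */
#define MODEL_DPC_STATUS_TRIGGER	0x0001	/* DPC triggered */
#define MODEL_DPC_TRIGGER_RSN_MASK	0x0006	/* Trigger Reason bits */
#define MODEL_DPC_TRIGGER_RSN_UNCOR	0x0000	/* error detected by the port */
#define MODEL_DPC_TRIGGER_RSN_NFE	0x0002	/* ERR_NONFATAL received */
#define MODEL_DPC_TRIGGER_RSN_FE	0x0004	/* ERR_FATAL received */
#define MODEL_DPC_TRIGGER_RSN_IN_EXT	0x0006	/* reason in extension field */

/*
 * Returns 1 if this status value would take the path that reads the
 * DPC port's own AER status (and so is affected by the patch), 0 for
 * the NFE, FE, and IN_EXT cases.
 */
int model_reads_port_aer(uint16_t status)
{
	uint16_t reason = status & MODEL_DPC_TRIGGER_RSN_MASK;

	return reason == MODEL_DPC_TRIGGER_RSN_UNCOR;
}
```

For example, a status of MODEL_DPC_STATUS_TRIGGER alone (reason bits
zero) returns 1, while MODEL_DPC_STATUS_TRIGGER combined with any of
the NFE, FE, or IN_EXT reasons returns 0.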
> >> ---
> >> drivers/pci/pcie/dpc.c | 1 +
> >> 1 file changed, 1 insertion(+)
> >>
> >> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> >> index fc18349614d7..7605ddd9f0ba 100644
> >> --- a/drivers/pci/pcie/dpc.c
> >> +++ b/drivers/pci/pcie/dpc.c
> >> @@ -256,6 +256,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
> >>
> >> info->dev[0] = dev;
> >> info->error_dev_num = 1;
> >> + info->ratelimit_print[0] = 1;
> >>
> >> return 1;
> >> }
> >> --
> >> 2.43.0
> >>
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
>