Re: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery

From: Bowman, Terry
Date: Wed Jul 02 2025 - 17:07:11 EST




On 6/27/2025 6:05 AM, Jonathan Cameron wrote:
> On Thu, 26 Jun 2025 17:42:42 -0500
> Terry Bowman <terry.bowman@xxxxxxx> wrote:
>
>> Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
>> handling. Follow similar design as found in PCIe error driver,
>> pcie_do_recovery(). One difference is cxl_do_recovery() will treat all UCEs
>> as fatal with a kernel panic. This is to prevent corruption on CXL memory.
>>
>> Export the PCI error driver's merge_result() to CXL namespace.
> I think this may be a confusion from earlier review. Anyhow, it should
> be namespaced in the sense of not exporting something the vague name of
> merge_result but it's PCI code, not CXL code and we don't have the dangerous
> interface argument to justify putting it in the CXL namespace so I think
> a namespaced EXPORT makes little sense for this one.
>
> Jonathan
>
>
>> Introduce
>> PCI_ERS_RESULT_PANIC and add support in merge_result() routine. This will
>> be used by CXL to panic the system in the case of uncorrectable protocol
>> errors. PCI error handling is not currently expected to use the
>> PCI_ERS_RESULT_PANIC.
>>
>> Copy pci_walk_bridge() to cxl_walk_bridge(). Make a change to walk the
>> first device in all cases.
>>
>> Copy the PCI error driver's report_error_detected() to cxl_report_error_detected().
>> Note, only CXL Endpoints and RCH Downstream Ports(RCH DSP) are currently
>> supported. Add locking for PCI device as done in PCI's report_error_detected().
>> This is necessary to prevent the RAS registers from disappearing before
>> logging is completed.
>>
>> Call panic() to halt the system in the case of uncorrectable errors (UCE)
>> in cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use
>> if a UCE is not found. In this case the AER status must be cleared and
>> uses pci_aer_clear_fatal_status().
>>
>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index de6381c690f5..63fceb3e8613 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -21,9 +21,12 @@
>> #include "portdrv.h"
>> #include "../pci.h"
>>
>> -static pci_ers_result_t merge_result(enum pci_ers_result orig,
>> - enum pci_ers_result new)
>> +pci_ers_result_t merge_result(enum pci_ers_result orig,
>> + enum pci_ers_result new)
>> {
>> + if (new == PCI_ERS_RESULT_PANIC)
>> + return PCI_ERS_RESULT_PANIC;
>> +
>> if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>> return PCI_ERS_RESULT_NO_AER_DRIVER;
>>
>> @@ -45,6 +48,7 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
>>
>> return orig;
>> }
>> +EXPORT_SYMBOL_NS_GPL(merge_result, "CXL");
> Do we care about namespacing this? I think not given it is PCIe code
> and hardly destructive for other drivers to mess with it if they like.
>
> I would namespace it in the sense of renaming it to make it clear
> it's about pci errors though.
>
> pci_ers_merge_result() perhaps?
>
> Do that as a percursor patch.
>

Good idea. There is a lot of changes related to just exporting this and changing
the name. I've changed the namespace export to be:

EXPORT_SYMBOL(pci_ers_merge_result);

I moved this and its related required changes into an earlier patch.

-Terry

>>
>> static int report_error_detected(struct pci_dev *dev,
>> pci_channel_state_t state,