Re: [PATCH v6 00/16] Support Armv8 RAS Extensions for Kernel-first error handling

From: Umang Chheda

Date: Tue Mar 17 2026 - 03:22:04 EST



On 3/11/2026 8:55 AM, Ruidong Tian wrote:
>
>
> On 2026/3/9 21:28, Umang Chheda wrote:
>> Hello Ruidong Tian,
>>
>> On 1/22/2026 3:16 PM, Ruidong Tian wrote:
>>> Motivation: Reliability in Modern Data Centers
>>> =================================================
>>> In modern data centers, proactive maintenance is essential for achieving high
>>> service availability. The practice of using Corrected Errors (CE) to predict
>>> impending Uncorrected Errors (UE) is already widely deployed at scale across
>>> the industry, like Alibaba[2], Tencent[4], Intel[1], AMD[2]. By analyzing CE
>>> telemetry, operators can identify failing hardware and perform migrations
>>> before catastrophic failures occur.
>>>
>>> Problem: Inefficient CE Collection on ARM
>>> ==========================================
>>> Currently, ARM-based systems primarily rely on "Firmware-First" error
>>> handling (e.g., via GHES). This path is inherently heavy-weight. To avoid
>>> significant performance overhead, firmware is often configured with high
>>> thresholds—reporting to the OS only after thousands of CEs have occurred.
>>> If the threshold is set lower, the high frequency of errors leads to
>>> excessive and costly context switching between the OS and firmware.
>>> Consequently, ARM platforms currently lack an efficient mechanism to collect
>>> the granular CE data required for high-fidelity error prediction.
>>>
>>> Solution: Kernel-First Handling via AEST
>>> ===========================================
>>> Other architectures have long utilized "Kernel-First" approaches for
>>> efficient CE collection: Intel provides CMCI (Corrected Machine Check
>>> Interrupt), and AMD has recently introduced similar CE interrupt support[5].
>>>
>>> On the ARM architecture, hardware already provides the necessary RAS
>>> Extensions[6], and the ACPI AEST specification[0] defines a standardized way for
>>> the OS to discover these error source registers. This series implements
>>> AEST support, enabling the kernel to:
>>>
>>>   - Discover error sources directly via ACPI tables.
>>>   - Handle CE notifications via direct interrupts.
>>>   - Bypass firmware overhead to collect every CE or use low-latency thresholds.
>>>
>>> This implementation provides the missing link for efficient RAS telemetry
>>> on ARM, bringing it to parity with other enterprise architectures.
>>
>> Thanks for posting this series enabling kernel-first handling for the Armv8 RAS extensions.
>>
>> We noticed the current implementation targets ACPI-based server platforms. For embedded/SoC systems, Device Tree is often the primary firmware description.
>> Do you have any plans to add DT-based support for the same flow? If not, do you see any blockers to extending this series to support DT
>> (e.g., DT bindings plus a discovery/registration path analogous to the ACPI plumbing)?
>> If DT support is in scope, we would be happy to align on the expected approach and help with review, development, and testing for DT-based platforms.
>
> Hi Umang,
>
> Thanks for the reply.
>
> Adding Device Tree support should be easy. We just need a patch similar to "ACPI/AEST: Parse the AEST table" that populates struct acpi_aest_node (which might need renaming) and struct aest_hnode from the DT. The driver part requires minimal changes.
>
> However, I'm not very familiar with DT and lack DT engineering support, so I would need some guidance on these DT-related questions:
>
> - Is there a specification, similar to AEST, that outlines the reporting requirements for RAS extension information?

Currently there is no spec or binding defined for DT-based systems. Based on the discussion of an earlier patch [1], the maintainers
expect DT bindings to be aligned with, and equivalent to, the AEST spec. We therefore plan to propose DT bindings that are aligned
with the AEST spec.

[1] Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC - James Morse <https://lore.kernel.org/linux-edac/312fc8b8-7019-0c74-6a92-c6740cab5dad@xxxxxxx/>

> - How should the DT be designed?
>
Same as above: we plan to make the DT bindings equivalent to the AEST spec.

> - How can I develop QEMU and modify DT files for debugging, etc.? 

We can share the steps for doing this on DT-based systems in this thread.


>
> I would be happy to adjust the patchset to meet the needs of both parties if you are prepared to invest the necessary DT-related effort. In practice, I believe only minor modifications are required.

Thanks. Yes, we are open to contributing to the extension of this patch series to support DT-based systems as well.

[...]


Thanks,
Umang
