Re: [PATCH] ACPI: APEI: Handle repeated SEA error interrupts storm scenarios
From: hejunhao
Date: Tue Mar 24 2026 - 06:06:08 EST
Hi Shuai Xue,
On 2026/3/3 22:42, Shuai Xue wrote:
> Hi, junhao,
>
> On 2/27/26 8:12 PM, hejunhao wrote:
>>
>>
>> On 2025/11/4 9:32, Shuai Xue wrote:
>>>
>>>
>>>> On 2025/11/4 00:19, Rafael J. Wysocki wrote:
>>>> On Thu, Oct 30, 2025 at 8:13 AM Junhao He <hejunhao3@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> The do_sea() function defaults to using firmware-first mode, if supported.
>>>>> It invokes the ACPI APEI GHES helper ghes_notify_sea() to report and
>>>>> handle the SEA error. GHES uses a buffer to cache the 4 most recent kinds
>>>>> of SEA errors. If the same kind of SEA error keeps occurring, GHES skips
>>>>> reporting it and does not add it to the "ghes_estatus_llist" list until
>>>>> the cache entry times out after 10 seconds, at which point the SEA error
>>>>> is reprocessed.
>>>>>
>>>>> GHES invokes ghes_proc_in_irq() to handle the SEA error, which
>>>>> ultimately executes memory_failure() to process the page with hardware
>>>>> memory corruption. If the same SEA error appears multiple times in a
>>>>> row, it indicates that the previous handling was incomplete or unable
>>>>> to resolve the fault. In such cases, it is more appropriate to return a
>>>>> failure when encountering the same error again, and then proceed to
>>>>> arm64_do_kernel_sea for further processing.
>
> There is no such function in the arm64 tree. If apei_claim_sea() returns
Sorry for the mistake in the commit message. The function arm64_do_kernel_sea() should
be arm64_notify_die().
> an error, the actual fallback path in do_sea() is arm64_notify_die(),
> which sends SIGBUS?
>
If apei_claim_sea() returns an error, arm64_notify_die() calls arm64_force_sig_fault(inf->sig /* SIGBUS */, ...),
which then calls force_sig_fault(SIGBUS, ...) to force the process to receive the SIGBUS signal.
>>>>>
>>>>> When hardware memory corruption occurs, a memory error interrupt is
>>>>> triggered. If the kernel accesses this erroneous data, it will trigger
>>>>> the SEA error exception handler. All such handlers will call
>>>>> memory_failure() to handle the faulty page.
>>>>>
>>>>> If a memory error interrupt occurs first, followed by an SEA error
>>>>> interrupt, the faulty page is first marked as poisoned by the memory error
>>>>> interrupt process, and then the SEA error interrupt handling process will
>>>>> send a SIGBUS signal to the process accessing the poisoned page.
>>>>>
>>>>> However, if the SEA interrupt is reported first, the following exceptional
>>>>> scenario occurs:
>>>>>
>>>>> When a user process directly requests and accesses a page with hardware
>>>>> memory corruption via mmap (such as with devmem), the page containing
>>>>> this address may still be in a free buddy state in the kernel. At this
>>>>> point, the page is marked as "poisoned" during the SEA claim via
>>>>> memory_failure(). However, since the process does not obtain the page
>>>>> through the kernel's MMU-managed allocation path, the kernel cannot send
>>>>> a SIGBUS signal to the process, and the memory error interrupt handling
>>>>> path does not support sending SIGBUS either. As a result, the process
>>>>> continues to access the faulty page, causing repeated entries into the
>>>>> SEA exception handler and leading to an SEA error interrupt storm.
>
> In such a case, won't the user process accessing the poisoned page be killed
> by memory_failure()?
>
> // memory_failure():
>
> 	if (TestSetPageHWPoison(p)) {
> 		res = -EHWPOISON;
> 		if (flags & MF_ACTION_REQUIRED)
> 			res = kill_accessing_process(current, pfn, flags);
> 		if (flags & MF_COUNT_INCREASED)
> 			put_page(p);
> 		action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> 		goto unlock_mutex;
> 	}
>
> I think this problem has already been fixed by commit 2e6053fea379 ("mm/memory-failure:
> fix infinite UCE for VM_PFNMAP pfn").
>
> The root cause is that walk_page_range() skips VM_PFNMAP vmas by default when
> no .test_walk callback is set, so kill_accessing_process() returns 0 for a
> devmem-style mapping (remap_pfn_range, VM_PFNMAP), making the caller believe
> the UCE was handled properly while the process was never actually killed.
>
> Did you try the latest kernel version?
>
I retested this issue on kernel v7.0.0-rc4 with the following debug patch and was still able to reproduce it:
@@ -1365,8 +1365,11 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 	ghes_clear_estatus(ghes, &tmp_header, buf_paddr, fixmap_idx);

 	/* This error has been reported before, don't process it again. */
-	if (ghes_estatus_cached(estatus))
+	if (ghes_estatus_cached(estatus)) {
+		pr_info("This error has been reported before, don't process it again.\n");
 		goto no_work;
+	}
The test log (only some debug logs are retained here):
[2026/3/24 14:51:58.199] [root@localhost ~]# taskset -c 40 busybox devmem 0x1351811824 32 0
[2026/3/24 14:51:58.369] [root@localhost ~]# taskset -c 40 busybox devmem 0x1351811824 32
[2026/3/24 14:51:58.458] [ 130.558038][ C40] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[2026/3/24 14:51:58.459] [ 130.572517][ C40] {1}[Hardware Error]: event severity: recoverable
[2026/3/24 14:51:58.459] [ 130.578861][ C40] {1}[Hardware Error]: Error 0, type: recoverable
[2026/3/24 14:51:58.459] [ 130.585203][ C40] {1}[Hardware Error]: section_type: ARM processor error
[2026/3/24 14:51:58.459] [ 130.592238][ C40] {1}[Hardware Error]: MIDR: 0x0000000000000000
[2026/3/24 14:51:58.459] [ 130.598492][ C40] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081010400
[2026/3/24 14:51:58.459] [ 130.607871][ C40] {1}[Hardware Error]: error affinity level: 0
[2026/3/24 14:51:58.459] [ 130.614038][ C40] {1}[Hardware Error]: running state: 0x1
[2026/3/24 14:51:58.459] [ 130.619770][ C40] {1}[Hardware Error]: Power State Coordination Interface state: 0
[2026/3/24 14:51:58.459] [ 130.627673][ C40] {1}[Hardware Error]: Error info structure 0:
[2026/3/24 14:51:58.459] [ 130.633839][ C40] {1}[Hardware Error]: num errors: 1
[2026/3/24 14:51:58.459] [ 130.639137][ C40] {1}[Hardware Error]: error_type: 0, cache error
[2026/3/24 14:51:58.459] [ 130.645652][ C40] {1}[Hardware Error]: error_info: 0x0000000020400014
[2026/3/24 14:51:58.459] [ 130.652514][ C40] {1}[Hardware Error]: cache level: 1
[2026/3/24 14:51:58.551] [ 130.658073][ C40] {1}[Hardware Error]: the error has not been corrected
[2026/3/24 14:51:58.551] [ 130.665194][ C40] {1}[Hardware Error]: physical fault address: 0x0000001351811800
[2026/3/24 14:51:58.551] [ 130.673097][ C40] {1}[Hardware Error]: Vendor specific error info has 48 bytes:
[2026/3/24 14:51:58.551] [ 130.680744][ C40] {1}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
[2026/3/24 14:51:58.551] [ 130.690471][ C40] {1}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
[2026/3/24 14:51:58.552] [ 130.700198][ C40] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
[2026/3/24 14:51:58.552] [ 130.710083][ T9767] Memory failure: 0x1351811: recovery action for free buddy page: Recovered
[2026/3/24 14:51:58.638] [ 130.790952][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:51:58.903] [ 131.046994][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:51:58.991] [ 131.132360][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:51:59.969] [ 132.071431][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:00.860] [ 133.010255][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:01.927] [ 134.034746][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:02.906] [ 135.058973][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:03.971] [ 136.083213][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:04.860] [ 137.021956][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:06.018] [ 138.131460][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:06.905] [ 139.070280][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:07.886] [ 140.009147][ C40] This error has been reported before, don't process it again.
[2026/3/24 14:52:08.596] [ 140.777368][ C40] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[2026/3/24 14:52:08.683] [ 140.791921][ C40] {2}[Hardware Error]: event severity: recoverable
[2026/3/24 14:52:08.683] [ 140.798263][ C40] {2}[Hardware Error]: Error 0, type: recoverable
[2026/3/24 14:52:08.683] [ 140.804606][ C40] {2}[Hardware Error]: section_type: ARM processor error
[2026/3/24 14:52:08.683] [ 140.811641][ C40] {2}[Hardware Error]: MIDR: 0x0000000000000000
[2026/3/24 14:52:08.684] [ 140.817895][ C40] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081010400
[2026/3/24 14:52:08.684] [ 140.827274][ C40] {2}[Hardware Error]: error affinity level: 0
[2026/3/24 14:52:08.684] [ 140.833440][ C40] {2}[Hardware Error]: running state: 0x1
[2026/3/24 14:52:08.684] [ 140.839173][ C40] {2}[Hardware Error]: Power State Coordination Interface state: 0
[2026/3/24 14:52:08.684] [ 140.847076][ C40] {2}[Hardware Error]: Error info structure 0:
[2026/3/24 14:52:08.684] [ 140.853241][ C40] {2}[Hardware Error]: num errors: 1
[2026/3/24 14:52:08.684] [ 140.858540][ C40] {2}[Hardware Error]: error_type: 0, cache error
[2026/3/24 14:52:08.684] [ 140.865055][ C40] {2}[Hardware Error]: error_info: 0x0000000020400014
[2026/3/24 14:52:08.684] [ 140.871917][ C40] {2}[Hardware Error]: cache level: 1
[2026/3/24 14:52:08.684] [ 140.877475][ C40] {2}[Hardware Error]: the error has not been corrected
[2026/3/24 14:52:08.764] [ 140.884596][ C40] {2}[Hardware Error]: physical fault address: 0x0000001351811800
[2026/3/24 14:52:08.764] [ 140.892499][ C40] {2}[Hardware Error]: Vendor specific error info has 48 bytes:
[2026/3/24 14:52:08.766] [ 140.900145][ C40] {2}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
[2026/3/24 14:52:08.767] [ 140.909872][ C40] {2}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
[2026/3/24 14:52:08.767] [ 140.919598][ C40] {2}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
[2026/3/24 14:52:08.768] [ 140.929346][ T9767] Memory failure: 0x1351811: already hardware poisoned
[2026/3/24 14:52:08.768] [ 140.936072][ T9767] Memory failure: 0x1351811: Sending SIGBUS to busybox:9767 due to hardware memory corruption
After applying the fix patch:
@@ -1365,8 +1365,11 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 	ghes_clear_estatus(ghes, &tmp_header, buf_paddr, fixmap_idx);

 	/* This error has been reported before, don't process it again. */
-	if (ghes_estatus_cached(estatus))
+	if (ghes_estatus_cached(estatus)) {
+		pr_info("This error has been reported before, don't process it again.\n");
+		rc = -ECANCELED;
 		goto no_work;
+	}
[2026/3/24 16:45:40.084] [root@localhost ~]# taskset -c 40 busybox devmem 0x1351811824 32 0
[2026/3/24 16:45:40.272] [root@localhost ~]# taskset -c 40 busybox devmem 0x1351811824 32
[2026/3/24 16:45:40.362] [ 112.279324][ C40] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[2026/3/24 16:45:40.362] [ 112.293797][ C40] {1}[Hardware Error]: event severity: recoverable
[2026/3/24 16:45:40.362] [ 112.300139][ C40] {1}[Hardware Error]: Error 0, type: recoverable
[2026/3/24 16:45:40.363] [ 112.306481][ C40] {1}[Hardware Error]: section_type: ARM processor error
[2026/3/24 16:45:40.363] [ 112.313516][ C40] {1}[Hardware Error]: MIDR: 0x0000000000000000
[2026/3/24 16:45:40.363] [ 112.319771][ C40] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081010400
[2026/3/24 16:45:40.363] [ 112.329151][ C40] {1}[Hardware Error]: error affinity level: 0
[2026/3/24 16:45:40.363] [ 112.335317][ C40] {1}[Hardware Error]: running state: 0x1
[2026/3/24 16:45:40.363] [ 112.341049][ C40] {1}[Hardware Error]: Power State Coordination Interface state: 0
[2026/3/24 16:45:40.363] [ 112.348953][ C40] {1}[Hardware Error]: Error info structure 0:
[2026/3/24 16:45:40.363] [ 112.355119][ C40] {1}[Hardware Error]: num errors: 1
[2026/3/24 16:45:40.363] [ 112.360418][ C40] {1}[Hardware Error]: error_type: 0, cache error
[2026/3/24 16:45:40.363] [ 112.366932][ C40] {1}[Hardware Error]: error_info: 0x0000000020400014
[2026/3/24 16:45:40.363] [ 112.373795][ C40] {1}[Hardware Error]: cache level: 1
[2026/3/24 16:45:40.453] [ 112.379354][ C40] {1}[Hardware Error]: the error has not been corrected
[2026/3/24 16:45:40.453] [ 112.386475][ C40] {1}[Hardware Error]: physical fault address: 0x0000001351811800
[2026/3/24 16:45:40.453] [ 112.394378][ C40] {1}[Hardware Error]: Vendor specific error info has 48 bytes:
[2026/3/24 16:45:40.453] [ 112.402027][ C40] {1}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
[2026/3/24 16:45:40.453] [ 112.411754][ C40] {1}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
[2026/3/24 16:45:40.453] [ 112.421480][ C40] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
[2026/3/24 16:45:40.453] [ 112.431639][ T9769] Memory failure: 0x1351811: recovery action for free buddy page: Recovered
[2026/3/24 16:45:40.531] [ 112.512520][ C40] This error has been reported before, don't process it again.
[2026/3/24 16:45:40.757] Bus error (core dumped)
>>>>>
>>>>> Fix this by returning a failure when encountering the same error again.
>>>>>
>>>>> The following error logs are explained using the devmem process:
>>>>> NOTICE: SEA Handle
>>>>> NOTICE: SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
>>>>> NOTICE: skt[0x0]die[0x0]cluster[0x0]core[0x1]
>>>>> NOTICE: EsrEl3 = 0x92000410
>>>>> NOTICE: PA is valid: 0x1000093c00
>>>>> NOTICE: Hest Set GenericError Data
>>>>> [ 1419.542401][ C1] {57}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
>>>>> [ 1419.551435][ C1] {57}[Hardware Error]: event severity: recoverable
>>>>> [ 1419.557865][ C1] {57}[Hardware Error]: Error 0, type: recoverable
>>>>> [ 1419.564295][ C1] {57}[Hardware Error]: section_type: ARM processor error
>>>>> [ 1419.571421][ C1] {57}[Hardware Error]: MIDR: 0x0000000000000000
>>>>> [ 1419.571434][ C1] {57}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081000100
>>>>> [ 1419.586813][ C1] {57}[Hardware Error]: error affinity level: 0
>>>>> [ 1419.586821][ C1] {57}[Hardware Error]: running state: 0x1
>>>>> [ 1419.602714][ C1] {57}[Hardware Error]: Power State Coordination Interface state: 0
>>>>> [ 1419.602724][ C1] {57}[Hardware Error]: Error info structure 0:
>>>>> [ 1419.614797][ C1] {57}[Hardware Error]: num errors: 1
>>>>> [ 1419.614804][ C1] {57}[Hardware Error]: error_type: 0, cache error
>>>>> [ 1419.629226][ C1] {57}[Hardware Error]: error_info: 0x0000000020400014
>>>>> [ 1419.629234][ C1] {57}[Hardware Error]: cache level: 1
>>>>> [ 1419.642006][ C1] {57}[Hardware Error]: the error has not been corrected
>>>>> [ 1419.642013][ C1] {57}[Hardware Error]: physical fault address: 0x0000001000093c00
>>>>> [ 1419.654001][ C1] {57}[Hardware Error]: Vendor specific error info has 48 bytes:
>>>>> [ 1419.654014][ C1] {57}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
>>>>> [ 1419.670685][ C1] {57}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
>>>>> [ 1419.670692][ C1] {57}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
>>>>> [ 1419.783606][T54990] Memory failure: 0x1000093: recovery action for free buddy page: Recovered
>>>>> [ 1419.919580][ T9955] EDAC MC0: 1 UE Multi-bit ECC on unknown memory (node:0 card:1 module:71 bank:7 row:0 col:0 page:0x1000093 offset:0xc00 grain:1 - APEI location: node:0 card:257 module:71 bank:7 row:0 col:0)
>>>>> NOTICE: SEA Handle
>>>>> NOTICE: SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
>>>>> NOTICE: skt[0x0]die[0x0]cluster[0x0]core[0x1]
>>>>> NOTICE: EsrEl3 = 0x92000410
>>>>> NOTICE: PA is valid: 0x1000093c00
>>>>> NOTICE: Hest Set GenericError Data
>>>>> NOTICE: SEA Handle
>>>>> NOTICE: SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
>>>>> NOTICE: skt[0x0]die[0x0]cluster[0x0]core[0x1]
>>>>> NOTICE: EsrEl3 = 0x92000410
>>>>> NOTICE: PA is valid: 0x1000093c00
>>>>> NOTICE: Hest Set GenericError Data
>>>>> ...
>>>>> ... ---> SEA error interrupt storm happens
>>>>> ...
>>>>> NOTICE: SEA Handle
>>>>> NOTICE: SpsrEl3 = 0x60001000, ELR_EL3 = 0xffffc6ab42671400
>>>>> NOTICE: skt[0x0]die[0x0]cluster[0x0]core[0x1]
>>>>> NOTICE: EsrEl3 = 0x92000410
>>>>> NOTICE: PA is valid: 0x1000093c00
>>>>> NOTICE: Hest Set GenericError Data
>>>>> [ 1429.818080][ T9955] Memory failure: 0x1000093: already hardware poisoned
>>>>> [ 1429.825760][ C1] ghes_print_estatus: 1 callbacks suppressed
>>>>> [ 1429.825763][ C1] {59}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
>>>>> [ 1429.843731][ C1] {59}[Hardware Error]: event severity: recoverable
>>>>> [ 1429.861800][ C1] {59}[Hardware Error]: Error 0, type: recoverable
>>>>> [ 1429.874658][ C1] {59}[Hardware Error]: section_type: ARM processor error
>>>>> [ 1429.887516][ C1] {59}[Hardware Error]: MIDR: 0x0000000000000000
>>>>> [ 1429.901159][ C1] {59}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081000100
>>>>> [ 1429.901166][ C1] {59}[Hardware Error]: error affinity level: 0
>>>>> [ 1429.914896][ C1] {59}[Hardware Error]: running state: 0x1
>>>>> [ 1429.914903][ C1] {59}[Hardware Error]: Power State Coordination Interface state: 0
>>>>> [ 1429.933319][ C1] {59}[Hardware Error]: Error info structure 0:
>>>>> [ 1429.946261][ C1] {59}[Hardware Error]: num errors: 1
>>>>> [ 1429.946269][ C1] {59}[Hardware Error]: error_type: 0, cache error
>>>>> [ 1429.970847][ C1] {59}[Hardware Error]: error_info: 0x0000000020400014
>>>>> [ 1429.970854][ C1] {59}[Hardware Error]: cache level: 1
>>>>> [ 1429.988406][ C1] {59}[Hardware Error]: the error has not been corrected
>>>>> [ 1430.013419][ C1] {59}[Hardware Error]: physical fault address: 0x0000001000093c00
>>>>> [ 1430.013425][ C1] {59}[Hardware Error]: Vendor specific error info has 48 bytes:
>>>>> [ 1430.025424][ C1] {59}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
>>>>> [ 1430.053736][ C1] {59}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000 ................
>>>>> [ 1430.066341][ C1] {59}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
>>>>> [ 1430.294255][T54990] Memory failure: 0x1000093: already hardware poisoned
>>>>> [ 1430.305518][T54990] 0x1000093: Sending SIGBUS to devmem:54990 due to hardware memory corruption
>>>>>
>>>>> Signed-off-by: Junhao He <hejunhao3@xxxxxxxxxxxxxx>
>>>>> ---
>>>>> drivers/acpi/apei/ghes.c | 4 +++-
>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>> index 005de10d80c3..eebda39bfc30 100644
>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>> @@ -1343,8 +1343,10 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>>>> ghes_clear_estatus(ghes, &tmp_header, buf_paddr, fixmap_idx);
>>>>>
>>>>> /* This error has been reported before, don't process it again. */
>>>>> - if (ghes_estatus_cached(estatus))
>>>>> + if (ghes_estatus_cached(estatus)) {
>>>>> + rc = -ECANCELED;
>>>>> goto no_work;
>>>>> + }
>>>>>
>>>>> llist_add(&estatus_node->llnode, &ghes_estatus_llist);
>>>>>
>>>>> --
>>>>
>>>> This needs a response from the APEI reviewers as per MAINTAINERS, thanks!
>>>
>>> Hi, Rafael and Junhao,
>>>
>>> Sorry for the late response. I tried to reproduce the issue, and it
>>> seems that EINJ is broken in 6.18.0-rc1+.
>>>
>>> [ 3950.741186] CPU: 36 UID: 0 PID: 74112 Comm: einj_mem_uc Tainted: G E 6.18.0-rc1+ #227 PREEMPT(none)
>>> [ 3950.751749] Tainted: [E]=UNSIGNED_MODULE
>>> [ 3950.755655] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 1.91 07/29/2022
>>> [ 3950.763797] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [ 3950.770729] pc : acpi_os_write_memory+0x108/0x150
>>> [ 3950.775419] lr : acpi_os_write_memory+0x28/0x150
>>> [ 3950.780017] sp : ffff800093fbba40
>>> [ 3950.783319] x29: ffff800093fbba40 x28: 0000000000000000 x27: 0000000000000000
>>> [ 3950.790425] x26: 0000000000000002 x25: ffffffffffffffff x24: 000000403f20e400
>>> [ 3950.797530] x23: 0000000000000000 x22: 0000000000000008 x21: 000000000000ffff
>>> [ 3950.804635] x20: 0000000000000040 x19: 000000002f7d0018 x18: 0000000000000000
>>> [ 3950.811741] x17: 0000000000000000 x16: ffffae52d36ae5d0 x15: 000000001ba8e890
>>> [ 3950.818847] x14: 0000000000000000 x13: 0000000000000000 x12: 0000005fffffffff
>>> [ 3950.825952] x11: 0000000000000001 x10: ffff00400d761b90 x9 : ffffae52d365b198
>>> [ 3950.833058] x8 : 0000280000000000 x7 : 000000002f7d0018 x6 : ffffae52d5198548
>>> [ 3950.840164] x5 : 000000002f7d1000 x4 : 0000000000000018 x3 : ffff204016735060
>>> [ 3950.847269] x2 : 0000000000000040 x1 : 0000000000000000 x0 : ffff8000845bd018
>>> [ 3950.854376] Call trace:
>>> [ 3950.856814] acpi_os_write_memory+0x108/0x150 (P)
>>> [ 3950.861500] apei_write+0xb4/0xd0
>>> [ 3950.864806] apei_exec_write_register_value+0x88/0xc0
>>> [ 3950.869838] __apei_exec_run+0xac/0x120
>>> [ 3950.873659] __einj_error_inject+0x88/0x408 [einj]
>>> [ 3950.878434] einj_error_inject+0x168/0x1f0 [einj]
>>> [ 3950.883120] error_inject_set+0x48/0x60 [einj]
>>> [ 3950.887548] simple_attr_write_xsigned.constprop.0.isra.0+0x14c/0x1d0
>>> [ 3950.893964] simple_attr_write+0x1c/0x30
>>> [ 3950.897873] debugfs_attr_write+0x54/0xa0
>>> [ 3950.901870] vfs_write+0xc4/0x240
>>> [ 3950.905173] ksys_write+0x70/0x108
>>> [ 3950.908562] __arm64_sys_write+0x20/0x30
>>> [ 3950.912471] invoke_syscall+0x4c/0x110
>>> [ 3950.916207] el0_svc_common.constprop.0+0x44/0xe8
>>> [ 3950.920893] do_el0_svc+0x20/0x30
>>> [ 3950.924194] el0_svc+0x38/0x160
>>> [ 3950.927324] el0t_64_sync_handler+0x98/0xe0
>>> [ 3950.931491] el0t_64_sync+0x184/0x188
>>> [ 3950.935140] Code: 14000006 7101029f 54000221 d50332bf (f9000015)
>>> [ 3950.941210] ---[ end trace 0000000000000000 ]---
>>> [ 3950.945807] Kernel panic - not syncing: Oops: Fatal exception
>>>
>>> We need to fix it first.
>>
>> Hi Shuai Xue,
>>
>> Sorry for my late reply. Thank you for the review.
>> To clarify the issue:
>> This problem was introduced in v6.18-rc1 via a suspicious ARM64
>> memory mapping change [1]. I can reproduce the crash consistently
>> using the v6.18-rc1 kernel with this patch applied.
>>
>> Crucially, the crash disappears when the change is reverted — error
>> injection completes successfully without any kernel panic or oops.
>> This confirms that the ARM64 memory mapping change is the root cause.
>>
>> As noted in the original report, the change was reverted in v6.19-rc1, and
>> subsequent kernels (including v6.19-rc1 and later) are stable and do not
>> exhibit this problem.
>>
>> reproduce logs:
>> [ 216.347073] Unable to handle kernel write to read-only memory at virtual address ffff800084825018
>> ...
>> [ 216.475949] CPU: 75 UID: 0 PID: 11477 Comm: sh Kdump: loaded Not tainted 6.18.0-rc1+ #60 PREEMPT
>> [ 216.486561] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 1.91 07/29/2022
>> [ 216.587297] Call trace:
>> [ 216.589904] acpi_os_write_memory+0x188/0x1c8 (P)
>> [ 216.594763] apei_write+0xcc/0xe8
>> [ 216.598238] apei_exec_write_register_value+0x90/0xd0
>> [ 216.603437] __apei_exec_run+0xb0/0x128
>> [ 216.607420] __einj_error_inject+0xac/0x450
>> [ 216.611750] einj_error_inject+0x19c/0x220
>> [ 216.615988] error_inject_set+0x4c/0x68
>> [ 216.619962] simple_attr_write_xsigned.constprop.0.isra.0+0xe8/0x1b0
>> [ 216.626445] simple_attr_write+0x20/0x38
>> [ 216.630502] debugfs_attr_write+0x58/0xa8
>> [ 216.634643] vfs_write+0xdc/0x408
>> [ 216.638088] ksys_write+0x78/0x118
>> [ 216.641610] __arm64_sys_write+0x24/0x38
>> [ 216.645648] invoke_syscall+0x50/0x120
>> [ 216.649510] el0_svc_common.constprop.0+0xc8/0xf0
>> [ 216.654318] do_el0_svc+0x24/0x38
>> [ 216.657742] el0_svc+0x38/0x150
>> [ 216.660996] el0t_64_sync_handler+0xa0/0xe8
>> [ 216.665286] el0t_64_sync+0x1ac/0x1b0
>> [ 216.669054] Code: d65f03c0 710102ff 540001e1 d50332bf (f9000295)
>> [ 216.675244] ---[ end trace 0000000000000000 ]---
>>
>> [1] https://lore.kernel.org/all/20251121224611.07efa95a@xxxxxxx/
>>
>> Best regards,
>> Junhao.
>
> Thanks for clarifying the issue.
>
> Thanks.
> Shuai