Re: [PATCH v3] mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison

From: mawupeng

Date: Thu May 21 2026 - 05:33:10 EST




On Thu 2026-5-21 16:48, Oscar Salvador (SUSE) wrote:
> On Wed, May 20, 2026 at 07:24:28PM +0800, mawupeng wrote:
>
>> You are correct. The refcount dropping logic in the `unmap` path was indeed flawed.
>> This issue was originally uncovered by fuzzing. Based on the initial stack trace,
>> we diagnosed it as a recursive locking (AA) deadlock on `hugetlb_lock`.
>>
>> We initially suspected that `unmap` had prematurely released the folio reference
>> count, triggering the free path. However, after a thorough analysis of the refcount
>> state machine and the actual execution context, we confirmed that this hypothesis
>> is impossible. The root cause lies elsewhere in the locking hierarchy, and we are
>> currently tracing the exact call path that leads to the nested `hugetlb_lock`
>> acquisition.
>>
>> The deadlock can be triggered by injecting hardware poison errors on a hugetlb
>> page while concurrent unmapping activity occurs. The following minimal userspace
>> test case demonstrates the race condition by spawning multiple processes to
>> widen the timing window for the lock contention.
>
>
> After staring at it, it is obvious the code is wrong.
> We __should__ not be calling folio_put under the lock, as recursion will
> happen if we are the last user holding a reference.
> Thinking about it, I cannot think of a way we would need nesting here.
>
> Anyway, this is a genuine bug, so thanks for that, but it all got very
> confusing because of the traces pointing to the wrong places.
> The thing is quite simple:
>
> - We start with the assumption that a hugetlb folio is mapped to
> userspace and that madvise
>
>  thread#0                                          thread#1
>  madvise(folio, MADV_HWPOISON) (we poisoned the page)
>  madvise(folio, MADV_HWPOISON) (second call)
>                                                    unmap(folio)
>   try_memory_failure_hugetlb
>    get_huge_page_for_hwpoison (takes lock)
>     __get_huge_page_for_hwpoison
>      hugetlb_update_hwpoison
>      - we get MF_HUGETLB_FOLIO_PRE_POISONED
>        we jump to out which does
>       folio_put
>        free_huge_page (takes lock.. yaiks)
>
>
> So yes, the fix is to have the folio_put happening not within the lock.
>
> Please, send the patch with the right changelog (and no version) and I will ack it.

Thanks for reviewing.

I will send a new version with the right changelog.
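
For illustration only (this is not the patch itself): a minimal userspace
sketch of the locking shape discussed above, under the assumption that the
last folio_put ends up in a free path that takes the same lock. Every name
here (mock_folio, mock_hugetlb_lock, mock_folio_put, the two poison_put_*
helpers) is hypothetical and only models the kernel objects; an
error-checking pthread mutex stands in for hugetlb_lock so that the nested
acquisition returns EDEADLK instead of hanging. The sketch is single-threaded
on purpose and does not model the race with unmap.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a folio; not a kernel type. */
struct mock_folio {
	int refcount;
	bool freed;
};

static pthread_mutex_t mock_hugetlb_lock;
static bool mock_lock_ready;

/* Error-checking mutex: a relock by the owner fails with EDEADLK. */
static void mock_lock_init(void)
{
	pthread_mutexattr_t attr;
	int rc;

	if (mock_lock_ready)
		return;
	rc = pthread_mutexattr_init(&attr);
	assert(rc == 0);
	rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	assert(rc == 0);
	rc = pthread_mutex_init(&mock_hugetlb_lock, &attr);
	assert(rc == 0);
	pthread_mutexattr_destroy(&attr);
	(void)rc;
	mock_lock_ready = true;
}

/* Stand-in for the free path, which needs the lock itself. */
static int mock_free_folio(struct mock_folio *folio)
{
	int rc = pthread_mutex_lock(&mock_hugetlb_lock);

	if (rc != 0)
		return rc;	/* EDEADLK: the caller already holds the lock */
	folio->freed = true;
	pthread_mutex_unlock(&mock_hugetlb_lock);
	return 0;
}

/* Drop one reference; the last reference runs the free path. */
static int mock_folio_put(struct mock_folio *folio)
{
	if (--folio->refcount == 0)
		return mock_free_folio(folio);
	return 0;
}

/* Problem shape: the put happens while the lock is still held. */
static int poison_put_under_lock(struct mock_folio *folio)
{
	int rc;

	mock_lock_init();
	pthread_mutex_lock(&mock_hugetlb_lock);
	rc = mock_folio_put(folio);	/* nested lock attempt if last ref */
	pthread_mutex_unlock(&mock_hugetlb_lock);
	return rc;
}

/* Fixed shape: only record that a put is owed, drop it after unlock. */
static int poison_put_after_unlock(struct mock_folio *folio)
{
	bool need_put;

	mock_lock_init();
	pthread_mutex_lock(&mock_hugetlb_lock);
	need_put = true;	/* models the already-poisoned exit path */
	pthread_mutex_unlock(&mock_hugetlb_lock);

	return need_put ? mock_folio_put(folio) : 0;
}
```

With a folio whose refcount is 1, poison_put_under_lock() reports EDEADLK
and the folio is not freed, while poison_put_after_unlock() returns 0 and
frees it. With a real spinlock there is no error return, the CPU simply
spins on itself.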
