Re: [PATCH] khugepaged: Reduce race probability between migration and khugepaged

From: Dev Jain
Date: Mon Jun 30 2025 - 04:41:07 EST



On 30/06/25 1:49 pm, David Hildenbrand wrote:
> On 30.06.25 10:12, Dev Jain wrote:

>> On 30/06/25 1:25 pm, Anshuman Khandual wrote:
>>> On 30/06/25 10:18 AM, Dev Jain wrote:
>>>> Suppose a folio is under migration, and khugepaged is also trying to
>>>> collapse it. collapse_pte_mapped_thp() will retrieve the folio from the
>>>> page cache via filemap_lock_folio(), thus taking a reference on the folio
>>>> and sleeping on the folio lock, since the lock is held by the migration
>>>> path. Migration will then fail in
>>>> __folio_migrate_mapping -> folio_ref_freeze. Reduce the probability of
>>>> such a race happening (leading to migration failure) by bailing out
>>>> if we detect a PMD is marked with a migration entry.
>>> Could the migration be re-attempted after such a failure? It seems like
>>> the migration failure here is traded for a scan failure instead.

>> We already re-attempt migration. See NR_MAX_MIGRATE_PAGES_RETRY and
>> NR_MAX_MIGRATE_ASYNC_RETRY. Also, just before freezing the refcount,
>> we do a suitable refcount check in folio_migrate_mapping(). The race
>> therefore happens between this check and folio_ref_freeze() in
>> __folio_migrate_mapping(), so the window for the race is already very
>> small in the migration path, but large in the khugepaged path.
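
To spell out that window: folio_migrate_mapping() first bails out if the
refcount does not match the expected value, and __folio_migrate_mapping()
then freezes the refcount under the xarray lock. Roughly (a simplified
sketch of the mm/migrate.c logic, not the verbatim upstream code):

        /* folio_migrate_mapping(): early bail-out before the freeze */
        int expected_count = folio_expected_refs(mapping, folio) + extra_count;

        if (folio_ref_count(folio) != expected_count)
                return -EAGAIN;

        /* __folio_migrate_mapping(): the actual freeze */
        xas_lock_irq(&xas);
        if (!folio_ref_freeze(folio, expected_count)) {
                /* e.g. khugepaged took a reference via filemap_lock_folio() */
                xas_unlock_irq(&xas);
                return -EAGAIN;
        }

The extra reference khugepaged takes between these two points is what makes
the freeze fail.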


>>>> This fixes the migration-shared-anon-thp testcase failure on Apple M3.
>>> Could you please provide some more context on why this test case was
>>> failing earlier and how this change fixes the problem?

>> IMHO the explanation I have given in the patch description is clear
>> and succinct: the testcase is failing due to the race. This patch
>> shortens the race window, and on this particular hardware the test
>> no longer hits it.


>>>> Note that this is not a "fix", since it only reduces the chance of
>>>> khugepaged interfering with migration; both kernel functionalities
>>>> are deemed "best-effort".
>>>> Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
>>>> ---
>>>>
>>>> This patch was part of
>>>> https://lore.kernel.org/all/20250625055806.82645-1-dev.jain@xxxxxxx/
>>>> but I have sent it separately at Lorenzo's suggestion, and also because
>>>> I plan to send the first two patches after David Hildenbrand's
>>>> folio_pte_batch series gets merged.
>>>>
>>>>  mm/khugepaged.c | 12 ++++++++++--
>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 1aa7ca67c756..99977bb9bf6a 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -31,6 +31,7 @@ enum scan_result {
>>>>          SCAN_FAIL,
>>>>          SCAN_SUCCEED,
>>>>          SCAN_PMD_NULL,
>>>> +        SCAN_PMD_MIGRATION,
>>>>          SCAN_PMD_NONE,
>>>>          SCAN_PMD_MAPPED,
>>>>          SCAN_EXCEED_NONE_PTE,
>>>> @@ -941,6 +942,8 @@ static inline int check_pmd_state(pmd_t *pmd)
>>>>          if (pmd_none(pmde))
>>>>                  return SCAN_PMD_NONE;
>>>> +        if (is_pmd_migration_entry(pmde))
>>>> +                return SCAN_PMD_MIGRATION;
>>>>          if (!pmd_present(pmde))
>>>>                  return SCAN_PMD_NULL;
>>>>          if (pmd_trans_huge(pmde))
>>>> @@ -1502,9 +1505,12 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>              !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
>>>>                  return SCAN_VMA_CHECK;
>>>>
>>>> -        /* Fast check before locking page if already PMD-mapped */
>>>> +        /*
>>>> +         * Fast check before locking folio if already PMD-mapped, or if the
>>>> +         * folio is under migration
>>>> +         */
>>>>          result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>>>> -        if (result == SCAN_PMD_MAPPED)
>>>> +        if (result == SCAN_PMD_MAPPED || result == SCAN_PMD_MIGRATION)
>>> Should a mapped PMD and a migrating PMD be treated equally while scanning?

>> SCAN_PMD_MAPPED is used as an indicator to change result to SCAN_SUCCEED
>> in khugepaged_scan_mm_slot(), after the call to collapse_pte_mapped_thp().
>> It is also used in madvise_collapse() to increment thps, which in turn
>> determines the return value of madvise_collapse(). So I think this
>> approach would be wrong.
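
For reference, the two consumers look roughly like this (paraphrased from
mm/khugepaged.c, not the exact upstream code):

        /* khugepaged_scan_mm_slot(): a PMD-mapped range counts as success */
        if (result == SCAN_PMD_MAPPED)
                result = SCAN_SUCCEED;

        /* madvise_collapse(): a PMD-mapped range counts towards thps */
        case SCAN_PMD_MAPPED:
                ++thps;
                break;

i.e. both callers treat SCAN_PMD_MAPPED as meaning the range is already
backed by a THP.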

> But if it already is PMD mapped (just temporarily through a migration entry), isn't this exactly what we want?

Good point. I was about to bring up PMD-folio splitting during migration,
but while unmapping the folios, if we cannot migrate a PMD-folio, it will
be split via unmap_folio() along with the PMD. Therefore we can be sure
that if we encounter a PMD migration entry, it will eventually be
converted back to a PMD leaf entry on migration success or failure.


I'll merge this into SCAN_PMD_MAPPED, thanks.
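
Concretely, something like this (an untested sketch of what the folded-in
check could look like; the comment is mine):

static inline int check_pmd_state(pmd_t *pmd)
{
        pmd_t pmde = pmdp_get_lockless(pmd);

        if (pmd_none(pmde))
                return SCAN_PMD_NONE;
        /*
         * The folio is PMD-mapped through a migration entry, and the entry
         * turns back into a present PMD leaf whether migration succeeds or
         * fails, so report it like an already collapsed PMD.
         */
        if (is_pmd_migration_entry(pmde))
                return SCAN_PMD_MAPPED;
        if (!pmd_present(pmde))
                return SCAN_PMD_NULL;
        if (pmd_trans_huge(pmde))
                return SCAN_PMD_MAPPED;
        if (pmd_bad(pmde))
                return SCAN_PMD_NULL;
        return SCAN_SUCCEED;
}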