On 30.06.25 10:12, Dev Jain wrote:
On 30/06/25 1:25 pm, Anshuman Khandual wrote:
On 30/06/25 10:18 AM, Dev Jain wrote:
Suppose a folio is under migration, and khugepaged is also trying toCould the migration be re-attempted after such failure ? Seems like
collapse it. collapse_pte_mapped_thp() will retrieve the folio from the
page cache via filemap_lock_folio(), thus taking a reference on the folio
and sleeping on the folio lock, since the lock is held by the migration
path. Migration will then fail in
__folio_migrate_mapping -> folio_ref_freeze. Reduce the probability of
such a race happening (leading to migration failure) by bailing out
if we detect a PMD is marked with a migration entry.
the migration failure here is traded for a scan failure instead.
We already re-attempt migration. See NR_MAX_MIGRATE_PAGES_RETRY and
NR_MAX_MIGRATE_ASYNC_RETRY. Also just before freezing the refcount,
we do a suitable refcount check in folio_migrate_mapping(). So the
race happens after this and folio_ref_freeze() in __folio_migrate_mapping(),
therefore the window for the race is already very small in the migration
path, but large in the khugepaged path.
This fixes the migration-shared-anon-thp testcase failure on Apple M3.Could you please provide some more context why this test case was
failing earlier and how does this change here fixes the problem ?
IMHO the explanation I have given in the patch description is clear
and succinct: the testcase is failing due to the race. This patch
shortens the race window, and the test on this particular hardware
does not hit the race window again.
Note that, this is not a "fix" since it only reduces the chance ofShould mapped PMD and migrating PMD be treated equally while scanning ?
interference of khugepaged with migration, wherein both the kernel
functionalities are deemed "best-effort".
Signed-off-by: Dev Jain <dev.jain@xxxxxxx>---
This patch was part of
https://lore.kernel.org/all/20250625055806.82645-1-dev.jain@xxxxxxx/
but I have sent it separately on suggestion of Lorenzo, and also because
I plan to send the first two patches after David Hildenbrand's
folio_pte_batch series gets merged.
mm/khugepaged.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1aa7ca67c756..99977bb9bf6a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -31,6 +31,7 @@ enum scan_result {
SCAN_FAIL,
SCAN_SUCCEED,
SCAN_PMD_NULL,
+ SCAN_PMD_MIGRATION,
SCAN_PMD_NONE,
SCAN_PMD_MAPPED,
SCAN_EXCEED_NONE_PTE,
@@ -941,6 +942,8 @@ static inline int check_pmd_state(pmd_t *pmd)
if (pmd_none(pmde))
return SCAN_PMD_NONE;
+ if (is_pmd_migration_entry(pmde))
+ return SCAN_PMD_MIGRATION;
if (!pmd_present(pmde))
return SCAN_PMD_NULL;
if (pmd_trans_huge(pmde))
@@ -1502,9 +1505,12 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
!range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
return SCAN_VMA_CHECK;
- /* Fast check before locking page if already PMD-mapped */
+ /*
+ * Fast check before locking folio if already PMD-mapped, or if the
+ * folio is under migration
+ */
result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
- if (result == SCAN_PMD_MAPPED)
+ if (result == SCAN_PMD_MAPPED || result == SCAN_PMD_MIGRATION)
SCAN_PMD_MAPPED is used as an indicator to change result to SCAN_SUCCEED
in khugepaged_scan_mm_slot: after the call to collapse_pte_mapped_thp. And,
it is also used in madvise_collapse() to do ++thps which is used to set the
return value of madvise_collapse. So I think this approach will be wrong.
But if it already is PMD mapped (just temporarily through a migration entry), isn't this exactly what we want?