Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

From: Garg, Shivank

Date: Wed May 20 2026 - 12:13:18 EST

On 5/12/2026 7:45 AM, Huang, Ying wrote:
> "Garg, Shivank" <shivankg@xxxxxxx> writes:
>
>> On 5/9/2026 1:19 PM, Huang, Ying wrote:
>>> "Garg, Shivank" <shivankg@xxxxxxx> writes:
>>>
>>>> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>>>>> Hi, Shivank,
>>>>>
>>>>> "Garg, Shivank" <shivankg@xxxxxxx> writes:
>>>>>
>>>>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>>>>> Shivank Garg <shivankg@xxxxxxx> writes:
>>>>>>

>> Thanks. The tables below sweep NR_MAX_BATCHED_MIGRATION from 512 up to 262144. On 2M folios
>> with 16-channel PTDMA, the knee is at N=8192-16384 (= {16 to 32} * 512).
>>
>>>>>> 8192 12.56 | 2424 26.57 | 1118 58.72 | 470 *
>
> 2048 12.30 | 613 25.47 | 290 25.48 | 291
>
> IIUC, N=2048 already helps dma4. And, the latency looks OK too. The
> good batch size is hardware configuration dependent too? If so, we may
> need to add another migrator callback for that.

Yeah, right.

>> One thing worth flagging on the "bounded default": at the upstream cap of 512 pages,
>> migrate_pages_batch() receives at most one 2M folio per call, so PTDMA can only use
>> one of its 16 channels per batch and the offload reduces to vanilla. (DCBM offloads
>> one 2M folio to each channel).
>> The larger-N rows are what exercise the channel parallelism in the PTDMA case.
>>
>> SDXI-like [1] memory-to-memory data movers should reach good throughput with just 1 channel,
>> and thus may not require increasing NR_MAX_BATCHED_MIGRATION.
>>
>> I'm not tying this series to a specific perf default for now. The design review (batch-copy
>> path, migrator interface, registration, static_call dispatch) is the part I'd like to converge
>> on first; the threshold can be tuned after that. Does that ordering work?
>
> IMHO, we need some performance data to justify the added complexity.
> So, threshold tuning isn't the goal, whether we can get better
> throughput with some bounded latency is.
>

Fair point.

As you pointed out, PTDMA with 4 channels improves both throughput and
inaccessible time. PTDMA with 16 channels may be preferred by users who value
throughput more. The trade-off differs across hardware, so I think another
migrator callback for the preferred batch size sounds like a good idea.
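
To make the callback idea a bit more concrete, here is a rough userspace
sketch, not code from this series. All names in it (struct migrator_sketch,
preferred_batch_pages, per_channel_batch, batch_limit, channels_used) are
hypothetical and only illustrate the arithmetic discussed above: with 4K base
pages a 2M folio is 512 pages, so a 512-page batch holds one folio and can keep
only one channel busy, while one folio per channel gives 2048 pages for 4
channels and 8192 pages for 16 channels.

```c
/* Pages in one 2M folio with 4K base pages (2M / 4K = 512). */
#define PAGES_PER_2M_FOLIO 512u

/* Hypothetical migrator descriptor; NOT the interface from this series. */
struct migrator_sketch {
	const char *name;
	unsigned int nr_channels;
	/* Hypothetical callback: preferred batch size, in base pages. */
	unsigned int (*preferred_batch_pages)(const struct migrator_sketch *m);
};

/* Assumed policy: one 2M folio per channel, so each channel gets work. */
static unsigned int per_channel_batch(const struct migrator_sketch *m)
{
	return m->nr_channels * PAGES_PER_2M_FOLIO;
}

/* Use the 512-page cap when the migrator provides no callback. */
static unsigned int batch_limit(const struct migrator_sketch *m)
{
	if (m && m->preferred_batch_pages)
		return m->preferred_batch_pages(m);
	return PAGES_PER_2M_FOLIO;
}

/* Channels that can be busy when a batch of nr_pages holds only 2M folios. */
static unsigned int channels_used(unsigned int nr_pages,
				  unsigned int nr_channels)
{
	unsigned int folios = nr_pages / PAGES_PER_2M_FOLIO;

	return folios < nr_channels ? folios : nr_channels;
}
```

A migrator that reaches good throughput with a single channel would simply
leave the callback unset and keep the 512-page bound. Whether the preference
should be a plain page count or also carry a latency bound is open.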

Thanks,
Shivank