Re: [RFC PATCH v4 5/6] drivers/migrate_offload: add DMA batch copy driver (dcbm)

From: Garg, Shivank

Date: Fri Apr 24 2026 - 07:31:23 EST

On 4/23/2026 7:43 PM, Vinod Koul wrote:
> On 23-04-26, 17:40, Garg, Shivank wrote:
>> Hi Vinod,
>>
>> Following your suggestion at the Kernel meetup in Bangalore (11 Apr 2026)
>> to check 0cae04373b ("dmaengine: remove DMA_MEMCPY_SG once again") and use
>> DMA_MEMCPY_SG / dmaengine_prep_dma_memcpy_sg() (I added a
>> device_prep_dma_memcpy_sg hook in drivers/dma/amd/ptdma/ptdma-dmaengine.c
>> for this experiment; not posted),
>> I ran an A/B comparison against the existing DCBM path that uses
>> dmaengine_prep_dma_memcpy() in a loop over mapped SGL segments.
>>
>> I'm using the move_pages() workload to move 1 GB of data per run. I do not
>> see a significant performance difference, and the results are broadly
>> within each other's noise band.
>>
>> Throughput (GB/s, mean ± SD), ITERATIONS=10:
>>
>> Page    nr_dma_chan=1               nr_dma_chan=4               nr_dma_chan=8               nr_dma_chan=16
>> order   dcbm          dcbm_sg       dcbm          dcbm_sg       dcbm          dcbm_sg       dcbm          dcbm_sg
>> ------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
>> 0        2.33 ± 0.17   2.26 ± 0.19   3.24 ± 0.21   3.18 ± 0.23   3.29 ± 0.10   3.45 ± 0.10   3.29 ± 0.13   3.49 ± 0.22
>> 4        2.77 ± 0.21   2.99 ± 0.18   6.26 ± 0.99   6.75 ± 0.12   8.01 ± 0.58   7.70 ± 0.64   8.22 ± 0.89   8.72 ± 0.87
>> 8        4.57 ± 0.70   4.75 ± 0.83  10.64 ± 1.97  10.94 ± 3.52  10.30 ± 1.22  10.36 ± 1.24  11.27 ± 1.21  12.47 ± 1.66
>> 9       12.71 ± 0.09  12.68 ± 0.08  27.13 ± 0.15  26.89 ± 0.27  46.50 ± 0.73  45.17 ± 2.46  67.25 ± 1.42  62.78 ± 8.24
>>
>> Notes: order 0/4/8/9 = 4K / 64K / 1M / 2M folios
>> dcbm = per-segment dmaengine_prep_dma_memcpy
>> dcbm_sg = DMA_MEMCPY_SG / dmaengine_prep_dma_memcpy_sg
>>
>> <snip>
>>
>>> +
>>> +static int submit_dma_transfers(struct dma_work *work)
>>> +{
>>> +	struct scatterlist *sg_src, *sg_dst;
>>> +	struct dma_async_tx_descriptor *tx;
>>> +	unsigned long flags = DMA_CTRL_ACK;
>>> +	dma_cookie_t cookie;
>>> +	int i;
>>> +
>>> +	atomic_set(&work->pending, 1);
>>> +
>>> +	sg_src = work->src_sgt->sgl;
>>> +	sg_dst = work->dst_sgt->sgl;
>>> +	for_each_sgtable_dma_sg(work->src_sgt, sg_src, i) {
>>> +		if (i == work->src_sgt->nents - 1)
>>> +			flags |= DMA_PREP_INTERRUPT;
>>> +
>>> +		tx = dmaengine_prep_dma_memcpy(work->chan,
>>> +					       sg_dma_address(sg_dst),
>>> +					       sg_dma_address(sg_src),
>>> +					       sg_dma_len(sg_src), flags);
>>> +		if (!tx) {
>>> +			atomic_set(&work->pending, 0);
>>> +			return -EIO;
>>> +		}
>>> +
>>> +		if (i == work->src_sgt->nents - 1) {
>>> +			tx->callback = dma_completion_callback;
>>> +			tx->callback_param = work;
>>> +		}
>>> +
>>> +		cookie = dmaengine_submit(tx);
>>> +		if (dma_submit_error(cookie)) {
>>> +			atomic_set(&work->pending, 0);
>>> +			return -EIO;
>>> +		}
>>> +		sg_dst = sg_next(sg_dst);
>>> +	}
>>> +	return 0;
>>> +}
>>
>> static int submit_dma_transfers(struct dma_work *work)
>> {
>> 	struct dma_async_tx_descriptor *tx;
>> 	unsigned long flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>> 	dma_cookie_t cookie;
>>
>> 	tx = dmaengine_prep_dma_memcpy_sg(work->chan,
>> 					  work->dst_sgt->sgl, work->dst_sgt->nents,
>> 					  work->src_sgt->sgl, work->src_sgt->nents,
>> 					  flags);
>> 	if (!tx)
>> 		return -EIO;
>>
>> 	atomic_set(&work->pending, 1);
>> 	tx->callback = dma_completion_callback;
>> 	tx->callback_param = work;
>>
>> 	cookie = dmaengine_submit(tx);
>> 	if (dma_submit_error(cookie)) {
>> 		atomic_set(&work->pending, 0);
>> 		return -EIO;
>> 	}
>> 	return 0;
>> }
>>
>> The memcpy_sg version does simplify submit_dma_transfers()
>> (one dmaengine_prep_dma_memcpy_sg + one dmaengine_submit vs a loop).
>
> Right
>
>>
>> My current DCBM path issues dmaengine_prep_dma_memcpy()+dmaengine_submit()
>> per mapped SG segment and sets DMA_PREP_INTERRUPT + callback only
>> on the last one, so the IRQ/callback cost is already one per batch.
>>
>> My understanding is switching to dmaengine_prep_dma_memcpy_sg() mainly
>> saves the per-segment prep/submit calls and hands the provider a single
>> multi-segment TX to program.
>
> Right, but the analysis you showed indicated the dma setup cost was
> quite a bit, this moving away from N transfers to single one should have
> saved a bit more...
>
>>
>> Please correct me if the benefit you had in mind is something stronger.
>> Thanks for the suggestion and for guidance.
>
> I still feel this looks better version...
> Can you compare your setup time between the two please

I wrote a small dmaengine bench module to isolate the prep/setup overhead from the full migration path.

prep_memcpy: loop of dmaengine_prep_dma_memcpy(), one descriptor per SG entry, with a single completion callback on the last tx (the same pattern my driver currently uses).
prep_memcpy_sg: one dmaengine_prep_dma_memcpy_sg() per batch, so the provider walks the mapped src/dst SGLs (the proposed pattern).

Each phase (prep / submit / issue / wait) is instrumented with ktime_get().
Happy to share the module and the runner script if useful.

Workload: copy 512 MB per channel, 20 runs per cell, src_nid=0, dst_nid=1, folio sizes 4KB and 2MB, batch = 512 SG entries.
*_ms columns are thread time summed across channels (for chan=16, divide by 16 for per-channel time).
run_ms is the wall time to copy the 512 MB.
prep_calls: total number of dmaengine_prep_dma_memcpy{,_sg}() calls (512x fewer for memcpy_sg).

mode            chan  folio  sge          run_ms         prep_ms     submit_ms     issue_ms           wait_ms  prep_calls
prep_memcpy        1    4KB  512   632.86 ± 8.18    18.00 ± 6.38   4.44 ± 0.09  0.09 ± 0.04     603.54 ± 5.03  131072 (= 512MB/4KB)
prep_memcpy_sg     1    4KB  512  611.34 ± 13.52     0.74 ± 0.33   0.01 ± 0.00  0.08 ± 0.00    610.48 ± 13.68  256 (= prep_memcpy calls / 512)

prep_memcpy       16    4KB  512  675.70 ± 14.13  416.19 ± 27.49  79.19 ± 2.27  1.53 ± 0.12  9590.11 ± 206.81  2097152
prep_memcpy_sg    16    4KB  512  615.43 ± 11.55    19.61 ± 3.38   0.17 ± 0.03  1.55 ± 0.16  9202.33 ± 138.41  4096

prep_memcpy        1    2MB  512    77.19 ± 0.15     0.04 ± 0.02   0.02 ± 0.00  0.00 ± 0.00      77.10 ± 0.15  512
prep_memcpy_sg     1    2MB  512    77.21 ± 0.11     0.00 ± 0.00   0.00 ± 0.00  0.00 ± 0.00      77.21 ± 0.11  1

prep_memcpy       16    2MB  512   186.01 ± 0.40     2.31 ± 0.17   0.32 ± 0.03  0.00 ± 0.00    2712.56 ± 4.24  8192
prep_memcpy_sg    16    2MB  512   185.63 ± 0.37     0.09 ± 0.02   0.00 ± 0.00  0.00 ± 0.00    2711.20 ± 3.75  16

dmaengine_prep_dma_memcpy_sg() is a clear win on setup cost (fewer preps, fewer submits, no per-tx callback bookkeeping).
However, the end-to-end throughput gain measured earlier was modest because migration path cost and per-descriptor
execution time (wait_ms) dominate.

Thanks,
Shivank