Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer

From: Matthew Brost

Date: Tue Mar 24 2026 - 12:14:43 EST


On Tue, Mar 24, 2026 at 10:23:45AM +0100, Boris Brezillon wrote:
> On Mon, 23 Mar 2026 11:38:06 -0700
> Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
>
> >
> > Ok, getting stats is easier than I thought...
> >
> > ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads --r threads-basic
> >
> > This test creates one thread per engine instance (7 instances on this
> > BMG device) and submits 1k exec IOCTLs per thread, each performing a DW
> > write. Each exec IOCTL typically has no unsignaled input dependencies.
> >
> > With job put from IRQ context off + no bypass (drm_dep_queue_flags = 0):
> >
> > 8,449 context-switches
> > 412 cpu-migrations
> > 2,531.43 msec task-clock
> > 1,847,846,588 cpu_atom/cycles/
> > 1,847,856,947 cpu_core/cycles/
> > <not supported> cpu_atom/instructions/
> > 460,744,020 cpu_core/instructions/
> >
> > With job put from IRQ context off + bypass (drm_dep_queue_flags =
> > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
> >
> > 8,655 context-switches
> > 229 cpu-migrations
> > 2,571.33 msec task-clock
> > 855,900,607 cpu_atom/cycles/
> > 855,900,272 cpu_core/cycles/
> > <not supported> cpu_atom/instructions/
> > 403,651,469 cpu_core/instructions/
> >
> > With job put from IRQ context on + bypass (drm_dep_queue_flags =
> > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> > DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
> >
> > 5,361 context-switches
> > 169 cpu-migrations
> > 2,577.44 msec task-clock
> > 685,769,153 cpu_atom/cycles/
> > 685,768,407 cpu_core/cycles/
> > <not supported> cpu_atom/instructions/
> > 321,336,297 cpu_core/instructions/
>
> Thanks for sharing those numbers. For completeness, can you also add the
> "job put from IRQ context on + no bypass" case?
>

Yes, and I'll share a DRM sched baseline too. I also figured out that
power can be measured - initial results confirm what I expected: less
power.
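
On the power side, FWIW, the RAPL energy counters exposed through perf
are one way to get it - something like this (event names depend on the
platform's RAPL support, so treat this as a sketch rather than my exact
invocation):

    perf stat -a -e power/energy-pkg/,power/energy-cores/ \
        build/tests/xe_exec_threads --r threads-basic

perf reports accumulated Joules, and since the runs above take roughly
the same wall time, comparing Joules across configurations works as an
average-power comparison.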

I'm putting together a doc based on running glxgears and another
benchmark on top of Ubuntu 24.10 + Wayland, which has explicit sync
(linux-drm-syncobj; it behaves like SurfaceFlinger when the rendering
flag is set to not pass fences in to draw jobs).

Almost have all the data. Will share here once I have it.

> I'm a bit surprised by the difference in the number of context
> switches, given I'd expect the local CPU to be picked first, and so
> queuing work items on the same wq from another work item to be almost
> free in terms of scheduling. But I guess there's some load-balancing
> happening when you execute jobs at such a high rate.
>
> Also, I don't know if that's just noise or if it's reproducible, but
> task-clock seems to be ~40msec lower with the deferred cleanup and
> no-bypass (higher throughput because you're not blocking the dequeuing
> of the next job on the cleanup of the previous one, I suspect).

I think that is just noise from whatever the test is doing in user
space - it bounces around a bit.
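
If it helps separate signal from noise, perf can also repeat the run and
report mean +/- stddev directly - something like this (the -r option is
standard perf; I haven't rerun the numbers above this way):

    perf stat -r 5 -a -e context-switches,cpu-migrations,task-clock \
        build/tests/xe_exec_threads --r threads-basic

That should show whether the ~40msec task-clock delta is within
run-to-run variance.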

Matt
