Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer

From: Matthew Brost

Date: Wed Mar 18 2026 - 18:41:32 EST


On Tue, Mar 17, 2026 at 09:43:20PM +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> Just a few drive-by comments.
>
> On Tue, 17 Mar 2026 11:14:36 -0700
> Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
>
> > > > > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > > > > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > > > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > > > > expiry for device teardown.
> > > > >
> > > > > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > > > > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > > > > dma_fence callback). Dependency cleanup is deferred to process context
> > > > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> > > > >
> > > > > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > > > > bail immediately if the queue refcount has already reached zero and
> > > > > async teardown is in flight, preventing use-after-free.
> > > >
> > > > In rust, when you queue work, you have to pass a reference-counted pointer
> > > > (Arc<T>). We simply never have this problem in a Rust design. If there is work
> > > > queued, the queue is alive.
> > > >
> > > > By the way, why can’t we simply require synchronous teardowns?
> >
> > Consider the case where the DRM dep queue’s refcount drops to zero, but
> > the device firmware still holds references to the associated queue.
> > These are resources that must be torn down asynchronously. In Xe, I need
> > to send two asynchronous firmware commands before I can safely remove
> > the memory associated with the queue (faulting on this kind of global
> > memory will take down the device) and recycle the firmware ID tied to
> > the queue. These async commands are issued on the driver side, on the
> > DRM dep queue’s workqueue as well.
>
> Asynchronous teardown is okay, but I'm not too sure using the refcnt to
> know that the queue is no longer usable is the way to go. To me the
> refcnt is what determines when the SW object is no longer referenced by
> any other item in the code, and a work item acting on the queue counts
> as one owner of this queue. If you want to cancel the work in order to
> speed up the destruction of the queue, you can call
> {cancel,disable}_work[_sync](), and have the ref dropped if the
> cancel/disable was effective. Multi-step teardown is also an option,
> but again, the state of the queue shouldn't be determined from its
> refcnt IMHO.
>
> >
> > Now consider a scenario where something goes wrong and those firmware
> > commands never complete, and a device reset is required to recover. The
> > driver’s per-queue tracking logic stops all queues (including zombie
> > ones), determines which commands were lost, cleans up the side effects
> > of that lost state, and then restarts all queues. That is how we would
> > end up in this work item with a zombie queue. The restart logic could
> > probably be made smart enough to avoid queueing work for zombie queues,
> > but in my opinion it’s safe enough to use kref_get_unless_zero() in the
> > work items.
>
> Well, that only works for single-step teardown, or when you enter the
> last step. At which point, I'm not too sure it's significantly better
> than encoding the state of the queue through a separate field, and have
> the job queue logic reject new jobs if the queue is no longer usable
> (shouldn't even be exposed to userland at this point though).
>

'shouldn't even be exposed to userland at this point though' - Yes.

The philosophy of the refcounting design is roughly:

- When a queue is created by userland, call drm_dep_queue_init().
- All jobs hold a ref to the drm_dep_queue.
- When userland closes the queue, remove it from the FD, call
drm_dep_queue_put() + initiate teardown (I'd recommend just setting the
TDR to fire immediately, kicking the queue off the device in the first
fire + signaling all fences).
- When the queue refcount goes to zero, optionally implement
drm_dep_queue_ops.fini to keep the drm_dep_queue (and the object it is
embedded in) around a bit longer if additional firmware / device-side
resources are still outstanding, and call drm_dep_queue_fini() when
that part completes. If drm_dep_queue_ops.fini isn't implemented, the
core implementation just calls drm_dep_queue_fini().
- A work item releases the drm_dep_queue outside the dma-fence
signaling path so memory can be released safely (e.g., while taking
dma-resv locks).


> >
> > It should also be clear that a DRM dep queue is primarily intended to be
> > embedded inside the driver’s own queue object, even though it is valid
> > to use it as a standalone object. The async teardown flows are also
> > optional features.
> >
> > Let’s also consider a case where you do not need the async firmware
> > flows described above, but the DRM dep queue is still embedded in a
> > driver-side object that owns memory via dma-resv. The final queue put
> > may occur in IRQ context (DRM dep avoids kicking a worker just to drop a
> > refi as opt in), or in the reclaim path (any scheduler workqueue is the
> > reclaim path). In either case, you cannot free memory there taking a
> > dma-resv lock, which is why all DRM dep queues ultimately free their
> > resources in a work item outside of reclaim. Many drivers already follow
> > this pattern, but in DRM dep this behavior is built-in.
>
> I agree deferred cleanup is the way to go.
>

+1. Yes. I've spotted a bunch of drivers that open-code this part on the
driver side, Xe included.

> >
> > So I don’t think Rust natively solves these types of problems, although
> > I’ll concede that it does make refcounting a bit more sane.
>
> Rust won't magically defer the cleanup, nor will it dictate how you want
> to do the queue teardown, those are things you need to implement. But it
> should give visibility about object lifetimes, and guarantee that an
> object that's still visible to some owners is usable (the notion of
> usable is highly dependent on the object implementation).
>
> Just a purely theoretical example of a multi-step queue teardown that
> might be possible to encode in rust:
>
> - MyJobQueue<Usable>: The job queue is currently exposed and usable.
> There's a ::destroy() method consuming 'self' and returning a
> MyJobQueue<Destroyed> object
> - MyJobQueue<Destroyed>: The user asked for the workqueue to be
> destroyed. No new job can be pushed. Existing jobs that didn't make
> it to the FW queue are cancelled, jobs that are in-flight are
> cancelled if they can, or are just waited upon if they can't. When
> the whole destruction step is done, ::destroyed() is called, it
> consumes 'self' and returns a MyJobQueue<Inactive> object.
> - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
> any resources on this queue). It's ready to be cleaned up.
> ::cleanup() (or just ::drop()) defers the cleanup of some inner
> object that has been passed around between the various
> MyJobQueue<State> wrappers.
>
> Each of the state transition can happen asynchronously. A state
> transition consumes the object in one state, and returns a new object
> in its new state. None of the transition involves dropping a refcnt,
> ownership is just transferred. The final MyJobQueue<Inactive> object is
> the object we'll defer cleanup on.
>
> It's a very high-level view of one way this can be implemented (I'm
> sure there are others, probably better than my suggestion) in order to
> make sure the object doesn't go away without the compiler enforcing
> proper state transitions.
>

I'm sure Rust can implement this. My point about Rust is that it doesn't
magically solve hard software architecture problems, but I will admit
the ownership model and the way it can enforce locking at compile time
are pretty cool.

> > > > > +/**
> > > > > + * DOC: DRM dependency fence
> > > > > + *
> > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > > + * completes the job.
> > > > > + *
> > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > > + * NULL or an error).
> > > >
> > > > I thought this fence proxy mechanism was going away due to recent work being
> > > > carried out by Christian?
> > > >
> >
> > Consider the case where a driver’s hardware fence is implemented as a
> > dma-fence-array or dma-fence-chain. You cannot install these types of
> > fences into a dma-resv or into syncobjs, so a proxy fence is useful
> > here.
>
> Hm, so that's a driver returning a dma_fence_array/chain through
> ::run_job()? Why would we not want to have them directly exposed and
> split up into singular fence objects at resv insertion time (I don't
> think syncobjs care, but I might be wrong). I mean, one of the point

You can stick dma-fence-arrays in syncobjs, but not chains.

Neither dma-fence-arrays nor chains can go into a dma-resv.

Hence why disconnecting a job's finished fence from the hardware fence
is, IMO, a good idea to keep: it gives drivers flexibility in their
hardware fences. e.g., if this design didn't have a job's finished
fence, I'd have to open-code one on the Xe side.

> behind the container extraction is so fences coming from the same
> context/timeline can be detected and merged. If you insert the
> container through a proxy, you're defeating the whole fence merging
> optimization.

Right. Finished fences have a single timeline too...

>
> The second thing is that I'm not sure drivers were ever supposed to
> return fence containers in the first place, because the whole idea
> behind a fence context is that fences are emitted/signalled in
> seqno-order, and if the fence is encoding the state of multiple
> timelines that progress at their own pace, it becomes tricky to control
> that. I guess if it's always the same set of timelines that are
> combined, that would work.

Xe does this and it definitely works. We submit to multiple rings; when
all rings signal a seqno, a chain or array signals -> the finished fence
signals. The queues used in this manner can only submit multi-ring
jobs, so the finished fence timeline stays intact. If you could mix a
multi-ring submission with a single-ring submission on the same queue,
yes, this could break.

>
> > One example is when a single job submits work to multiple rings
> > that are flipped in hardware at the same time.
>
> We do have that in Panthor, but that's all explicit: in a single
> SUBMIT, you can have multiple jobs targeting different queues, each of
> them having their own set of deps/signal ops. The combination of all the
> signal ops into a container is left to the UMD. It could be automated
> kernel side, but that would be a flag on the SIGNAL op leading to the
> creation of a fence_array containing fences from multiple submitted
> jobs, rather than the driver combining stuff in the fence it returns in
> ::run_job().

See above. We have a dedicated queue type for these types of submissions
and a single job that submits to all the rings. We had a multiple queue /
multiple job scheme in i915 to implement this, but it turns out it is
much cleaner with a single queue / single job / multiple rings model.

>
> >
> > Another case is late arming of hardware fences in run_job (which many
> > drivers do). The proxy fence is immediately available at arm time and
> > can be installed into dma-resv or syncobjs even though the actual
> > hardware fence is not yet available. I think most drivers could be
> > refactored to make the hardware fence immediately available at run_job,
> > though.
>
> Yep, I also think we can arm the driver fence early in the case of
> JobQueue. The reason it couldn't be done before is because the
> scheduler was in the middle, deciding which entity to pull the next job
> from, which was changing the seqno a job driver-fence would be assigned
> (you can't guess that at queue time in that case).
>

Xe doesn't need late arming, but it looks like multiple drivers
implement late arming, which may be required (?).

> [...]
>
> > > > > + * **Reference counting**
> > > > > + *
> > > > > + * Jobs and queues are both reference counted.
> > > > > + *
> > > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > > + * job that was submitted to it.
> > > > > + *
> > > > > + * The queue holds its own reference to a job for as long as the job is
> > > > > + * internally tracked: from the moment the job is added to the pending list
> > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > > + * worker, which calls drm_dep_job_put() to release that reference.
> > > >
> > > > Why not simply keep track that the job was completed, instead of relinquishing
> > > > the reference? We can then release the reference once the job is cleaned up
> > > > (by the queue, using a worker) in process context.
> >
> > I think that’s what I’m doing, while also allowing an opt-in path to
> > drop the job reference when it signals (in IRQ context)
>
> Did you mean in !IRQ (or !atomic) context here? Feels weird to not
> defer the cleanup when you're in an IRQ/atomic context, but defer it
> when you're in a thread context.
>

The put of a job in this design can happen from an IRQ context (as an
opt-in feature). xa_destroy() blows up if it is called from an IRQ
context, although maybe that could be worked around.

> > so we avoid
> > switching to a work item just to drop a ref. That seems like a
> > significant win in terms of CPU cycles.
>
> Well, the cleanup path is probably not where latency matters the most.

Agree. But I do think avoiding a CPU context switch (work item) for a
very lightweight job cleanup (usually just dropping refs) will save CPU
cycles, and thus also things like power, etc...

> It's adding scheduling overhead, sure, but given all the stuff we defer
> already, I'm not too sure we're at saving a few cycles to get the
> cleanup done immediately. What's important to have is a way to signal
> fences in an atomic context, because this has an impact on latency.
>

Yes. The signaling happens first, then drm_dep_job_put() if the IRQ
opt-in is set.

> [...]
>
> > > > > + /*
> > > > > + * Drop all input dependency fences now, in process context, before the
> > > > > + * final job put. Once the job is on the pending list its last reference
> > > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > > + * xa_destroy() would be unsafe.
> > > > > + */
> > > >
> > > > I assume that “pending” is the list of jobs that have been handed to the driver
> > > > via ops->run_job()?
> > > >
> > > > Can’t this problem be solved by not doing anything inside a dma_fence callback
> > > > other than scheduling the queue worker?
> > > >
> >
> > Yes, this code is required to support dropping job refs directly in the
> > dma-fence callback (an opt-in feature). Again, this seems like a
> > significant win in terms of CPU cycles, although I haven’t collected
> > data yet.
>
> If it significantly hurts the perf, I'd like to understand why, because
> to me it looks like pure-cleanup (no signaling involved), and thus no
> other process waiting for us to do the cleanup. The only thing that
> might have an impact is how fast you release the resources, and given
> it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
> like to understand which part of the immediate cleanup is causing a
> contention (basically which kind of resources the system is starving of)
>

It was more that once we moved to a refcounted model, it is pretty
trivial to allow drm_dep_job_put() while the fence is signaling. It
doesn't really add any complexity either, which is why I added it.

Matt

> Regards,
>
> Boris