Re: drm_sched run_job and scheduling latency
From: Tvrtko Ursulin
Date: Fri Mar 27 2026 - 05:24:56 EST
On 05/03/2026 09:40, Boris Brezillon wrote:
Hi Tvrtko,
On Thu, 5 Mar 2026 08:35:33 +0000
Tvrtko Ursulin <tursulin@xxxxxxxxxxx> wrote:
On 04/03/2026 22:51, Chia-I Wu wrote:
Hi,
Our system compositor (surfaceflinger on android) submits gpu jobs
from a SCHED_FIFO thread to an RT gpu queue. However, because
workqueue threads are SCHED_NORMAL, the scheduling latency from submit
to run_job can sometimes cause frame misses. We are seeing this on
panthor and xe, but the issue should be common to all drm_sched users.
Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and won't
meet future android requirements). It seems either workqueue needs to
gain RT support, or drm_sched needs to support kthread_worker.
I know drm_sched switched from kthread_worker to workqueue for better
From a plain kthread actually.
Oops, sorry, I hadn't seen your reply before posting mine. I basically
said the same.
Anyway, I suggested trying the
kthread_worker approach a few times in the past but never got round
to implementing it. Not dual paths but simply replacing the workqueues
with kthread_workers.
What is your thinking regarding how would the priority be configured? In
terms of the default and mechanism to select a higher priority
scheduling class.
If we follow the same model that exists today, where the
workqueue can be passed at drm_sched_init() time, it becomes the
driver's responsibility to create a worker of its own with the right
prio set (using sched_setscheduler()). There's still the case where the
worker is NULL, in which case the drm_sched code can probably create
its own worker and leave it with the default prio, just as it did
before the transition to workqueues.
It's a whole different story if you want to deal with worker pools and
do some load balancing though...
I prototyped this in xe in the meantime and it is looking plausible that latency can be significantly reduced.
First to say that I did not go as far as worker pools, because at the moment I don't see a use case for it. At least not for xe.
When 1:1 entity-to-scheduler drivers appeared, kthreads were undesirable simply because they resulted in an effectively unbounded number of kernel threads. There was no benefit to that, only downsides. Workqueues were good since they manage the thread pool under the hood, but that is just a handy coincidence; the design still fails to express the optimal number of CPU threads required to feed a GPU engine. For example with xe, on a 4096-CPU machine with 4096 user contexts feeding the same GPU engine, the optimal number of CPU threads to feed it is really more like one, rather than however many the wq machinery decided to run in parallel. They all end up hammering on the same lock to let the firmware know there is something to schedule.
For this reason in my prototype I create kthread_worker per hardware execution engine. (For xe even that could potentially be too much, maybe I should even try one kthread_worker per GuC CT.)
This creates a requirement for 1:1 drivers to not use the "worker" auto-create mode of the DRM scheduler so TBD if that is okay.
Anyway, onto the numbers. Well, actually, first onto a benchmark I hacked up.. I took xe_blt from IGT and modified it heavily to be more reasonable. What it essentially does is emit a constant stream of synchronous blit operations and measure the variance of the time each took to complete, as observed by the submitting process. In parallel it spawns a number of CPU hog threads to oversubscribe the system. And it can run the submitting thread at either normal priority, re-niced to -1, or at SCHED_FIFO. This is to simulate a typical compositor use case.
Now onto the numbers.
                        normal   nice    FIFO
wq                       100%     76%    1%
kthread_worker           100%     73%    1.2%
 └─relative to wq:       50.5%   48.5%   58.9%
Median "jitter" (variance in observed job submissions) is normalised and shows how changing the CPU priority changes the jitter observed by the submission thread. The first two rows are the current wq implementation and the kthread_worker conversion. They show roughly similar scaling.
The third row is the kthread_worker results normalised against wq, which shows roughly half the jitter. So a meaningful improvement.
Then I went a step further, to better address the problem Chia-I analysed by solving the priority inversion. That is, to loosely track the CPU priorities of the currently active entities submitting to each scheduler (and in turn kthread_worker). This further improved the latency numbers for the SCHED_FIFO case, albeit with a strange anomaly with re-nice which I will come to later. It looks like this:
                        normal   nice    FIFO
kworker_follow_prio      100%    277%    0.66%
 └─relative to wq:        60%    222%    37.8%
This effectively means that with a SCHED_FIFO compositor the submission round-trip latency could be around a third of what the current scheduler can do.
Now the re-nice anomaly.. This is something I have yet to investigate. The issue may be related to what I said about the kthread_workers loosely tracking the submission thread priority. Loosely meaning that if they detect a negative nice they do not follow the exact nice level but go to minimum nice, while my test program was using the mildest negative nice level (-19 vs -1, respectively). Perhaps that causes some strange effect in the CPU scheduler. I do not know yet, but it is very interesting that it appears repeatable.
It is also important to view my numbers with some margin of error. I have tried to remove the effects of intel_pstate, CPU turbo and thermal management to a large extent, but I do not think I have fully succeeded yet. My gut feeling is there may be a +/- 5% or so in the results.
Also important to say is that the prototype depends on my other DRM scheduler series (the fair scheduler one), since I needed the nicer sched_rq abstraction with better tracking of active entities to implement priority inheritance. So I am unlikely to post it all as an RFC, since Philipp would possibly get a heart attack if I did. :)
To close, I think this is interesting to explore further; we could look at converting panthor next and then run more experiments.
Regards,
Tvrtko