Re: [PATCH] sched: Further restrict the preemption modes

From: Sebastian Andrzej Siewior

Date: Fri Jun 05 2026 - 06:56:27 EST

On 2026-06-05 11:43:24 [+0100], Ciunas Bennett wrote:
…
> Quick refresh:
> Workload: uperf sending TCP data between two VMs (client and server), each configured with a single vhost queue (the minimum number of vhost queues, for testing)
> Issue: With lazy preemption as the default preemption mode (previously full preemption), there is a significant drop in performance for this workload
>
> Simplification of the issue
> We have two tasks:
>
> TaskA produces data
> TaskB consumes the data produced by TaskA
>
> Notification path: TaskA informs TaskB that new data is available by
> adding a new item to a workqueue. This triggers a kworker which runs
> and notifies TaskB.
>
> Issue
> TaskA is configured to use schedule_work(). Internally, schedule_work() uses system_percpu_wq, which is configured as:
> <WQ_PERCPU = 1 << 8, /* bound to a specific cpu */>
>
> This means the work item will be executed by a kworker on the same CPU that queued the work.
> If the task that queues the work (TaskA) is a long-running task with
> limited opportunities to call schedule(), then the kworker may be
> delayed significantly before it gets CPU time.

There is some work done by Marco to rework the API so that callers state
explicitly whether a per-CPU workqueue is mandatory _or_ whether a
CPU-unbound workqueue can be used instead (rather than calling
schedule_work() without knowing the implications).

> In our scenario:
>
> TaskA continuously produces data
> There is no dependency requiring TaskA to yield due to TaskB
> As a result, TaskA can occupy the CPU for an entire tick before being preempted by the kworker
>
> Observed behavior
> This is exactly what we observe in practice:
>
> TaskB corresponds to the VM consuming data generated by our vhost task
> When running uperf, this behavior leads to a significant drop in throughput (Gb/s)
> The VM is unable to consume data in a timely manner
> When it is finally notified of new data, the delayed signaling introduces jitter
> This causes TCP issues, including retransmissions and out-of-order packets
>
> Results:
> |--------------+------+------------------+--------------------------|
> | preempt mode | Gb/s | workqueue pool   | kworker avg latency (ms) |
> |--------------+------+------------------+--------------------------|
> | full         | ~50  | system_percpu_wq | 0.002                    |
> | lazy         | ~13  | system_percpu_wq | 0.721                    |
> | lazy         | ~50  | system_dfl_wq    | 0.005                    |
> |--------------+------+------------------+--------------------------|
>
> I did some more testing: with a different workqueue, system_dfl_wq, throughput is good again, as the results table shows.
> Since that kworker is not bound to a specific CPU, the scheduler is free to select a more suitable CPU for execution.
>
> /* system_dfl_wq is unbound workqueue. Workers are not bound to
> * any specific CPU, not concurrency managed, and all queued works are
> * executed immediately as long as max_active limit is not reached and
> * resources are available. */
>
> Given this understanding, what would be the best approach here? Should
> we consider changing the workqueue usage in the KVM code, or do you
> see an alternative way to address this issue?

It seems that using an unbound worker would avoid the problem at hand,
correct?
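
To make the delay under discussion concrete, here is a toy user-space
model (not kernel code). It assumes, as described in the quoted text,
that a per-CPU kworker queued behind a producer that never yields only
gets to run at the next tick, while an unbound kworker can run on
another, idle CPU after a fixed wakeup cost. All function names, the
tick length and the wakeup cost are hypothetical and chosen only for
illustration; the outputs are model values, not measurements.

```c
/*
 * Toy model of kworker start latency. Hypothetical helpers, not kernel
 * API. All times are in microseconds.
 */

/*
 * Per-CPU (bound) kworker: the producer keeps the CPU until the next
 * tick boundary, so the kworker waits for the remainder of the tick.
 */
static unsigned long bound_delay_us(unsigned long tick_us,
				    unsigned long queued_at_us)
{
	return tick_us - (queued_at_us % tick_us);
}

/*
 * Unbound kworker: assumed to start on some other, idle CPU after a
 * fixed wakeup cost, independent of what the producer is doing.
 */
static unsigned long unbound_delay_us(unsigned long wakeup_cost_us)
{
	return wakeup_cost_us;
}

/*
 * Average bound delay over `samples` queueing times spread evenly
 * across one tick.
 */
static double avg_bound_delay_us(unsigned long tick_us,
				 unsigned long samples)
{
	unsigned long sum = 0;
	unsigned long i;

	for (i = 0; i < samples; i++)
		sum += bound_delay_us(tick_us, i * tick_us / samples);

	return (double)sum / (double)samples;
}
```

With an assumed 1000 us tick and work queued 250 us into it,
bound_delay_us(1000, 250) gives 750, whereas unbound_delay_us() returns
only the assumed wakeup cost, regardless of where in the tick the work
was queued.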

Sebastian