Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope

From: Breno Leitao

Date: Tue Mar 17 2026 - 07:36:30 EST


Hello Tejun,

On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
> Hello,
>
> Applied 1/5. Some comments on the rest:
>
> - The sharding currently splits on CPU boundary, which can split SMT
> siblings across different pods. The worse performance on Intel compared
> to SMT scope may be indicating exactly this - HT siblings ending up in
> different pods. It'd be better to shard on core boundary so that SMT
> siblings always stay together.

Thank you for the insight. I'll modify the sharding to operate at the
core boundary rather than at the SMT/thread level to ensure sibling CPUs
remain in the same pod.

> - How was the default shard size of 8 picked? There's a tradeoff
> between the number of kworkers created and locality. Can you also
> report the number of kworkers for each configuration? And is there
> data on different shard sizes? It'd be useful to see how the numbers
> change across e.g. 4, 8, 16, 32.

The default shard size of 8 was somewhat arbitrary; it was chosen mainly
to generate initial data points.

I'll run tests with different shard sizes and report the results.

I'm currently working on finding a suitable workload with low run-to-run
noise. Testing on real NVMe devices shows significant jitter that makes
the analysis difficult. I've also been experimenting with null_blk, but
haven't had much success yet.

If you have any suggestions for a reliable workload or benchmark, I'd
appreciate your input.

> - Can you also test on AMD machines? Their CCD topology (16 or 32
> threads per LLC) would be a good data point.

Absolutely, I'll test on AMD machines as well.

Thanks,
--breno