[PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices
From: Tal Zussman
Date: Wed Mar 25 2026 - 15:46:44 EST
Add support for using RWF_DONTCACHE with block devices.
Dropbehind pruning needs to be done in non-IRQ context, but block
devices complete writeback in IRQ context.
To fix this, we can defer dropbehind invalidation to task context. We
introduce a new BIO_COMPLETE_IN_TASK flag that allows the bio submitter
to request task-context completion of bi_end_io. When bio_endio() sees
this flag in non-task context, it queues the bio to a per-CPU list and
schedules a work item to do bio completion.
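To make the control flow above concrete, here is a small userspace model of the deferral. It is only a sketch, not the code from patch 1: every name in it (model_bio, MODEL_COMPLETE_IN_TASK, model_cpu_queue, and so on) is hypothetical, a plain linked list stands in for the per-CPU list, and a boolean stands in for the scheduled work item.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_COMPLETE_IN_TASK 0x1u

struct model_bio {
	unsigned int flags;
	void (*end_io)(struct model_bio *bio);
	struct model_bio *next;		/* link in the deferred list */
	bool completed;
};

/* Stand-in for one CPU's deferred list plus its work item. */
struct model_cpu_queue {
	struct model_bio *head;
	struct model_bio *tail;
	bool work_scheduled;
};

/* Example completion handler: just records that it ran. */
static void model_end_io(struct model_bio *bio)
{
	bio->completed = true;
}

/*
 * Models the completion entry point. in_task tells the model whether the
 * caller is running in task context (true) or IRQ-like context (false).
 */
static void model_bio_endio(struct model_cpu_queue *q, struct model_bio *bio,
			    bool in_task)
{
	assert(bio->end_io != NULL);

	if ((bio->flags & MODEL_COMPLETE_IN_TASK) && !in_task) {
		/* Defer: append to the list and mark the work as scheduled. */
		bio->next = NULL;
		if (q->tail)
			q->tail->next = bio;
		else
			q->head = bio;
		q->tail = bio;
		q->work_scheduled = true;
		return;
	}

	/* Flag not set, or already in task context: complete inline. */
	bio->end_io(bio);
}

/*
 * Models the work function, which runs in task context: detach the list
 * and complete each bio. Returns the number of bios completed.
 */
static int model_run_work(struct model_cpu_queue *q)
{
	struct model_bio *bio = q->head;
	int completed = 0;

	q->head = NULL;
	q->tail = NULL;
	q->work_scheduled = false;

	while (bio) {
		struct model_bio *next = bio->next;

		bio->end_io(bio);
		completed++;
		bio = next;
	}
	return completed;
}
```

In this model, a flagged bio completed with in_task == false stays pending until model_run_work() is called, while an unflagged bio completes immediately. The real work function additionally has to deal with locking, CPU hotplug and rescheduling, none of which is modelled here.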
Patch 1 adds the BIO_COMPLETE_IN_TASK infrastructure in the block
layer.
Patch 2 wires BIO_COMPLETE_IN_TASK into iomap writeback for DONTCACHE
folios and removes the DONTCACHE workqueue deferral from XFS.
Patch 3 enables RWF_DONTCACHE for block devices, setting
BIO_COMPLETE_IN_TASK in submit_bh_wbc() for the CONFIG_BUFFER_HEAD
path.
This support is useful for databases that operate on raw block devices,
among other userspace applications.
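For illustration, a userspace application might issue an uncached buffered write roughly as follows. This is a minimal sketch: write_dontcache() is a hypothetical helper, not part of this series, and the fallback define assumes the RWF_DONTCACHE value from the upstream uapi header for toolchains whose headers predate the flag. The helper retries without the flag when the kernel or the file does not support it.

```c
#define _GNU_SOURCE		/* for pwritev2() */
#include <assert.h>		/* the four headers below are for the usage note */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_DONTCACHE
/* Assumed to match include/uapi/linux/fs.h on kernels that have the flag. */
#define RWF_DONTCACHE 0x00000080
#endif

/*
 * Hypothetical helper: write len bytes from buf at offset off, asking the
 * kernel to drop the written pages from the page cache once writeback
 * completes. If the flag is rejected, fall back to a normal buffered write.
 * *used_dontcache (if non-NULL) reports which path succeeded.
 * Returns the number of bytes written, or -1 with errno set.
 */
static ssize_t write_dontcache(int fd, const void *buf, size_t len, off_t off,
			       int *used_dontcache)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	ret = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
	if (ret >= 0) {
		if (used_dontcache)
			*used_dontcache = 1;
		return ret;
	}

	/* Any error other than "flag not supported" is a real failure. */
	if (errno != EOPNOTSUPP && errno != EINVAL && errno != ENOSYS)
		return -1;

	if (used_dontcache)
		*used_dontcache = 0;
	return pwritev2(fd, &iov, 1, off, 0);
}
```

A caller would open the target (for example a block device node, or a temporary file from mkstemp() for testing), call write_dontcache(fd, buf, len, off, &used), and check the return value as it would for pwrite(). Reads can use preadv2() with the same flag in the same way.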
I tested this (with CONFIG_BUFFER_HEAD=y) for reads and writes on a
single block device on a VM, so results may be noisy.
Reads were tested on the root partition with a 45GB range (~2x RAM).
Writes were tested on a disabled swap partition (~1GB) in a memcg of size
244MB to force reclaim pressure.
Results:
===== READS (/dev/nvme0n1p2) =====
 sec   normal MB/s  dontcache MB/s
----  ------------  --------------
   1        1098.6          1609.0
   2        1270.3          1506.6
   3        1093.3          1576.5
   4        1141.8          2393.9
   5        1365.3          2793.8
   6        1324.6          2065.9
   7         879.6          1920.7
   8        1434.1          1662.4
   9        1184.9          1857.9
  10        1166.4          1702.8
  11        1161.4          1653.4
  12        1086.9          1555.4
  13        1198.5          1718.9
  14        1111.9          1752.2
----  ------------  --------------
 avg        1173.7          1828.8 (+56%)
==== WRITES (/dev/nvme0n1p3) =====
 sec   normal MB/s  dontcache MB/s
----  ------------  --------------
   1         692.4          9297.7
   2        4810.8          9342.8
   3        5221.7          2955.2
   4         396.7          8488.3
   5        7249.2          9249.3
   6        6695.4          1376.2
   7         122.9          9125.8
   8        5486.5          9414.7
   9        6921.5          8743.5
  10          27.9          8997.8
----  ------------  --------------
 avg        3762.5          7699.1 (+105%)
---
Changes in v4:
- 1/3: Move dropbehind deferral from folio-level to bio-level using
  BIO_COMPLETE_IN_TASK, per Matthew and Jan.
- 1/3: Work function yields on need_resched() to avoid hogging the CPU,
  per Jan.
- 2/3: New patch. Set BIO_COMPLETE_IN_TASK on iomap writeback bios for
  DONTCACHE folios, removing the need for XFS-specific workqueue
  deferral.
- 3/3: Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() for buffer_head
  path.
- 3/3: Update commit message to mention CONFIG_BUFFER_HEAD=n path.
- Link to v3: https://lore.kernel.org/r/20260227-blk-dontcache-v3-0-cd309ccd5868@xxxxxxxxxxxx

Changes in v3:
- 1/2: Convert dropbehind deferral to per-CPU folio_batches protected by
  local_lock using per-CPU work items, to reduce contention, per Jens.
- 1/2: Call folio_end_dropbehind_irq() directly from
  folio_end_writeback(), per Jens.
- 1/2: Add CPU hotplug dead callback to drain the departing CPU's folio
  batch.
- 2/2: Introduce block_write_begin_iocb(), per Christoph.
- 2/2: Dropped R-b due to changes.
- Link to v2: https://lore.kernel.org/r/20260225-blk-dontcache-v2-0-70e7ac4f7108@xxxxxxxxxxxx

Changes in v2:
- Add R-b from Jan Kara for 2/2.
- Add patch to defer dropbehind completion from IRQ context via a work
  item (1/2).
- Add initial performance numbers to cover letter.
- Link to v1: https://lore.kernel.org/r/20260218-blk-dontcache-v1-1-fad6675ef71f@xxxxxxxxxxxx
---
Tal Zussman (3):
      block: add BIO_COMPLETE_IN_TASK for task-context completion
      iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
      block: enable RWF_DONTCACHE for block devices

 block/bio.c                 | 84 ++++++++++++++++++++++++++++++++++++++++++++-
 block/fops.c                |  5 +--
 fs/buffer.c                 | 22 ++++++++++--
 fs/iomap/ioend.c            |  2 ++
 fs/xfs/xfs_aops.c           |  4 ---
 include/linux/blk_types.h   |  1 +
 include/linux/buffer_head.h |  3 ++
 7 files changed, 111 insertions(+), 10 deletions(-)
---
base-commit: 2961f841b025fb234860bac26dfb7fa7cb0fb122
change-id: 20260218-blk-dontcache-338133dd045e
Best regards,
--
Tal Zussman <tz2294@xxxxxxxxxxxx>