[PATCH] locking/rwsem: Remove reader optimistic lock stealing

From: Peng Wang

Date: Thu May 21 2026 - 06:10:30 EST


Reader optimistic lock stealing, introduced by commit 1a728dff855a
("locking/rwsem: Enable reader optimistic lock stealing") and made more
aggressive by commit 617f3ef95177 ("locking/rwsem: Remove reader
optimistic spinning"), allows a reader entering the slowpath to bypass
the wait queue and acquire the lock directly when WRITER_LOCKED and
HANDOFF bits are not set.

This causes severe writer starvation in workloads where readers hold
the lock for extended periods, such as Direct I/O operations which
hold inode->i_rwsem for the entire duration of iomap_dio_rw(). A
common example is log-structured storage where one thread appends via
DIO writes while another thread tails the log via DIO reads -- a
pattern seen in database redo-log replay and shared-storage
replication.

The problem occurs because:

1. A reader entering the slowpath (due to RWSEM_FLAG_WAITERS being set)
can still steal the lock as the steal condition only checks
WRITER_LOCKED and HANDOFF, not WAITERS. This is inconsistent with
the fast path which already blocks readers when WAITERS is set (via
RWSEM_READ_FAILED_MASK).

2. In the window between the last reader releasing the lock and the
waiting writer being scheduled (~10-100us), a new reader can steal
the lock in ~50-100ns. This race is structurally inevitable due to
the 1000x speed difference between atomic operations vs context
switching.

3. Each stolen read lock is held for the full DIO duration (potentially
milliseconds), and the pattern repeats until the 4ms HANDOFF timeout.
This effectively taxes every write operation with a ~4ms penalty.

Performance impact measured with a DIO mixed read/write workload
(1 writer + 1 reader, O_DIRECT, ext4):

NVMe SSD: Before After
Write-only baseline: 397 MB/s 397 MB/s (no change)
Mixed write throughput: 11 MB/s 350 MB/s (+31x)
Mixed write latency: 880 us 23 us (-38x)
Mixed read throughput: 95 MB/s 95 MB/s (no change)

Fixes: 617f3ef95177 ("locking/rwsem: Remove reader optimistic spinning")
Signed-off-by: Peng Wang <peng_wang@xxxxxxxxxxxxxxxxx>
---
kernel/locking/lock_events_list.h | 1 -
kernel/locking/rwsem.c | 33 ---------------------------------
2 files changed, 34 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 97fb6f3f840a..35b45576bee4 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -65,7 +65,6 @@ LOCK_EVENT(rwsem_opt_lock) /* # of opt-acquired write locks */
LOCK_EVENT(rwsem_opt_fail) /* # of failed optspins */
LOCK_EVENT(rwsem_opt_nospin) /* # of disabled optspins */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
-LOCK_EVENT(rwsem_rlock_steal) /* # of read locks by lock stealing */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
LOCK_EVENT(rwsem_rlock_handoff) /* # of read lock handoffs */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bda5577339c0..40b141c5765f 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1017,42 +1017,9 @@ static struct rw_semaphore __sched *
rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int state)
{
long adjustment = -RWSEM_READER_BIAS;
- long rcnt = (count >> RWSEM_READER_SHIFT);
struct rwsem_waiter waiter, *first;
DEFINE_WAKE_Q(wake_q);

- /*
- * To prevent a constant stream of readers from starving a sleeping
- * writer, don't attempt optimistic lock stealing if the lock is
- * very likely owned by readers.
- */
- if ((atomic_long_read(&sem->owner) & RWSEM_READER_OWNED) &&
- (rcnt > 1) && !(count & RWSEM_WRITER_LOCKED))
- goto queue;
-
- /*
- * Reader optimistic lock stealing.
- */
- if (!(count & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_HANDOFF))) {
- rwsem_set_reader_owned(sem);
- lockevent_inc(rwsem_rlock_steal);
-
- /*
- * Wake up other readers in the wait queue if it is
- * the first reader.
- */
- if ((rcnt == 1) && (count & RWSEM_FLAG_WAITERS)) {
- raw_spin_lock_irq(&sem->wait_lock);
- if (sem->first_waiter)
- rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
- &wake_q);
- raw_spin_unlock_irq(&sem->wait_lock);
- wake_up_q(&wake_q);
- }
- return sem;
- }
-
-queue:
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
--
2.39.3