Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()

From: Sebastian Andrzej Siewior

Date: Mon Jun 01 2026 - 09:49:30 EST

On 2026-06-01 14:01:07 [+0100], David Woodhouse wrote:
> On Mon, 2026-06-01 at 11:52 +0100, David Woodhouse wrote:
> > On Sat, 2026-05-30 at 16:40 +0200, Paolo Bonzini wrote:
> > > On Sat, May 30, 2026 at 3:04 PM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
> > > > >
> > > > > Yeah, I think so.
> > > > >
> > > > > The write side needs kvm->srcu so it would have to be yet another SRCU.
> > > > > I initially thought that sucks for the code that calls kvm_gpc_check(),
> > > > > but maybe not because it simply replaces read_lock/read_unlock.
> > > > >
> > > > > By using a seqcount for the data, SRCU only needs to be synchronized in
> > > > > gpc_unmap(). So, something like this:
> > > >
> > > > It isn't just gpc_unmap() which does the invalidation. We also
> > > > invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
> > > > which would also have to synchronize, wouldn't it?
> > >
> > > You're right, the write_lock_irq() there drains the readers and that
> > > is needed because khva is not pinned, only kmap()-ed.
> > >
> > > That is already broken for the OOM case under PREEMPT_RT, where
> > > rwlock_t becomes sleepable. But using SRCU would break it on
> > > !PREEMPT_RT as well.
> >
> > I don't think 'sleepable' is the problem per se, is it? *Why* does the
> > OOM killer use mmu_notifier_invalidate_range_start_nonblock()?
> >
> > Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
> > notifiers") did say:
> >
> >     There are several blockable mmu notifiers which might sleep in
> >     mmu_notifier_invalidate_range_start and that is a problem for the
> >     oom_reaper because it needs to guarantee a forward progress so it cannot
> >     depend on any sleepable locks.
> >
> > But that was in 2018, when mmap_lock was an rw_semaphore.
> >
> > Is "sleepable" still a problem even when PREEMPT_RT where almost
> > *everything* is now strictly sleepable? Wouldn't that mean drivers
> > aren't even allowed to take their own spinlocks?
…

> And then we see it even when taking kvm->mn_invalidate_lock:
>
> kvm_mmu_notifier_invalidate_range_start+0xac
> 0xffffffff8132732c is in kvm_mmu_notifier_invalidate_range_start (arch/x86/kvm/../../../virt/kvm/kvm_main.c:745).
> 740 * adjustments will be imbalanced.
> 741 *
> 742 * Pairs with the decrement in range_end().
> 743 */
> 744 spin_lock(&kvm->mn_invalidate_lock);
> 745 kvm->mn_active_invalidate_count++;
> 746 if (!mmu_notifier_range_blockable(range))
> 747 pr_info("KVM: non-blockable invalidate_range_start, non_block_count=%d\n", current->non_block_count);
> 748 spin_unlock(&kvm->mn_invalidate_lock);
> 749
>
>
> [ 427.919969] mmap: exit_mmap: delaying before mmu_notifier_release for kvm_oom_test
> [ 429.926972] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> [ 429.926978] in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 280, name: oom_reaper
> [ 429.926982] preempt_count: 0, expected: 0
> [ 429.926984] RCU nest depth: 0, expected: 0
> [ 429.926986] 4 locks held by oom_reaper/280:
> [ 429.926989] #0: ffff8a61da779cb0 (&mm->mmap_lock){....}-{3:3}, at: oom_reaper+0x150/0x520
> [ 429.927006] #1: ffffffffa0934f20 (mmu_notifier_invalidate_range_start){....}-{0:0}, at: zap_vma_for_reaping+0xb7/0x1d0
> [ 429.927019] #2: ffffffffa0934f78 (srcu){....}-{0:0}, at: __mmu_notifier_invalidate_range_start+0xae/0x340
> [ 429.927029] #3: ffff8a6240295360 (&kvm->mn_invalidate_lock){....}-{2:2}, at: kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
> [ 429.927044] CPU: 26 UID: 0 PID: 280 Comm: oom_reaper Tainted: G S I 7.1.0-rc2+ #2460 PREEMPT_{RT,(lazy)}
> [ 429.927051] Tainted: [S]=CPU_OUT_OF_SPEC, [I]=FIRMWARE_WORKAROUND
> [ 429.927053] Hardware name: Intel Corporation S2600CW/S2600CW, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
> [ 429.927055] Call Trace:
> [ 429.927058] <TASK>
> [ 429.927062] dump_stack_lvl+0x6e/0xa0
> [ 429.927074] __might_resched.cold+0xeb/0x100
> [ 429.927084] rt_spin_lock+0x6c/0x1a0
> [ 429.927092] ? kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
> [ 429.927102] kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
> [ 429.927110] ? sched_update_numa+0xa0/0x270
> [ 429.927129] __mmu_notifier_invalidate_range_start+0x129/0x340

Okay. This complains about non_block:

…
> [ 429.927260] KVM: non-blockable invalidate_range_start, non_block_count=1

and commit 312364f3534cc ("kernel.h: Add non_block_start/end()") says:

| Peter also asked whether we want to catch spinlocks on top, but Michal
| said those are less of a problem because spinlocks can't have an indirect
| dependency upon the page allocator and hence close the loop with the oom
| reaper.

so a lock which becomes sleepable on RT vs !RT shouldn't be a problem,
right? We also don't complain about scheduling within an
rcu_read_lock() section if it is part of spin_lock().

Sebastian