[PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock
From: Sonam Sanju
Date: Mon Mar 16 2026 - 03:12:22 EST
From: Sonam Sanju <ssanju@xxxxxxxxx>
irqfd_resampler_shutdown() calls synchronize_srcu_expedited() while
holding kvm->irqfds.resampler_lock. This can deadlock when multiple
irqfd_shutdown workers run concurrently on the kvm-irqfd-cleanup
workqueue during VM teardown (e.g. crosvm shutdown on Android):
CPU A (mutex holder) CPU B/C/D (mutex waiters)
irqfd_shutdown() irqfd_shutdown()
irqfd_resampler_shutdown() irqfd_resampler_shutdown()
mutex_lock(resampler_lock) <---- mutex_lock(resampler_lock) // BLOCKED
list_del_rcu(...) ...blocked...
synchronize_srcu_expedited() // Waiters block workqueue,
// waits for SRCU grace preventing SRCU grace
// period which requires period from completing
// workqueue progress --- DEADLOCK ---
The synchronize_srcu_expedited() in the else branch is called directly
within the mutex. In the if-last branch, kvm_unregister_irq_ack_notifier()
also calls synchronize_srcu_expedited() internally. Both paths can
block indefinitely because:
1. synchronize_srcu_expedited() waits for an SRCU grace period
2. SRCU grace period completion needs workqueue workers to run
3. The blocked mutex waiters occupy workqueue slots, preventing
progress
4. The mutex holder never releases the lock -> deadlock
Fix by performing all list manipulations and the last-entry check under
the mutex, then releasing the mutex before the SRCU synchronization.
This is safe because:
- list_del_rcu() removes the irqfd from resampler->list under the
mutex, so no concurrent reader or writer can access it.
- When last==true, list_del_rcu(&resampler->link) has already removed
the resampler from kvm->irqfds.resampler_list under the mutex, so
no other worker can find or operate on this resampler.
- kvm_unregister_irq_ack_notifier() uses its own locking
(kvm->irq_lock) and is safe to call without resampler_lock.
- synchronize_srcu_expedited() does not require any KVM mutex.
- kfree(resampler) is safe after SRCU sync guarantees all readers
have finished.
Signed-off-by: Sonam Sanju <ssanju@xxxxxxxxx>
---
virt/kvm/eventfd.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0e8b8a2c5b79..27bcf2b1a81d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
{
struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
struct kvm *kvm = resampler->kvm;
+ bool last = false;
mutex_lock(&kvm->irqfds.resampler_lock);
@@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
if (list_empty(&resampler->list)) {
list_del_rcu(&resampler->link);
+ last = true;
+ }
+
+ mutex_unlock(&kvm->irqfds.resampler_lock);
+
+ /*
+ * synchronize_srcu_expedited() (called explicitly below, or internally
+ * by kvm_unregister_irq_ack_notifier()) must not be invoked under
+ * resampler_lock. Holding the mutex while waiting for an SRCU grace
+ * period creates a deadlock: the blocked mutex waiters occupy workqueue
+ * slots that the SRCU grace period machinery needs to make forward
+ * progress.
+ */
+ if (last) {
kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
- /*
- * synchronize_srcu_expedited(&kvm->irq_srcu) already called
- * in kvm_unregister_irq_ack_notifier().
- */
kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
resampler->notifier.gsi, 0, false);
kfree(resampler);
} else {
synchronize_srcu_expedited(&kvm->irq_srcu);
}
-
- mutex_unlock(&kvm->irqfds.resampler_lock);
}
/*
--
2.34.1