Re: [PATCH rc v2] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
From: Shuai Xue
Date: Thu Mar 19 2026 - 07:19:31 EST
On 3/19/26 12:31 PM, Nicolin Chen wrote:
Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally, while both call pci_dev_reset_iommu_prepare/done().
As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
triggers a WARN_ON and returns -EBUSY, causing the entire device reset
to fail.
On the other hand, removing the outer calls in the PCI callers is unsafe.
As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
execute custom firmware waits after their inner pcie_flr() completes. If
the IOMMU protection relies solely on the inner reset, the IOMMU will be
unblocked prematurely while the device is still resetting.
Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.
Introduce a 'reset_cnt' in struct iommu_group. Safely increment/decrement
the reference counter in the nested calls, ensuring the IOMMU domains are
only restored when the outermost reset finally completes.
Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@xxxxxxxxxxxxxxx
Reported-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
Suggested-by: Kevin Tian <kevin.tian@xxxxxxxxx>
Signed-off-by: Nicolin Chen <nicolinc@xxxxxxxxxx>
---
Changelog
v2:
* Fix in the helpers by allowing re-entry
v1:
https://lore.kernel.org/all/20260318220028.1146905-1-nicolinc@xxxxxxxxxx/
drivers/iommu/iommu.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db51780954..16155097b27c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -68,6 +68,7 @@ struct iommu_group {
 	struct iommu_domain *resetting_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
+	unsigned int reset_cnt;
Nit: consider renaming reset_cnt to reset_depth or
reset_nesting to better convey that this tracks nesting level,
not a count of total resets.
 	unsigned int owner_cnt;
 	void *owner;
 };
@@ -3961,9 +3962,10 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 	guard(mutex)(&group->mutex);
 
-	/* Re-entry is not allowed */
-	if (WARN_ON(group->resetting_domain))
-		return -EBUSY;
+	if (group->resetting_domain) {
+		group->reset_cnt++;
+		return 0;
+	}
 
 	ret = __iommu_group_alloc_blocking_domain(group);
pci_dev_reset_iommu_prepare/done() have NO singleton group check.
They operate on the specific pdev passed in, but use group-wide
state (resetting_domain, reset_cnt) to track the reset lifecycle.
Interestingly, the broken_worker in patch 3 of the ATC timeout
series DOES have an explicit singleton check:
	if (list_is_singular(&group->devices)) {
		/* Note: only support group with a single device */
This reveals an implicit assumption: the entire prepare/done
mechanism works correctly only for singleton groups. For
multi-device groups:
- prepare() only detaches the specific pdev, leaving other
devices in the group still attached to the original domain
- The group-wide resetting_domain/reset_cnt state can be
corrupted by concurrent resets on different devices (as
discussed above)
If prepare/done is truly meant only for singleton groups, it
should enforce this explicitly:

	if (!list_is_singular(&group->devices))
		return -EOPNOTSUPP;

If it's meant to support multi-device groups, then the per-device
vs group-wide state mismatch needs to be resolved: either make
the state per-device, or detach/restore all devices in the group
together.
Thanks.
Shuai