Re: [PATCH v18 3/4] vfio/pci: Add a reset_done callback for vfio-pci driver
From: Alex Williamson
Date: Thu Jun 04 2026 - 15:57:35 EST
On Thu, 4 Jun 2026 10:17:04 -0700
Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:
> On 6/4/2026 1:28 AM, Keith Busch wrote:
> > On Wed, Jun 03, 2026 at 11:24:14AM -0700, Farhan Ali wrote:
> >> +static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)
> >> +{
> >> + struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
> >> +
> >> + if (!vdev->pci_saved_state)
> >> + return;
> >> +
> >> + pci_load_saved_state(pdev, vdev->pci_saved_state);
> >> + pci_restore_state(pdev);
> >> +}
> > Shouldn't there be a corresponding user space notification that the
> > device has been restored? There's an eventfd on the error detected side
> > so user space can know the device needs recovery, but how does it come
> > to know that the reset is completed?
>
> I think if the VFIO_DEVICE_RESET ioctl completes successfully it should
> be an indication that the reset has completed? AFAIU the ioctl will
> drive a reset via pci_try_reset_function(). If the reset completes
> successfully the reset_done() callback is called via pci_dev_restore().
> So I don't think we need an eventfd to notify on reset completion.
> Otherwise we would have the same problem today, where userspace is
> unaware that VFIO_DEVICE_RESET did indeed successfully reset the device,
> no? Or am I missing something?
I'm starting to feel a little uneasy about this. I asked Claude to
enumerate the restore steps in pci_restore_state() and where each one
sources the state it restores. Hopefully this ASCII table survives:
┌──────────────────────────┬────────────────────────┬─────────────────────┐
│ Step │ Source │ Snapshot-dependent? │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ EXP cap save buffer │ │
│ pci_restore_pcie_state │ (pci_find_saved_cap, │ YES │
│ │ cap.data) │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ live │ │
│ pci_restore_pasid_state │ pdev->pasid_enabled + │ no │
│ │ pasid_features │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_pri_state │ live pdev->pri_enabled │ no │
│ │ + pri_reqs_alloc │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_ats_state │ live dev->ats_enabled │ no │
│ │ + ats_stu │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_vc_state │ VC ext-cap save buffer │ YES │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ live resource_size() │ │
│ pci_restore_rebar_state │ (re-derived, written │ no │
│ │ to hw) │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_dpc_state │ DPC ext-cap save │ YES │
│ │ buffer │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_ptm_state │ PTM ext-cap save │ YES │
│ │ buffer │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ TPH ext-cap save │ │
│ pci_restore_tph_state │ buffer, gated on live │ YES (gated) │
│ │ tph_enabled │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_aer_clear_status │ clears hw status (not │ n/a │
│ │ a restore) │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_aer_state │ ERR ext-cap save │ YES │
│ │ buffer │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ saved_config_space[16] │ │
│ pci_restore_config_space │ — type-0 header │ YES │
│ │ (COMMAND, BARs, │ │
│ │ cacheline…) │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_pcix_state │ PCI-X cap save buffer │ YES │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_msi_state │ live msi_desc list + │ no │
│ │ msi(x)_enabled │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_enable_acs │ re-derived from ACS │ no │
│ │ policy │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_iov_state │ live dev->sriov │ no │
│ │ (num_VFs, ctrl) │ │
└──────────────────────────┴────────────────────────┴─────────────────────┘
For things like MSI/MSI-X, SR-IOV, ReBAR, etc., we actually restore
from kernel-internal state rather than from the save buffer, so loading
the open-time snapshot has no effect on them. However, one thing in
that list stands out: TPH.

We don't yet support enabling TPH, but there are series on the list
that propose to add it. The TPH buffer in the saved state is allocated
simply because the capability is present. At open, TPH is disabled and
the save buffer is left untouched, all zeros. If TPH is then enabled
and the device is reset, the pre-reset save populates the TPH save
buffer and we restore that state post-reset. With the change here,
reset_done would then load the open-time saved state and restore
again. Since the live TPH state is enabled, that second restore is not
skipped, and it pushes the original open-time buffer contents, zeros,
to the device.
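
To make the sequence concrete, here is a toy userspace model in C. It
is only a sketch, not kernel code: struct toy_dev, toy_save_tph(),
toy_restore_tph(), toy_run() and the register value are all made-up
names for illustration. It assumes the TPH save is skipped while TPH is
disabled, which is what would leave the open-time buffer zeroed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TPH_USER_VALUE 0x101u	/* arbitrary non-zero user setting */

/* Hypothetical stand-in for the pieces of device state that matter here. */
struct toy_dev {
	uint32_t hw_tph;	/* models the device's TPH control register */
	bool tph_enabled;	/* models the live tph_enabled flag */
	uint32_t save_buf;	/* models the TPH ext-cap save buffer */
};

/* Save hardware state; assumed to leave the buffer alone while TPH is off. */
static void toy_save_tph(struct toy_dev *d)
{
	if (d->tph_enabled)
		d->save_buf = d->hw_tph;
}

/* Restore from the save buffer, gated on the live enable flag. */
static void toy_restore_tph(struct toy_dev *d)
{
	if (d->tph_enabled)
		d->hw_tph = d->save_buf;
}

/* Run open -> user enables TPH -> reset; return the final register value. */
uint32_t toy_run(bool with_reset_done)
{
	struct toy_dev d = { 0 };
	uint32_t open_snapshot;

	toy_save_tph(&d);		/* open: TPH off, buffer stays zero */
	open_snapshot = d.save_buf;	/* models vdev->pci_saved_state */
	assert(open_snapshot == 0);

	d.tph_enabled = true;		/* user enables TPH after open */
	d.hw_tph = TOY_TPH_USER_VALUE;

	toy_save_tph(&d);		/* pre-reset save captures user state */
	d.hw_tph = 0;			/* the reset clears the register */
	toy_restore_tph(&d);		/* post-reset restore of user state */

	if (with_reset_done) {
		d.save_buf = open_snapshot;	/* models pci_load_saved_state() */
		toy_restore_tph(&d);		/* models pci_restore_state() */
	}
	return d.hw_tph;
}
```

In this model toy_run(false) leaves the user's value in the register,
while toy_run(true) ends with the register zeroed even though the live
enable flag is still set.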
This would result in a user-visible change and, maybe more importantly,
shows that we're relying on ad-hoc behavior, without any specific
policy that makes this work reliably. It actually seems like restoring
the original state is only really valid in the close function, where
we've disabled anything the user might have enabled. Thanks,
Alex