Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap

From: Nicolin Chen

Date: Wed Mar 18 2026 - 23:12:33 EST


On Thu, Mar 19, 2026 at 03:08:05AM +0000, Tian, Kevin wrote:
> > From: Samiullah Khawaja <skhawaja@xxxxxxxxxx>
> > Sent: Thursday, March 19, 2026 6:07 AM
> >
> > Hi Nicolin,
> >
> > On Wed, Mar 18, 2026 at 12:26:33PM -0700, Nicolin Chen wrote:
> > >On Wed, Mar 18, 2026 at 07:36:20AM +0000, Tian, Kevin wrote:
> > >> > From: Nicolin Chen <nicolinc@xxxxxxxxxx>
> > >> > Sent: Wednesday, March 18, 2026 3:16 AM
> > >> >
> > >> > An ATC invalidation timeout is a fatal error. While the SMMUv3
> > >> > hardware is aware of the timeout via a GERROR interrupt, the driver
> > >> > thread issuing the commands lacks a direct mechanism to verify
> > >> > whether its specific batch was the cause or not, as polling the
> > >> > CMD_SYNC status doesn't natively return a failure code, making it
> > >> > very difficult to coordinate per-device recovery.
> > >> >
> > >> > Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge
> > >> > this gap. When the ISR detects an ATC timeout, set the bit
> > >> > corresponding to the physical CMDQ index of the faulting CMD_SYNC
> > >> > command.
> > >> >
> > >>
> > >> It's nice to see that sw can identify the faulting sync command
> > >> upon an ATC timeout! On VT-d that's not feasible when multiple wait
> > >> descriptors (similar to CMD_SYNC) are in flight... :/
> > >
> > >Actually SMMU doesn't know which device is faulting when CMD_SYNC
> >
> > VT-d is able to find out the SID of the device whose device TLB
> > invalidation timed out, by using the SID reported in the
> > "Invalidation Queue Error Record Register" (VT-d Specs 11.4.9.9).
>
> Yes, but when there are multiple submissions (each with a wait descriptor)
> fetched/handled by the hw and then an invalidation timeout comes, all
> pending wait descriptors will be aborted (not just the one corresponding
> to the timeout). In this case all affected submitters need to retry.

This sounds similar to SMMU then.

Nicolin
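
For readers following the thread, the bitmap scheme described in the quoted
commit message could be sketched roughly as below. This is a userspace model
using C11 atomics, not the patch's code and not the kernel bitops API. Only
the name atc_sync_timeouts comes from the thread; cmdq_model, CMDQ_ENTRIES,
mark_atc_timeout() and test_and_clear_atc_timeout() are hypothetical names
chosen for illustration, and the queue depth is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CMDQ_ENTRIES  256u   /* assumed queue depth, for illustration only */
#define BITS_PER_WORD 64u
#define BITMAP_WORDS  (CMDQ_ENTRIES / BITS_PER_WORD)

/* Hypothetical stand-in for the cmdq structure; only the bitmap is modelled. */
struct cmdq_model {
	_Atomic uint64_t atc_sync_timeouts[BITMAP_WORDS];
};

/* "ISR" side: mark the queue slot of the CMD_SYNC that hit the ATC timeout. */
static void mark_atc_timeout(struct cmdq_model *q, uint32_t sync_idx)
{
	uint32_t i = sync_idx % CMDQ_ENTRIES;

	/* Atomic OR, so concurrent markers and readers need no lock. */
	atomic_fetch_or(&q->atc_sync_timeouts[i / BITS_PER_WORD],
			UINT64_C(1) << (i % BITS_PER_WORD));
}

/*
 * Issuer side: after polling its own CMD_SYNC, atomically test and clear
 * the bit for that slot. Returns true if that CMD_SYNC was marked.
 */
static bool test_and_clear_atc_timeout(struct cmdq_model *q, uint32_t sync_idx)
{
	uint32_t i = sync_idx % CMDQ_ENTRIES;
	uint64_t mask = UINT64_C(1) << (i % BITS_PER_WORD);
	uint64_t old;

	/* Atomic AND with the inverted mask clears only this slot's bit. */
	old = atomic_fetch_and(&q->atc_sync_timeouts[i / BITS_PER_WORD], ~mask);
	return (old & mask) != 0;
}
```

In this model a thread that issued the CMD_SYNC in slot N would call
test_and_clear_atc_timeout(q, N) once its poll completes, and start
per-device recovery only if the call returns true. Clearing the bit in the
same atomic step keeps a stale mark from being seen by a later CMD_SYNC that
reuses the slot after the queue wraps.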