Re: [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events

From: Dan Williams (nvidia)

Date: Mon May 11 2026 - 19:29:12 EST


Bowman, Terry wrote:
> On 5/8/2026 10:49 PM, Dan Williams (nvidia) wrote:
> > Jonathan Cameron wrote:
> >> On Thu, 7 May 2026 13:33:45 -0500
> >> "Bowman, Terry" <terry.bowman@xxxxxxx> wrote:
> > [..]
> >>>> This concerns me (sorry I wasn't paying attention to the v16 thread).
> >>>> It is a userspace regression against code that is out in the wild and typically
> >>>> not updated in sync with the kernel.
> >>>>
> >>>> If you are suggesting breaking ras-daemon at the very least +CC the maintainer.
> >
> > Sorry, that was not the intent, see below.
> >
> >>>>
> >>>> To get to a unified tracepoint add a new one that does what you want, but
> >>>> maintain the existing ones as well. Userspace can then migrate and maybe
> >>>> in 5+ years time we can delete the non unified ones.
> >>>>
> >>>> No actual comments on the code, just left it all here for Mauro,
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jonathan
> >>>>
> >>>
> >>> Dan was clear about using a single set of CE and UE handlers for all CXL RAS
> >>> protocol errors. While I understand there may be concerns, please direct any
> >>> objections to Dan and clarify what changes are required to avoid this
> >>> repeatedly going back and forth.
> >>>
> >>> [1] https://lore.kernel.org/linux-cxl/69cb2d5ba3111_178904100b7@dwillia2-mobl4.notmuch/
> >>
> >> Sure - Dan's on this thread so I'm sure he'll see it sooner or later.
> >>
> >> Perhaps I'm missing something that makes this less critical than it appears.
> >
> > No, it is breakage and a thinko on my part on the advice to Terry on the
> > backwards compatibility rules for tracepoints. At the time I was only
> > tracking data type and order of the payload. I.e. string at same
> > position. However, the name of the argument is ABI.
> >
> > Something like this incremental fixup I think gets this back on track.
> > It keeps legacy ABI support for "memdev" field in the payload. It
> > incrementally lets updated userspace understand "port" and "dport"
> > events. It stops us from growing a new set of events just to update the
> > arguments. It enhances the CPER events to now handle switch ports in
> > addition to endpoint ports.
> >
> > The bulk of the change is passing @port and @dport to the CXL trace
> > events instead of a plain @dev.
> >
>
> Thanks Dan and Jonathan,
>
> I have a few questions.
>
> Does this miss logging the Upstream SwitchPort device errors? Add another
> entry "uport=$"?
>
> How does the user know which of the devices (memdev, port, or dport) is the
> erroring device? Do the traces need another string variable indicating which
> device triggered the error?

I expect that can be determined from which fields get populated.

Endpoint:
memdev=memX port=endpointY dport= host=parent(memX)

Downstream:
memdev= port=portX dport=dport_dev(dportY) host=uport_dev(portX)

Upstream:
memdev= port=portX dport= host=uport_dev(portX)

If dport= is populated, that is the device that triggered the error;
otherwise it is the device named by host=.
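
For illustration only, here is a rough userspace sketch of that rule. It
assumes the event's key=value fields have already been parsed into a dict
of strings, with an unpopulated field showing up as an empty string. The
function name and the dict representation are hypothetical, and this is
not taken from rasdaemon or any existing tool.

```python
# Hypothetical helper: classify a parsed CXL RAS trace record using the
# three field layouts listed above. Only the field names (memdev, port,
# dport, host) come from the mail; everything else is an assumption.

def classify_cxl_ras_event(fields):
    """Return (kind, erroring_device) for a dict of parsed event fields."""
    memdev = fields.get("memdev", "")
    dport = fields.get("dport", "")
    host = fields.get("host", "")

    if memdev:
        # Only the Endpoint layout populates memdev=.
        kind = "endpoint"
    elif dport:
        # memdev= empty and dport= populated: Downstream layout.
        kind = "downstream"
    else:
        # memdev= and dport= both empty: Upstream layout.
        kind = "upstream"

    # A populated dport= names the device that triggered the error,
    # otherwise it is the host= value.
    erroring_device = dport if dport else host
    return kind, erroring_device
```

An updated consumer could use the first element to pick a decode path,
while a legacy consumer that only reads memdev= keeps working for
endpoint events.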

> And, I need to confirm: the Endpoint is NULL unless the CXL Port is an Endpoint
> Port?

You mean memdev is empty, right?