Re: [PATCH net-next v2 1/2] net/smc: transition to RDMA core CQ pooling

From: D. Wythe

Date: Fri Jun 05 2026 - 00:04:29 EST


On Thu, Jun 04, 2026 at 10:36:37AM +0200, Paolo Abeni wrote:
> From: AI Reviewer <ai@xxxxxxxxxxx>
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev/
> ---
> net/smc: transition to RDMA core CQ pooling
>
> The current SMC-R implementation relies on global per-device CQs
> and manual polling within tasklets, which introduces severe
> scalability bottlenecks due to global lock contention and tasklet
> scheduling overhead, resulting in poor performance as concurrency
> increases.
>
> Refactor the completion handling to utilize the ib_cqe API and
> standard RDMA core CQ pooling.
>
> > diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
> > index 9bb495707445e..a6d6b81830d38 100644
> > --- a/net/smc/smc_ib.c
> > +++ b/net/smc/smc_ib.c
>
> [ ... ]
>
> > @@ -657,38 +645,59 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk)
> > if (lnk->roce_qp)
> > ib_destroy_qp(lnk->roce_qp);
> > lnk->roce_qp = NULL;
> > + if (lnk->ib_cq) {
> > + ib_cq_pool_put(lnk->ib_cq, lnk->nr_cqe);
> > + lnk->ib_cq = NULL;
> > + }
> > }
> >
> > /* create a queue pair within the protection domain for a link */
> > int smc_ib_create_queue_pair(struct smc_link *lnk)
> > {
> > + int max_send_wr, max_recv_wr, rc;
> > + struct ib_cq *cq;
> > +
> > + /* include unsolicited rdma_writes as well,
> > + * there are max. 2 RDMA_WRITE per 1 WR_SEND.
> > + */
> > + max_send_wr = 3 * lnk->lgr->max_send_wr + 1; /* +1 for ib_drain_sq() */
>
> [Severity: High]
> Does this calculation account for all possible work requests?
>
> To support ib_drain_sq(), this patch increases SQ capacity by exactly 1,
> assuming standard data transmissions consume up to 3 WRs each. However, SMC
> also posts FastReg WRs (smc_wr_reg_send) and SMC-Rv2 LLC control messages
> (smc_wr_tx_v2_send) to the same SQ.
>
> If these independent WRs are posted while data TX slots are saturated, the
> total in-flight WRs will consume the remaining +1 slot reserved for the drain
> WR.
>
> When teardown occurs, ib_drain_sq() will attempt to post its dummy WR, but
> ib_post_send() will fail with -ENOMEM because the SQ is full. ib_drain_sq()
> responds to this failure by returning immediately without waiting for
> completions, unblocking teardown and freeing link structures while the
> hardware is still asynchronously flushing the SQ. When the softirq polls the
> flushed CQEs, it will access the freed ib_cqe pointers, leading to a
> use-after-free.
>

Good catch. The +1 is indeed not enough.

The SQ has three independent producers but only the data path (3*N) was counted.
FastReg (link->wr_reg) and the SMC-Rv2 v2 SEND (link->wr_tx_v2_ib) are both link-unique,
so up to 3N + 2 WRs can be in flight when __ib_drain_sq() posts its drain WR.

With only +1 the __ib_drain_sq() may return without draining, and teardown frees the link
while WRs are still in flight.

The undersizing itself is pre-existing — 3*N never reserved room for FastReg/v2,
but it was harmless before. This series only turns it fatal by moving to ib_drain_qp(),
which actually posts a WR. Since it was benign on its own, I'll just fix it
here in the next revision rather than as a separate patch, bumping the
reservation to +3 (drain + FastReg + v2).

> > + max_recv_wr = lnk->lgr->max_recv_wr + 1; /* +1 for ib_drain_rq() */
> > +
> > + cq = ib_cq_pool_get(lnk->smcibdev->ibdev, max_send_wr + max_recv_wr, -1,
> > + IB_POLL_SOFTIRQ);
>
> [ ... ]
>
> > diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> > index 59c92b46945c5..130bc6c26fb31 100644
> > --- a/net/smc/smc_wr.c
> > +++ b/net/smc/smc_wr.c
>
> [ ... ]
>
> > @@ -63,109 +60,67 @@ void smc_wr_tx_wait_no_pending_sends(struct smc_link *link)
>
> [ ... ]
>
> > if (pnd_snd_idx == link->wr_tx_cnt) {
> > - if (link->lgr->smc_version != SMC_V2 ||
> > - link->wr_tx_v2_pend->wr_id != wc->wr_id)
> > - return;
> > - link->wr_tx_v2_pend->wc_status = wc->status;
> > - memcpy(&pnd_snd, link->wr_tx_v2_pend, sizeof(pnd_snd));
> > - /* clear the full struct smc_wr_tx_pend including .priv */
> > - memset(link->wr_tx_v2_pend, 0,
> > - sizeof(*link->wr_tx_v2_pend));
> > memset(link->lgr->wr_tx_buf_v2, 0,
> > sizeof(*link->lgr->wr_tx_buf_v2));
>
> [Severity: High]
> This is a pre-existing issue, but while reviewing this code I noticed a
> potential memory leak and NULL pointer dereference in smc_wr_tx_put_slot().
>
> In the error path of smc_wr_tx_put_slot(), the code attempts to clear the V2
> pending slots:
>
> net/smc/smc_wr.c:smc_wr_tx_put_slot() {
> ...
> } else if (link->lgr->smc_version == SMC_V2 &&
> pend->idx == link->wr_tx_cnt) {
> /* Large v2 buffer */
> memset(&link->wr_tx_v2_pend, 0,
> sizeof(link->wr_tx_v2_pend));
> memset(&link->lgr->wr_tx_buf_v2, 0,
> sizeof(link->lgr->wr_tx_buf_v2));
> ...
> }
>
> Because these target variables are pointers, sizeof evaluates to 8 bytes. The
> memset() call zeroes the pointer variable itself rather than clearing the
> underlying structure, which leaks the allocated memory.
>
> If smc_wr_tx_get_v2_slot() is called subsequently, it will attempt to
> dereference the now-NULL link->wr_tx_v2_pend, leading to a direct NULL pointer
> dereference.
>
> Could this be updated to use sizeof(*link->wr_tx_v2_pend) similar to the code
> removed above?
>
> [ ... ]
>

Makes sense. Since this is pre-existing and unrelated to this series,
I'll send it as a separate patch.

> > @@ -431,94 +367,74 @@ static inline void smc_wr_rx_demultiplex(struct ib_wc *wc)
> > {
>
> [ ... ]
>
> > }
> >
> > -static inline void smc_wr_rx_process_cqes(struct ib_wc wc[], int num)
> > +static void smc_wr_rx_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
> > {
> > - struct smc_link *link;
> > - int i;
> > + struct smc_link *link = wc->qp->qp_context;
> >
> > - for (i = 0; i < num; i++) {
> > - link = wc[i].qp->qp_context;
> > - link->wr_rx_id_compl = wc[i].wr_id;
> > - if (wc[i].status == IB_WC_SUCCESS) {
> > - link->wr_rx_tstamp = jiffies;
> > - smc_wr_rx_demultiplex(&wc[i]);
> > - smc_wr_rx_post(link); /* refill WR RX */
> > - } else {
> > - /* handle status errors */
> > - switch (wc[i].status) {
> > - case IB_WC_RETRY_EXC_ERR:
> > - case IB_WC_RNR_RETRY_EXC_ERR:
> > - case IB_WC_WR_FLUSH_ERR:
> > - smcr_link_down_cond_sched(link);
> > - if (link->wr_rx_id_compl == link->wr_rx_id)
> > - wake_up(&link->wr_rx_empty_wait);
> > - break;
> > - default:
> > - smc_wr_rx_post(link); /* refill WR RX */
> > - break;
> > - }
> > + if (wc->status == IB_WC_SUCCESS) {
> > + link->wr_rx_tstamp = jiffies;
> > + smc_wr_rx_demultiplex(wc);
> > + smc_wr_rx_post(link, wc->wr_cqe); /* refill WR RX */
>
> [Severity: Critical]
> Can this cause a use-after-free when racing with queue teardown?
>
> The patch replaces smc_wr_drain_cq() with ib_drain_qp().
> ib_drain_rq() transitions the QP to IB_QPS_ERR and posts a dummy drain WR to
> wait for completions.
>
> However, smc_wr_rx_process_cqe() unconditionally reposts a new WR when
> processing a successful packet completion. If a valid RX completion is polled
> in the softirq concurrently with ib_drain_rq(), the softirq will post a new WR
> after the drain WR has been posted.
>
> Since the QP is in the ERR state, the new WR will be immediately completed
> with a flush error, but its CQE will be generated after the drain WR's CQE.
> This allows ib_drain_rq() to return and unblock link teardown, freeing
> link->wr_rx_ibs.
>
> When the softirq subsequently polls the flush CQE for the newly posted WR,
> it will dereference wc->wr_cqe->done. Since wc->wr_cqe points to the freed
> smc_ib_recv_wr, this results in a use-after-free.
>

The analysis is right, and unlike the other two this one is introduced by this
series. The old smc_wr_drain_cq() was a pure counting wait that posted
no WR, so the unconditional repost was harmless. By switching to
ib_drain_qp(), which relies on the drain WR being the last WR on the
queue, the same repost now breaks that invariant and opens the UAF
window.

I'll fix it in the next revision by quiescing the repost path
before draining, so the drain WR is guaranteed to be the last one.

D. Wythe

> > + } else {
> > + /* handle status errors */
> > + switch (wc->status) {
> > + case IB_WC_RETRY_EXC_ERR:
> > + case IB_WC_RNR_RETRY_EXC_ERR:
> > + case IB_WC_WR_FLUSH_ERR:
> > + smcr_link_down_cond_sched(link);
> > + break;
> > + default:
> > + smc_wr_rx_post(link, wc->wr_cqe); /* refill WR RX */
> > + break;
> > }
> > }
> > }
> --
> This is an AI-generated review.