Re: [RFC PATCH 2/3] media: qcom: camss: Add CAMSS Offline Processing Engine driver

From: Loic Poulain

Date: Mon Mar 23 2026 - 11:53:07 EST


Hi Bryan,

On Mon, Mar 23, 2026 at 2:43 PM Bryan O'Donoghue <bod@xxxxxxxxxx> wrote:
>
> On 23/03/2026 12:58, Loic Poulain wrote:
> > Provide a initial implementation for the Qualcomm Offline Processing
> > Engine (OPE). OPE is a memory-to-memory hardware block designed for
> > image processing on a source frame. Typically, the input frame
> > originates from the SoC CSI capture path, though not limited to.
> >
> > The hardware architecture consists of Fetch Engines and Write Engines,
> > connected through intermediate pipeline modules:
> > [FETCH ENGINES] => [Pipeline Modules] => [WRITE ENGINES]
> >
> > Current Configuration:
> > Fetch Engine: One fetch engine is used for Bayer frame input.
> > Write Engines: Two display write engines for Y and UV planes output.
> >
> > Enabled Pipeline Modules:
> > CLC_WB: White balance (channel gain configuration)
> > CLC_DEMO: Demosaic (Bayer to RGB conversion)
> > CLC_CHROMA_ENHAN: RGB to YUV conversion
> > CLC_DOWNSCALE*: Downscaling for UV and Y planes
> >
> > Default configuration values are based on public standards such as BT.601.
> >
> > Processing Model:
> > OPE processes frames in stripes of up to 336 pixels. Therefore, frames must
> > be split into stripes for processing. Each stripe is configured after the
> > previous one has been acquired (double buffered registers). To minimize
> > inter-stripe latency, stripe configurations are generated ahead of time.
>
> A yavata command set showing usage would be appreciated.

AFAIK, yavta does not (yet) support M2M devices, but I can probably
use another tool.
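For reference, v4l2-ctl from v4l-utils can drive M2M devices. A possible
invocation might look like the following (device node, resolution and
pixel formats are placeholders, not from this patch):

```
# Feed one Bayer frame into the M2M node and capture the NV12 result:
v4l2-ctl -d /dev/video0 \
        --set-fmt-video-out width=1920,height=1080,pixelformat=pRAA \
        --set-fmt-video width=1920,height=1080,pixelformat=NV12 \
        --stream-out-mmap --stream-from frame.raw \
        --stream-mmap --stream-to frame.nv12 --stream-count 1
```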

>
> >
> > Signed-off-by: Loic Poulain <loic.poulain@xxxxxxxxxxxxxxxx>
> > ---
> > drivers/media/platform/qcom/camss/Makefile | 4 +
> > drivers/media/platform/qcom/camss/camss-ope.c | 2058 +++++++++++++++++
> > 2 files changed, 2062 insertions(+)
> > create mode 100644 drivers/media/platform/qcom/camss/camss-ope.c
> >
> > diff --git a/drivers/media/platform/qcom/camss/Makefile b/drivers/media/platform/qcom/camss/Makefile
> > index 5e349b491513..67f261ae0855 100644
> > --- a/drivers/media/platform/qcom/camss/Makefile
> > +++ b/drivers/media/platform/qcom/camss/Makefile
> > @@ -29,3 +29,7 @@ qcom-camss-objs += \
> > camss-format.o \
> >
> > obj-$(CONFIG_VIDEO_QCOM_CAMSS) += qcom-camss.o
> > +
> > +qcom-camss-ope-objs += camss-ope.o
> > +
> > +obj-$(CONFIG_VIDEO_QCOM_CAMSS) += qcom-camss-ope.o
>
> Needs a Kconfig entry.

ack.
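A sketch of what such an entry could look like, mirroring the existing
VIDEO_QCOM_CAMSS option (the symbol name and select list are assumptions,
not taken from the patch):

```
config VIDEO_QCOM_CAMSS_OPE
	tristate "Qualcomm CAMSS Offline Processing Engine (OPE)"
	depends on VIDEO_QCOM_CAMSS
	select V4L2_MEM2MEM_DEV
	select VIDEOBUF2_DMA_CONTIG
```

The Makefile line would then become obj-$(CONFIG_VIDEO_QCOM_CAMSS_OPE)
so the OPE module can be enabled independently.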

> > +
> > +#define OPE_RESET_TIMEOUT_MS 100
> > +
> > +/* Expected framerate for power scaling */
> > +#define DEFAULT_FRAMERATE 60
> > +
> > +/* Downscaler helpers */
> > +#define Q21(v) (((uint64_t)(v)) << 21)
> > +#define DS_Q21(n, d) ((uint32_t)(((uint64_t)(n) << 21) / (d)))
>
> u64 and u32 here.

ok.

> > +
> > +static inline char *print_fourcc(u32 fmt)
> > +{
> > + static char code[5];
> > +
> > + code[0] = (unsigned char)(fmt & 0xff);
> > + code[1] = (unsigned char)((fmt >> 8) & 0xff);
> > + code[2] = (unsigned char)((fmt >> 16) & 0xff);
> > + code[3] = (unsigned char)((fmt >> 24) & 0xff);
> > + code[4] = '\0';
> > +
> > + return code;
> > +}
>
> This is a bug

Indeed, I will use %p4cc as you recommended in a similar series.

> > +
> > +static void ope_prog_bayer2rgb(struct ope_dev *ope)
> > +{
> > + /* Fixed Settings */
> > + ope_write_pp(ope, 0x860, 0x4001);
> > + ope_write_pp(ope, 0x868, 128);
> > + ope_write_pp(ope, 0x86c, 128 << 20);
> > + ope_write_pp(ope, 0x870, 102);
>
> What are the magic numbers about ? Please define bit-fields and offsets.

There are some registers I cannot disclose today, which have to be
configured with working values, similar to some sensor configurations
in media/i2c.

> Parameters passed in from user-space/libcamera and then translated to
> registers etc.

The above fixed settings will not be part of the initial parameters.

>
> > +}
> > +
> > +static void ope_prog_wb(struct ope_dev *ope)
> > +{
> > + /* Default white balance config */
> > + u32 g_gain = OPE_WB(1, 1);
> > + u32 b_gain = OPE_WB(3, 2);
> > + u32 r_gain = OPE_WB(3, 2);
> > +
> > + ope_write_pp(ope, OPE_PP_CLC_WB_GAIN_WB_CFG(0), g_gain);
> > + ope_write_pp(ope, OPE_PP_CLC_WB_GAIN_WB_CFG(1), b_gain);
> > + ope_write_pp(ope, OPE_PP_CLC_WB_GAIN_WB_CFG(2), r_gain);
> > +
> > + ope_write_pp(ope, OPE_PP_CLC_WB_GAIN_MODULE_CFG, OPE_PP_CLC_WB_GAIN_MODULE_CFG_EN);
> > +}
>
> Fixed gains will have to come from real data.

These gains will indeed need to be configurable, most likely via ISP
parameters. Here, they have been adjusted based on the colorbar test
pattern from imx219 sensors, but also tested with real captures.

>
> > +
> > +static void ope_prog_stripe(struct ope_ctx *ctx, struct ope_stripe *stripe)
> > +{
> > + struct ope_dev *ope = ctx->ope;
> > + int i;
> > +
> > + dev_dbg(ope->dev, "Context %p - Programming S%u\n", ctx, ope_stripe_index(ctx, stripe));
> > +
> > + /* Fetch Engine */
> > + ope_write_rd(ope, OPE_BUS_RD_CLIENT_0_UNPACK_CFG_0, stripe->src.format);
> > + ope_write_rd(ope, OPE_BUS_RD_CLIENT_0_RD_BUFFER_SIZE,
> > + (stripe->src.width << 16) + stripe->src.height);
> > + ope_write_rd(ope, OPE_BUS_RD_CLIENT_0_ADDR_IMAGE, stripe->src.addr);
> > + ope_write_rd(ope, OPE_BUS_RD_CLIENT_0_RD_STRIDE, stripe->src.stride);
> > + ope_write_rd(ope, OPE_BUS_RD_CLIENT_0_CCIF_META_DATA,
> > + FIELD_PREP(OPE_BUS_RD_CLIENT_0_CCIF_MD_PIX_PATTERN, stripe->src.pattern));
> > + ope_write_rd(ope, OPE_BUS_RD_CLIENT_0_CORE_CFG, OPE_BUS_RD_CLIENT_0_CORE_CFG_EN);
> > +
> > + /* Write Engines */
> > + for (i = 0; i < OPE_WR_CLIENT_MAX; i++) {
> > + if (!stripe->dst[i].enabled) {
> > + ope_write_wr(ope, OPE_BUS_WR_CLIENT_CFG(i), 0);
> > + continue;
> > + }
> > +
> > + ope_write_wr(ope, OPE_BUS_WR_CLIENT_ADDR_IMAGE(i), stripe->dst[i].addr);
> > + ope_write_wr(ope, OPE_BUS_WR_CLIENT_IMAGE_CFG_0(i),
> > + (stripe->dst[i].height << 16) + stripe->dst[i].width);
> > + ope_write_wr(ope, OPE_BUS_WR_CLIENT_IMAGE_CFG_1(i), stripe->dst[i].x_init);
> > + ope_write_wr(ope, OPE_BUS_WR_CLIENT_IMAGE_CFG_2(i), stripe->dst[i].stride);
> > + ope_write_wr(ope, OPE_BUS_WR_CLIENT_PACKER_CFG(i), stripe->dst[i].format);
> > + ope_write_wr(ope, OPE_BUS_WR_CLIENT_CFG(i),
> > + OPE_BUS_WR_CLIENT_CFG_EN + OPE_BUS_WR_CLIENT_CFG_AUTORECOVER);
> > + }
> > +
> > + /* Downscalers */
> > + for (i = 0; i < OPE_DS_MAX; i++) {
> > + struct ope_dsc_config *dsc = &stripe->dsc[i];
> > + u32 base = ope_ds_base[i];
> > + u32 cfg = 0;
> > +
> > + if (dsc->input_width != dsc->output_width) {
> > + dsc->phase_step_h |= DS_RESOLUTION(dsc->input_width,
> > + dsc->output_width) << 30;
> > + cfg |= OPE_PP_CLC_DOWNSCALE_MN_DS_CFG_H_SCALE_EN;
> > + }
> > +
> > + if (dsc->input_height != dsc->output_height) {
> > + dsc->phase_step_v |= DS_RESOLUTION(dsc->input_height,
> > + dsc->output_height) << 30;
> > + cfg |= OPE_PP_CLC_DOWNSCALE_MN_DS_CFG_V_SCALE_EN;
> > + }
> > +
> > + ope_write_pp(ope, OPE_PP_CLC_DOWNSCALE_MN_DS_CFG(base), cfg);
> > + ope_write_pp(ope, OPE_PP_CLC_DOWNSCALE_MN_DS_IMAGE_SIZE_CFG(base),
> > + ((dsc->input_width - 1) << 16) + dsc->input_height - 1);
> > + ope_write_pp(ope, OPE_PP_CLC_DOWNSCALE_MN_DS_MN_H_CFG(base), dsc->phase_step_h);
> > + ope_write_pp(ope, OPE_PP_CLC_DOWNSCALE_MN_DS_MN_V_CFG(base), dsc->phase_step_v);
> > + ope_write_pp(ope, OPE_PP_CLC_DOWNSCALE_MN_CFG(base),
> > + cfg ? OPE_PP_CLC_DOWNSCALE_MN_CFG_EN : 0);
> > + }
> > +}
>
> So - this is where the CDM should be used - so that you don't have to do
> all of these MMIO writes inside of your ISR.

Indeed, and that is also the reason stripes are computed ahead of time,
so that they can later be 'queued' in a CDM.

>
> Is that and additional step after the RFC ?

The current implementation (without CDM) already provides good results
and performance, so CDM can be viewed as a future enhancement.
As far as I understand, CDM could also be implemented in a generic way
within CAMSS, since other CAMSS blocks make use of CDM as well.
This is something we should discuss further.

>
> > +
> > +/*
> > + * mem2mem callbacks
> > + */
> > +static void ope_device_run(void *priv)
> > +{
> > + struct vb2_v4l2_buffer *src_buf, *dst_buf;
> > + struct ope_ctx *ctx = priv;
> > + struct ope_dev *ope = ctx->ope;
> > + dma_addr_t src, dst;
> > +
> > + dev_dbg(ope->dev, "Start context %p", ctx);
> > +
> > + src_buf = v4l2_m2m_next_src_buf(ctx->fh.m2m_ctx);
> > + if (!src_buf)
> > + return;
> > +
> > + dst_buf = v4l2_m2m_next_dst_buf(ctx->fh.m2m_ctx);
> > + if (!dst_buf)
> > + return;
> > +
> > + src = vb2_dma_contig_plane_dma_addr(&src_buf->vb2_buf, 0);
> > + dst = vb2_dma_contig_plane_dma_addr(&dst_buf->vb2_buf, 0);
> > +
> > + /* Generate stripes from full frame */
> > + ope_gen_stripes(ctx, src, dst);
> > +
> > + if (priv != ope->context) {
> > + /* If context changed, reprogram the submodules */
> > + ope_prog_wb(ope);
> > + ope_prog_bayer2rgb(ope);
> > + ope_prog_rgb2yuv(ope);
> > + ope->context = priv;
> > + }
> > +
> > + /* Program the first stripe */
> > + ope_prog_stripe(ctx, &ctx->stripe[0]);
> > +
> > + /* Go! */
> > + ope_start(ope);
> > +}
> > +
> > +static void ope_job_done(struct ope_ctx *ctx, enum vb2_buffer_state vbstate)
> > +{
> > + struct vb2_v4l2_buffer *src, *dst;
> > +
> > + if (!ctx)
> > + return;
> > +
> > + src = v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx);
> > + dst = v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx);
> > +
> > + if (dst && src)
> > + dst->vb2_buf.timestamp = src->vb2_buf.timestamp;
> > +
> > + if (src)
> > + v4l2_m2m_buf_done(src, vbstate);
> > + if (dst)
> > + v4l2_m2m_buf_done(dst, vbstate);
> > +
> > + v4l2_m2m_job_finish(ctx->ope->m2m_dev, ctx->fh.m2m_ctx);
> > +}
> > +
> > +static void ope_buf_done(struct ope_ctx *ctx)
> > +{
> > + struct ope_stripe *stripe = ope_current_stripe(ctx);
> > +
> > + if (!ctx)
> > + return;
> > +
> > + dev_dbg(ctx->ope->dev, "Context %p Stripe %u done\n",
> > + ctx, ope_stripe_index(ctx, stripe));
> > +
> > + if (ope_stripe_is_last(stripe)) {
> > + ctx->current_stripe = 0;
> > + ope_job_done(ctx, VB2_BUF_STATE_DONE);
> > + } else {
> > + ctx->current_stripe++;
> > + ope_start(ctx->ope);
> > + }
> > +}
> > +
> > +static void ope_job_abort(void *priv)
> > +{
> > + struct ope_ctx *ctx = priv;
> > +
> > + /* reset to abort */
> > + ope_write(ctx->ope, OPE_TOP_RESET_CMD, OPE_TOP_RESET_CMD_SW);
> > +}
>
> Shoudln't this wait for ope_job_done() ?

No, according to v4l2-mem2mem.h:
Informs the driver that it has to abort the currently
running transaction as soon as possible
[...]
This function does not have to (and will usually not) wait
until the device enters a state when it can be stopped.

> > +static irqreturn_t ope_irq(int irq, void *dev_id)
> > +{
> > + struct ope_dev *ope = dev_id;
> > + struct ope_ctx *ctx = ope->m2m_dev ? v4l2_m2m_get_curr_priv(ope->m2m_dev) : NULL;
>
> You have a mutex for this pointer but it doesn't seem to be in-use here
>
> Should this be a threadded IRQ with reference to that mutex then ?

We currently rely on the mem2mem framework to manage context
concurrency, and in particular to ensure that a context cannot be
released while an ope_job_done callback is still pending. This avoids
blocking on the global OPE mutex, which may be held for unrelated
operations such as creating another context.
However, there may still be unsafe paths, so an additional per-context
lock might be worth introducing.

> > +static int ope_enum_frameintervals(struct file *file, void *fh,
> > + struct v4l2_frmivalenum *fival)
> > +{
> > + fival->type = V4L2_FRMIVAL_TYPE_STEPWISE;
> > + fival->stepwise.min.numerator = 1;
> > + fival->stepwise.min.denominator = 120;
> > + fival->stepwise.max.numerator = 1;
> > + fival->stepwise.max.denominator = 1;
> > + fival->stepwise.step.numerator = 1;
> > + fival->stepwise.step.denominator = 1;
>
> fival->index should return -EINVAL for index > 0
>
> should also valid width and height and pixel format

Thanks, I will add that.

>
> > + return 0;
> > +}
> > +
> > +static int ope_s_ctrl(struct v4l2_ctrl *ctrl)
> > +{
> > + return -EINVAL;
> > +}
> > +
> > +static const struct v4l2_ctrl_ops ope_ctrl_ops = {
> > + .s_ctrl = ope_s_ctrl,
> > +};
>
> Eh - I think you can drop this ..

Indeed.

>
[...]