Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation

From: Christoph Hellwig

Date: Tue Mar 24 2026 - 03:49:47 EST


Hi Damian,

thanks for looking into this.

I've added the arm64 maintainers and arm list as that's your best bet
for someone actually understanding the low-level assembly code.

On Wed, Mar 18, 2026 at 03:02:45PM +0000, Demian Shulhan wrote:
> Note that for the recovery path (`xor_syndrome`), NEON may still be
> selected dynamically by the algorithm benchmark, as the recovery
> workload is heavily memory-bound.

The recovery side has no benchmarking; you need to manually select
a priority.

Note that I just sent out a "cleanup the RAID6 P/Q library" series that
makes this more explicit. It also makes it clear that, by prioritizing
implementations that use the better instructions available, we can
shortcut the generation-side probing path a lot, which might be worth
looking into for this.

I'm also curious whether you looked into why the 4x unroll is slower
than the smaller unroll, and whether that is inherent in the
implementation or just an effect of the small number of disks, i.e.
that we don't actually have 4 disks to unroll over in every iteration.
What would the numbers be if RAID6_TEST_DISKS was increased to 10 or 18?

I plan to potentially select the unrolling variant based on the
number of "disks" to calculate over as a follow-on.
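To sketch what I mean (function name and thresholds are entirely made
up here; the real selection would live in the raid6 algorithm probing
code):

```c
#include <assert.h>

/*
 * Hypothetical sketch only: pick an unroll factor for the syndrome
 * loop from the number of data disks.  A 4x unroll only pays off when
 * every pass actually covers 4 disks.
 */
static int raid6_pick_unroll(int data_disks)
{
	if (data_disks >= 4 && data_disks % 4 == 0)
		return 4;
	if (data_disks >= 2 && data_disks % 2 == 0)
		return 2;
	return 1;
}
```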

We'll have to wait for review on my series, but I'd love to just rebase
this on top if possible. I can offer to do the work, but I'd need to
run it past you for testing and final review.

> +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)

Overly long line.

> +{
> + u8 **dptr = (u8 **)ptrs;
> + u8 *p, *q;
> + long z0 = disks - 3;
> +
> + p = dptr[z0 + 1];
> + q = dptr[z0 + 2];

I know all this is derived from existing code, but as I've started to
hate that style I'll add my cosmetic comments:

This would read nicer by initializing at declaration time:

u8 **dptr = (u8 **)ptrs;
long z0 = disks - 3;
u8 *p = dptr[z0 + 1];
u8 *q = dptr[z0 + 2];

Also z0 might better be named z_last or last_disk, or stop as in the
_xor variant routines.

> + asm volatile(

But I wonder if just implementing the entire routine as assembly in a
.S file would make more sense than this anyway?

> +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> + unsigned long bytes, void **ptrs)
> +{
> + u8 **dptr = (u8 **)ptrs;
> + u8 *p, *q;
> + long z0 = stop;
> +
> + p = dptr[disks - 2];
> + q = dptr[disks - 1];
> +
> + asm volatile(

Same comments here, plus just dropping z0 vs using stop directly.

> +#define RAID6_SVE_WRAPPER(_n) \
> + static void raid6_sve ## _n ## _gen_syndrome(int disks, \
> + size_t bytes, void **ptrs) \
> + { \
> + scoped_ksimd() \
> + raid6_sve ## _n ## _gen_syndrome_real(disks, \
> + (unsigned long)bytes, ptrs); \

Missing indentation after the scoped_ksimd(). A lot of other code uses
separate compilation units for the SIMD code, which seems pretty useful
to avoid mixing SIMD with non-SIMD code. That would also combine nicely
with the suggestion above to implement the low-level routines entirely
in assembly.
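
For reference, the wrapper with the indentation fixed.  The
scoped_ksimd() below is a trivial userspace stand-in I made up so the
example is self-contained; the real macro brackets the following
statement with the kernel SIMD begin/end calls:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's scoped_ksimd(): runs the following
 * statement exactly once with simd_depth raised, mimicking the scoped
 * guard behaviour for illustration only. */
static int simd_depth;
#define scoped_ksimd()							\
	for (int _once = (simd_depth++, 1); _once;			\
	     _once = (simd_depth--, 0))

static int real_calls;
static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes,
					 void **ptrs)
{
	(void)disks; (void)bytes; (void)ptrs;
	assert(simd_depth == 1);	/* called inside the SIMD section */
	real_calls++;
}

/* The wrapper, with the call indented one level under scoped_ksimd(). */
#define RAID6_SVE_WRAPPER(_n)						\
	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
			size_t bytes, void **ptrs)			\
	{								\
		scoped_ksimd()						\
			raid6_sve ## _n ## _gen_syndrome_real(disks,	\
					(unsigned long)bytes, ptrs);	\
	}

RAID6_SVE_WRAPPER(1)
```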

I'll leave comments on the actual assembly details to people that
actually know ARM64 assembly.