
[RVV] Add qu8-gemm/qu8-igemm kernels for rvv #8105

Open · wants to merge 2 commits into base: master
Conversation

ken-unger (Contributor)

  • Adding these QU8 kernels for completeness, although I recognize that QU8 will be deprecated at some point.
  • Small change to qs8-gemm/rvv.in for qu8 support; the generated qs8/qd8 kernels remain unchanged.
  • Unit tests pass. To execute them properly I used a local fix for "GEMM/IGEMM tests not working as expected for RVV?" #8096, as described there. I decided not to include that fix here, since it would really require re-running all gemm/igemm tests for all archs.
  • Performance of 4x4v and 7x4v is roughly equivalent on the test platform (K1) at ~15 G/s, so I've used 1x4v and 4x4v for the production config.
  • In future I'll take a look at why, for RVV, the generated gemm/igemm tests use the (mostly duplicate) CreateTests2, as it's not clear to me why that is required.

@dsharlet and @fbarchard please review when you are able. Thank you.

@fbarchard (Collaborator)

Looks good overall.

You call this m4, but the store is m1? In the gemm config you set NR to 4 * hardware_config->vlenb / sizeof(int32_t).
The elements are 8 bit and you store with m1, so it seems like it should be nr = hardware_config->vlenb / sizeof(uint8_t);

4x4 - typically on large convolutions it helps to be taller... the overread of reading the left side as bytes is amortized by applying the same weights to them.
I suggest running the gemm benchmark, looking for a large slow convolution, then filtering on that to find which gemm tile size is fastest.

You could do the quantization as uint8, like Arm does. Typically fewer registers are needed for 8 bit than for float, and/or min/max may be faster in 8 bit instead of float.

The gemm-config doesn't specify a packw kernel? Weird... so we are using the reference code.
The 8-bit packing for qu8 would be similar to 'x8', which can produce qs8, or x8, which doesn't care about sign and is used for f32_qc8w.
qu8 is not often used compared to qs8, and qd8 does have a packw kernel. But it uses a strided load, which is slow.
If that shows up high in profiles, it may be worth an optimization pass.

As you see, our qu8 requires a kernel zero point, which typically forces a 16-bit implementation; that is not ideal.
Another way to do it is an 8-bit multiply of the kernel value plus an 8-bit multiply of the zero point. But the zero point only needs to be applied per input, not per weight, so if you support a wider NR, the zero point amortizes. You can also keep another set of accumulators. This method lends itself to hardware 8-bit matrix multiply or dot product.
But it only helps qu8; qs8 doesn't have the kernel zero point.
On Arm we convert the kernel values to 16 bit because lanes only work on 16-bit values. But ideally you'd multiply 8-bit weights by inputs to get 16 bit, then do a widening accumulate.
Arm supports padal, which adds pairs of 16-bit values, sums and converts to 32 bit, then accumulates into a 32-bit accumulator. Does RISC-V have a padal equivalent? If so, the 16-bit to 32-bit step could make this a c2 (KR=2) kernel, processing 2 blocks at a time.

qu8_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(4)] = xnn_init_hmp_igemm_ukernel((xnn_igemm_ukernel_fn) xnn_qu8_igemm_minmax_fp32_ukernel_4x4v__rvv);
qu8_gemm_config.init.qu8 = xnn_init_qu8_conv_minmax_fp32_scalar_params;
qu8_gemm_config.mr = 4;
qu8_gemm_config.nr = 4 * hardware_config->vlenb / sizeof(int32_t);
@fbarchard (Collaborator)

qu8_gemm_config.nr = hardware_config->vlenb / sizeof(uint8_t);
The 4v kernel is m1, so NR = vlenb.

ken-unger (Contributor, Author)

Yes, I realized partway through this that I really should have changed the naming here to 1v rather than 4v, as you say. But I also realized I had done the same for the other Q gemm/igemm kernels... so I should probably go back and fix those in my next pass. Thanks for all the other notes above; I hope to get back to this in a couple of days.
