
[RVV] Add qu8-gemm/qu8-igemm kernels for rvv #8105

Open · wants to merge 2 commits into base: master
Conversation

ken-unger (Contributor)

  • Adding these QU8 kernels for completeness, although I recognize that QU8 will be deprecated at some point.
  • Small change to qs8-gemm/rvv.in for qu8 support; the generated qs8/qd8 kernels remain unchanged.
  • Unit tests pass. To execute them properly I used a local fix for "GEMM/IGEMM tests not working as expected for RVV?" #8096, as described there. I decided not to include that fix here, since it would really require re-running all gemm/igemm tests for all archs.
  • Performance of 4x4v and 7x4v is roughly equivalent on the test platform (K1) at ~15 G/s, so I've used 1x4v and 4x4v for the production config.
  • In future I'll take a look at why, for RVV, the generated gemm/igemm tests use the (mostly duplicate) CreateTests2, as it's not clear to me why that is required.

@dsharlet and @fbarchard please review when you are able. Thank you.

@fbarchard (Collaborator)

Looks good overall.

You call this m4, but the store is m1? In the gemm config you set NR to 4 * hardware_config->vlenb / sizeof(int32_t).
The elements are 8 bit and you store with m1, so it seems like it should be nr = hardware_config->vlenb / sizeof(uint8_t);

4x4 - typically on large convolutions it helps to be taller... the overread of reading the left side as bytes is amortized by applying the same weights to them.
I suggest running the gemm benchmark, looking for a large slow convolution, then filtering on that to find which gemm tile size is fastest.

You could do the quantization as uint8, like Arm does. Typically fewer registers are needed for 8 bit than for float, and/or min/max may be faster in 8 bit instead of float.

The gemm-config doesn't specify a packw kernel? Weird... so we are using the reference code.
The 8-bit packing for qu8 would be similar to 'x8', which can produce qs8, or x8, which doesn't care about sign and is used for f32_qc8w.
qu8 is not often used compared to qs8, and qd8 does have a packw kernel. But it uses a strided load, which is slow.
If that shows up high in profiles, it may be worth an optimization pass.

As you see, our qu8 requires a kernel zero point, which typically forces a 16-bit implementation; that is not ideal.
Another way to do it is an 8-bit multiply of the kernel value plus an 8-bit multiply of the zero point. But the zero point only needs to be applied per input, not per weight, so if you support a wider NR, the zero point amortizes. You can also keep another set of accumulators. This method lends itself to hardware 8-bit matrix multiply or dot product.
But it only helps qu8; qs8 doesn't have the kernel zero point.
On Arm we convert the kernel values to 16 bit because lanes only work on 16-bit values. But ideally you'd multiply 8-bit weights by inputs to get 16 bit, then do a widening accumulate.
Arm supports padal, which adds pairs of 16-bit values, sums and converts to 32 bit, then accumulates into a 32-bit accumulator. Does RISC-V have a padal equivalent? If so, the 16-bit to 32-bit step could make this a c2 (KR=2) kernel, processing 2 blocks at a time.

qu8_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(4)] = xnn_init_hmp_igemm_ukernel((xnn_igemm_ukernel_fn) xnn_qu8_igemm_minmax_fp32_ukernel_4x4v__rvv);
qu8_gemm_config.init.qu8 = xnn_init_qu8_conv_minmax_fp32_scalar_params;
qu8_gemm_config.mr = 4;
qu8_gemm_config.nr = 4 * hardware_config->vlenb / sizeof(int32_t);
@fbarchard (Collaborator)

qu8_gemm_config.nr = hardware_config->vlenb / sizeof(uint8_t);
The 4v kernel is m1, so NR = vlenb.

ken-unger (Contributor, Author)

Yes, I realized partway through this that I really should have changed the naming here to 1v rather than 4v, as you say. But I also realized I had done the same for the other Q gemm/igemm kernels... so I should probably go back and fix those in my next pass. Thanks for all the other notes above; I hope to get back to this in a couple of days.
