-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement dynamic byte-swizzle prototype #334
Conversation
4f76022
to
6efe4e1
Compare
This is meant to be an example that is used to test a Rust intrinsic against, which will replace it. The interface is fairly direct and doesn't address more nuanced or interesting permutations one can do, nevermind on types other than bytes. The ultimate goal is for direct LLVM support for this.
6efe4e1
to
1ec7695
Compare
The API isn't perfect as we will want to figure out something so that, if we do indeed want to support this function beyond an N of 128~256, we can have this use 16-bit indices instead like RVV allows. However, that's not necessarily the biggest concern, and this is too important as a functionality to worry about "what if it needs two functions, one for each index type?" Then it needs two functions! We will endure. |
Are 16 bit indices as universal as 8 bit? |
{ | ||
/// Swizzle a vector of bytes according to the index vector. | ||
/// Indices within range select the appropriate byte. | ||
/// Indices "out of bounds" instead select 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a note that this really needs build-std
to work correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it kinda doesn't, does it? The vectors are generic, so this will get instantiated at compile time. What it needs is to be combined with target_feature
configuration, either dynamic multiversioning or compile-time versioning or whatnot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cfg
s will depend on the features std is built with, unfortunately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh good point hmmmmmm...
...I guess I could make this dynamically multiversioned lol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That part, at least, will be fixed upon promoting this into an intrinsic.
Hmm. Honestly, 16-bit indices are only used, afaik, by RISC-V's Vector extension. So I think saying "nah, use a target intrinsic for that" would be fair. It's mostly "if we find a way to seamlessly transition to larger indices, that would be cool". |
afaict avx512 supports 16-bit indexes for SimpleV supports 8/16/32/64-bit indexes iirc. |
The AVX512 instruction, VPERMI2{W,D,Q} doesn't really matter to the abstract operation we're defining because that instruction uses indices that have the same size as the type, because it overwrites the index vector with the results (subject to the mask). And that's to be expected for AVX512F because without the AVX512BW extension, you can't do byte-level operations at all. But we don't really care about that because destructive update on the indices is a pretty unusual pattern, a form using that instruction should be yielded by combining it with a masked store (or the intrinsic, obv), and what is actually relevant for what we're doing is whether a u8 will be big enough. |
6d3ed39
to
3072258
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As funny as it would be to package a CPUID implementation to handle the which-AVX-version stuff, I think I am gonna skip it for now. I could have completely skipped implementing this "up here" at the "tip", but I wanted to have full testing in our suite against proptest, first, which is very good at finding counterfactuals.
// This is ordering sensitive, and LLVM will order these how you put them. | ||
// Most AVX2 impls use ~5 "ports", and only 1 or 2 are capable of permutes. | ||
// But the "compose" step will lower to ops that can also use at least 1 other port. | ||
// So this tries to break up permutes so composition flows through "open" ports. | ||
// Comparative benches should be done on multiple AVX2 CPUs before reordering this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having all this commentary here isn't strictly necessary but I'm going to transplant more-or-less the same remarks into rustc (and maybe into LLVM???) later, so writing this down matters.
This is meant to be an example that is used to test a Rust intrinsic against, which will replace it. The interface is fairly direct and doesn't address more nuanced or interesting permutations one can do, nevermind on types other than bytes.
The ultimate goal is for direct LLVM support for this.