Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement dynamic byte-swizzle prototype #334

Merged
merged 2 commits into from
Apr 23, 2023

Conversation

workingjubilee
Copy link
Member

This is meant to be an example that is used to test a Rust intrinsic against, which will replace it. The interface is fairly direct and doesn't address more nuanced or interesting permutations one can do, nevermind on types other than bytes.

The ultimate goal is for direct LLVM support for this.

@workingjubilee workingjubilee force-pushed the implement-basic-dyn-swizzle branch 4 times, most recently from 4f76022 to 6efe4e1 Compare March 12, 2023 14:31
This is meant to be an example that is used to test
a Rust intrinsic against, which will replace it.
The interface is fairly direct and doesn't address
more nuanced or interesting permutations one can do,
nevermind on types other than bytes.

The ultimate goal is for direct LLVM support for this.
@workingjubilee
Copy link
Member Author

The API isn't perfect as we will want to figure out something so that, if we do indeed want to support this function beyond an N of 128~256, we can have this use 16-bit indices instead like RVV allows.

However, that's not necessarily the biggest concern, and this is too important as a functionality to worry about "what if it needs two functions, one for each index type?" Then it needs two functions! We will endure.

@calebzulawski
Copy link
Member

Are 16 bit indices as universal as 8 bit?

{
/// Swizzle a vector of bytes according to the index vector.
/// Indices within range select the appropriate byte.
/// Indices "out of bounds" instead select 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a note that this really needs build-std to work correctly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it kinda doesn't, does it? The vectors are generic, so this will get instantiated at compile time. What it needs is to be combined with target_feature configuration, either dynamic multiversioning or compile-time versioning or whatnot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cfgs will depend on the features std is built with, unfortunately

Copy link
Member Author

@workingjubilee workingjubilee Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh good point hmmmmmm...

...I guess I could make this dynamically multiversioned lol.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That part, at least, will be fixed upon promoting this into an intrinsic.

@workingjubilee
Copy link
Member Author

workingjubilee commented Mar 13, 2023

Hmm. Honestly, 16-bit indices are only used, afaik, by RISC-V's Vector extension. So I think saying "nah, use a target intrinsic for that" would be fair. It's mostly "if we find a way to seamlessly transition to larger indices, that would be cool".

@programmerjake
Copy link
Member

Hmm. Honestly, 16-bit indices are only used, afaik, by RISC-V's Vector extension. So I think saying "nah, use a target intrinsic for that" would be fair. It's mostly "if we find a way to seamlessly transition to larger indices, that would be cool".

afaict avx512 supports 16-bit indexes for vpermi2w and related operations.

SimpleV supports 8/16/32/64-bit indexes iirc.

@workingjubilee
Copy link
Member Author

workingjubilee commented Mar 13, 2023

The AVX512 instruction, VPERMI2{W,D,Q} doesn't really matter to the abstract operation we're defining because that instruction uses indices that have the same size as the type, because it overwrites the index vector with the results (subject to the mask). And that's to be expected for AVX512F because without the AVX512BW extension, you can't do byte-level operations at all. But we don't really care about that because destructive update on the indices is a pretty unusual pattern, a form using that instruction should be yielded by combining it with a masked store (or the intrinsic, obv), and what is actually relevant for what we're doing is whether a u8 will be big enough.

@workingjubilee workingjubilee force-pushed the implement-basic-dyn-swizzle branch from 6d3ed39 to 3072258 Compare March 17, 2023 20:42
Copy link
Member Author

@workingjubilee workingjubilee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As funny as it would be to package a CPUID implementation to handle the which-AVX-version stuff, I think I am gonna skip it for now. I could have completely skipped implementing this "up here" at the "tip", but I wanted to have full testing in our suite against proptest, first, which is very good at finding counterfactuals.

Comment on lines +82 to +86
// This is ordering sensitive, and LLVM will order these how you put them.
// Most AVX2 impls use ~5 "ports", and only 1 or 2 are capable of permutes.
// But the "compose" step will lower to ops that can also use at least 1 other port.
// So this tries to break up permutes so composition flows through "open" ports.
// Comparative benches should be done on multiple AVX2 CPUs before reordering this
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having all this commentary here isn't strictly necessary but I'm going to transplant more-or-less the same remarks into rustc (and maybe into LLVM???) later, so writing this down matters.

@workingjubilee workingjubilee merged commit 4f0d822 into master Apr 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants