performance of saturating_mul can be improved by removing branches #65309
Comments
Edit: This comment is wrong. By the way, the "improved" version is:

example::saturating_sub:
        xor     eax, eax
        cmp     edi, esi
        setl    al
        add     eax, 2147483647
        sub     edi, esi
        cmovno  eax, edi
        ret

but I think this would be enough:

example::saturating_sub:
        xor     eax, eax
        sub     edi, esi
        setl    al
        add     eax, 2147483647
        cmovno  eax, edi
        ret

Edit: Oops, the second subtraction is required for the overflow flag that cmovno checks; the add clobbers the flags.
You can also use

    #[inline]
    const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
        cond as i32 * a + !cond as i32 * b
    }

which compiles to the same thing (actually it seems like it outperforms the bit mask one at -O1).
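For comparison, "the bit mask one" presumably refers to a select of roughly this shape; this is an assumption, the exact code is only on the Godbolt links:

    // Branchless select via a bitmask: `mask` is all ones when `cond` is
    // true and all zeros otherwise, so exactly one operand survives the OR.
    #[inline]
    const fn cond_if_else_mask(cond: bool, a: i32, b: i32) -> i32 {
        let mask = -(cond as i32);
        (mask & a) | (!mask & b)
    }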
Interesting, when compiling the …
Yeah, it seems the problem is the multiplications and additions, which cause Rust to emit overflow checks at -O0.
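If that matters, a variant using wrapping arithmetic avoids the debug-mode checks; this is my own illustration, not code from the thread:

    // Same select as cond_if_else, but with wrapping ops so no overflow
    // checks are emitted in debug builds. One operand of each multiply is
    // always 0 or 1, and one product is always 0, so wrapping never
    // actually changes the result.
    #[inline]
    const fn cond_if_else_wrapping(cond: bool, a: i32, b: i32) -> i32 {
        (cond as i32).wrapping_mul(a).wrapping_add((!cond as i32).wrapping_mul(b))
    }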
The branchless versions also vectorize much better: https://godbolt.org/z/URdJAl
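A loop of roughly this shape is the kind of harness where that difference shows up; this is a guess at the Godbolt setup, not a copy of it:

    // Element-wise saturating multiply over slices. With a branchless body
    // the loop can be auto-vectorized, while a branching body tends to
    // stay scalar.
    pub fn saturating_mul_slices(xs: &mut [i32], ys: &[i32]) {
        for (x, &y) in xs.iter_mut().zip(ys) {
            *x = x.saturating_mul(y);
        }
    }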
They seem to vectorize more, but the reciprocal throughput becomes worse. (I basically don't know anything about vectorization, so maybe rthroughput doesn't mean anything here, but maybe it does.)
It looks like the improvement can be obtained just by changing the conditional at line 1061 in 0221e26 to (a < 0) == (b < 0): https://godbolt.org/z/dezgpj. Going further and replacing unwrap_or_else with unwrap_or gives exactly the same assembly as the branchless version above: https://godbolt.org/z/B0Ipxw. So I think it would be worth changing that conditional even if we don't go all the way to const fn.
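Concretely, the suggested shape would be roughly the following, written here as a free function rather than the actual libcore method:

    // Overflow only happens for nonzero operands, so comparing the signs is
    // enough to pick the bound; unwrap_or turns the fallback into a plain
    // select instead of a closure call.
    pub fn saturating_mul(a: i32, b: i32) -> i32 {
        a.checked_mul(b)
            .unwrap_or(if (a < 0) == (b < 0) { i32::MAX } else { i32::MIN })
    }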
Sounds good, I'll open a PR to change the multiplication as suggested.
About the subtraction, LLVM is producing the following assembly for its example::saturating_sub:

        xor     eax, eax
        mov     ecx, edi
        sub     ecx, esi
        setns   al
        add     eax, 2147483647
        sub     edi, esi
        cmovno  eax, edi
        ret

I think this is suboptimal, and …
I believe none of these functions care about perf at O0 or O1 much, as stdlib is always compiled with full optimization and these methods are not generic. Consider also verifying how the code behaves when inlined in other contexts that do or do not provide sufficient information to LLVM to optimise out interesting parts (branches and expensive operations) of the function.
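One way to do that kind of check is with small wrappers that hand LLVM extra information, for example (my own illustration, not from the thread):

    // When both operands provably fit in 15 bits, the product cannot
    // overflow i32, so LLVM should be able to drop the saturating path
    // entirely after inlining; comparing this codegen against the
    // standalone function shows how much context matters.
    pub fn saturating_mul_small(a: i32, b: i32) -> i32 {
        (a & 0x7fff).saturating_mul(b & 0x7fff)
    }

    // A constant operand is another common inlining context worth checking.
    pub fn saturating_triple(x: i32) -> i32 {
        x.saturating_mul(3)
    }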
@nagisa Since the improvement in throughput I observed can be obtained by simply changing …
improve performance of signed saturating_mul
Reciprocal throughput is improved from 2.3 to 1.7. https://godbolt.org/z/ROMiX6 Fixes rust-lang#65309.
While playing with saturating_mul to see if it can be made branchless and thus a const fn (sorry @RalfJung ;) ), I found that the performance of signed saturating_mul can actually be improved. I tested two implementations, sketched below. The cond_if_else function acts kind of like C's ternary operator, including its constness, but works only for integers. The second case seems to have a reciprocal throughput better by a factor of about 1.5 according to llvm-mca: https://godbolt.org/z/6PnCwB
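A minimal sketch of the two implementations being compared, under the assumption that the first is the branchy checked_mul fallback libcore used at the time and the second is the cond_if_else-based branchless form (the exact code is only on the Godbolt link):

    #[inline]
    const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
        cond as i32 * a + !cond as i32 * b
    }

    // First case: branchy, the saturation bound is only computed on overflow.
    pub fn saturating_mul_branchy(a: i32, b: i32) -> i32 {
        a.checked_mul(b).unwrap_or_else(|| {
            if (a < 0 && b < 0) || (a > 0 && b > 0) { i32::MAX } else { i32::MIN }
        })
    }

    // Second case: branchless, both the bound and the final result are
    // picked with cond_if_else.
    pub fn saturating_mul_branchless(a: i32, b: i32) -> i32 {
        let (wrapped, overflowed) = a.overflowing_mul(b);
        let saturated = cond_if_else((a < 0) == (b < 0), i32::MAX, i32::MIN);
        cond_if_else(overflowed, saturated, wrapped)
    }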
For unsigned, the second case can become the form sketched below. In this case, there is no performance improvement and LLVM reduces this to the same IR, so the only advantage here would be constness. https://godbolt.org/z/0Fs0AD
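The unsigned variant was probably along these lines (an assumption based on the surrounding description):

    // Unsigned branchless saturating multiply: on overflow the subtraction
    // produces an all-ones mask, which ORs the wrapped product up to u32::MAX.
    pub fn saturating_mul_unsigned(a: u32, b: u32) -> u32 {
        let (wrapped, overflowed) = a.overflowing_mul(b);
        wrapped | 0u32.wrapping_sub(overflowed as u32)
    }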
I found some similar improvements for saturating_sub, even though currently the implementation uses intrinsics::saturating_sub; I found it a bit weird that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeE

My concern is that since the LLVM IR is different, even though llvm-mca indicates the difference is an improvement, there might be some case I don't know about where the performance regresses. Maybe there are other platforms where the saturating intrinsics result in better performance? I think @nikic has a better understanding of these concerns.