performance of saturating_mul can be improved by removing branches #65309

Closed
tspiteri opened this issue Oct 11, 2019 · 12 comments · Fixed by #65312
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug Category: This is a bug.
I-slow Issue: Problems and improvements with respect to performance of generated code.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@tspiteri
Contributor

tspiteri commented Oct 11, 2019

While playing with saturating_mul to see if it can be made branchless and thus a const fn (sorry @RalfJung ;) ), I found that the performance of signed saturating_mul can actually be improved.

I tested two implementations:

  1. pub fn saturating_mul(a: i32, b: i32) -> i32 {
        a.saturating_mul(b)
    }
  2. #[inline]
    const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
        // If cond is false: not_mask is -1 == all ones
        // If cond is true: not_mask is 0 == all zeros
        let not_mask = (cond as i32).wrapping_sub(1);
        // If cond is false: (a & !all_ones) | (b & all_ones) == b
        // If cond is true: (a & !all_zeros) | (b & all_zeros) == a
        (a & !not_mask) | (b & not_mask)
    }
    
    pub const fn saturating_mul(a: i32, b: i32) -> i32 {
        let (val, overflow) = a.overflowing_mul(b);
        cond_if_else(
            !overflow,
            val,
            cond_if_else(
                (a < 0) != (b < 0),
                i32::min_value(),
                i32::max_value(),
            ),
        )
    }

The cond_if_else function acts kind of like C's ternary operator, including its constness, but works only for integers.

According to llvm-mca, the second case has a reciprocal throughput that is better by a factor of about 1.5: https://godbolt.org/z/6PnCwB
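
For reference, here is a quick equivalence check of the branchless version against the standard library method (a test sketch added for illustration, not part of the original report; saturating_mul refers to the const fn from case 2):

#[test]
fn branchless_matches_std() {
    let cases = [0, 1, -1, 7, -7, i32::max_value(), i32::min_value()];
    for &a in &cases {
        for &b in &cases {
            assert_eq!(saturating_mul(a, b), a.saturating_mul(b));
        }
    }
}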

For unsigned, the second case can become:

pub const fn saturating_mul(a: u32, b: u32) -> u32 {
    let (val, overflow) = a.overflowing_mul(b);
    cond_if_else(!overflow, val, u32::max_value())
}

In this case, there is no performance improvement and LLVM reduces this to the same IR, so the only advantage here would be constness. https://godbolt.org/z/0Fs0AD

I found some similar improvements for saturating_sub, even though the implementation currently uses intrinsics::saturating_sub; I found it a bit odd that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeE
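
For reference, a branchless signed saturating_sub along the same lines might look like this (my sketch reusing the cond_if_else helper from case 2 above, not necessarily the exact code behind the godbolt link):

pub const fn saturating_sub(a: i32, b: i32) -> i32 {
    let (val, overflow) = a.overflowing_sub(b);
    cond_if_else(
        !overflow,
        val,
        // Overflow requires operands of different signs: a negative minus a
        // non-negative overflows toward MIN, a non-negative minus a negative
        // overflows toward MAX.
        cond_if_else(a < 0, i32::min_value(), i32::max_value()),
    )
}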

My concern is that since the LLVM IR is different, even though llvm-mca indicates the difference is an improvement, there might be some case I don't know about where the performance regresses. Maybe there are other platforms where the saturating intrinsics result in better performance? I think @nikic has a better understanding of these concerns.

@jonas-schievink jonas-schievink added the A-LLVM, C-bug, I-slow, T-compiler labels Oct 11, 2019
@tspiteri
Contributor Author

tspiteri commented Oct 11, 2019

Edit: This comment is wrong.

By the way, the "improved" saturating_sub above can be further improved. LLVM produces

example::saturating_sub:
        xor     eax, eax
        cmp     edi, esi
        setl    al
        add     eax, 2147483647
        sub     edi, esi
        cmovno  eax, edi
        ret

but I think cmp can be changed to sub immediately, as the next time edi is used it is to do the subtraction anyway:

example::saturating_sub:
        xor     eax, eax
        sub     edi, esi
        setl    al
        add     eax, 2147483647
        cmovno  eax, edi
        ret

Edit: Oops, the second subtraction is required for the cmovno instruction.

@CryZe
Contributor

CryZe commented Oct 11, 2019

You can also use

#[inline]
const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
    // cond == true:  1 * a + 0 * b == a
    // cond == false: 0 * a + 1 * b == b
    cond as i32 * a + !cond as i32 * b
}

which compiles to the same thing (actually it seems like it outperforms the bit-mask one at -O1).

@tspiteri
Contributor Author

Interesting. When compiling the cond_if_else function itself at -O0, the masks version performs better (masks 7, mul 9); at -O1, LLVM recognizes the multiplication pattern, so that version performs better (masks 5.2, mul 1); and at -O2, LLVM recognizes both patterns, so both have a reciprocal throughput of 1.

@CryZe
Contributor

CryZe commented Oct 11, 2019

Yeah, it seems the problem is the multiplications and additions, which cause Rust to emit overflow checks at -O0.
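
If the -O0 overflow checks are the concern, one way to avoid them (a sketch along the same lines, not taken from the thread) is to spell out the wrapping operations:

#[inline]
const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
    // Same select-by-multiplication idea, written with wrapping operations so
    // that no overflow checks are emitted in debug builds. Each product is
    // either 0 or the operand itself, so wrapping never actually occurs.
    (cond as i32)
        .wrapping_mul(a)
        .wrapping_add((!cond as i32).wrapping_mul(b))
}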

@CryZe
Contributor

CryZe commented Oct 11, 2019

The branchless versions also vectorize much better: https://godbolt.org/z/URdJAl
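
A harness along these lines (a hypothetical sketch, not necessarily what the godbolt link compiles) is the kind of loop to inspect for auto-vectorization, swapping the body between the std method and the branchless version:

pub fn saturating_mul_slices(out: &mut [i32], a: &[i32], b: &[i32]) {
    // Element-wise saturating multiplication over slices.
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x.saturating_mul(y);
    }
}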

@tspiteri
Contributor Author

The branchless versions also vectorize much better

They seem to vectorize more, but the reciprocal throughput becomes worse. (I basically don't know anything about vectorization, so maybe rthroughput doesn't mean anything here, but maybe it does.)

@ollie27
Member

ollie27 commented Oct 11, 2019

It looks like the improvement to saturating_mul actually comes from replacing (a < 0 && b < 0) || (a > 0 && b > 0), i.e. this line in the current implementation:

if (self < 0 && rhs < 0) || (self > 0 && rhs > 0) {

with (a < 0) == (b < 0): https://godbolt.org/z/dezgpj. Going further and replacing unwrap_or_else with unwrap_or gives exactly the same assembly as the branchless version above: https://godbolt.org/z/B0Ipxw. So I think it would be worth changing that conditional even if we don't go all the way to const.
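
Written as a free function over i32, the suggested rewrite would look roughly like this (a sketch; the real method in libcore is generated by a macro over the signed integer types):

pub fn saturating_mul(a: i32, b: i32) -> i32 {
    a.checked_mul(b).unwrap_or(if (a < 0) == (b < 0) {
        // Overflow with operands of equal sign means the true product is
        // positive, so saturate high; otherwise saturate low.
        i32::max_value()
    } else {
        i32::min_value()
    })
}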

@tspiteri
Contributor Author

Sounds good, I'll open a PR to change the multiplication as suggested.

@tspiteri
Contributor Author

About the subtraction, LLVM is producing the following assembly for its llvm.ssub.sat.i32 intrinsic:

example::saturating_sub:
        xor     eax, eax
        mov     ecx, edi
        sub     ecx, esi
        setns   al
        add     eax, 2147483647
        sub     edi, esi
        cmovno  eax, edi
        ret

I think this is suboptimal, and mov ecx, edi followed by sub ecx, esi should be replaced by cmp edi, esi. I'll open an issue in LLVM.

@tspiteri
Contributor Author

@nagisa
Member

nagisa commented Oct 11, 2019

I believe perf at -O0 or -O1 doesn't matter much for any of these functions, as the stdlib is always compiled with full optimizations and these methods are not generic.

Consider also verifying how the code behaves when inlined in other contexts that do or do not provide sufficient information to LLVM to optimise out interesting parts (branches and expensive operations) of the function.
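
For example (hypothetical callers, not from the thread), contexts like these, where LLVM knows something about the operands, are worth checking for both implementations:

// One operand is a compile-time constant.
pub fn times_three(x: i32) -> i32 {
    x.saturating_mul(3)
}

// Both operands are masked to small ranges, so the multiplication can never
// overflow and the saturation path is dead.
pub fn small_product(a: i32, b: i32) -> i32 {
    (a & 0xFF).saturating_mul(b & 0xFF)
}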

@tspiteri
Contributor Author

@nagisa Since the improvement in throughput I observed can be obtained by simply changing (a < 0 && b < 0) || (a > 0 && b > 0) to (a < 0) == (b < 0), I opened a PR with just that change, which I think is quite safe and doesn't leave much room for adverse effects.

Centril added a commit to Centril/rust that referenced this issue Oct 13, 2019
improve performance of signed saturating_mul

Reciprocal throughput is improved from 2.3 to 1.7. https://godbolt.org/z/ROMiX6

Fixes rust-lang#65309.
@bors bors closed this as completed in f0f5e77 Oct 13, 2019