performance of saturating_mul can be improved by removing branches #65309

Closed
tspiteri opened this issue Oct 11, 2019 · 12 comments · Fixed by #65312
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug Category: This is a bug.
I-slow Issue: Problems and improvements with respect to performance of generated code.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@tspiteri
Contributor

tspiteri commented Oct 11, 2019

While playing with saturating_mul to see if it can be made branchless and thus a const fn (sorry @RalfJung ;) ), I found that the performance of signed saturating_mul can actually be improved.

I tested two implementations:

  1. pub fn saturating_mul(a: i32, b: i32) -> i32 {
        a.saturating_mul(b)
    }
  2. #[inline]
    const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
        // If cond is false: not_mask is -1 == all ones
        // If cond is true: not_mask is 0 == all zeros
        let not_mask = (cond as i32).wrapping_sub(1);
        // If cond is false: (a & !all_ones) | (b & all_ones) == b
        // If cond is true: (a & !all_zeros) | (b & all_zeros) == a
        (a & !not_mask) | (b & not_mask)
    }
    
    pub const fn saturating_mul(a: i32, b: i32) -> i32 {
        let (val, overflow) = a.overflowing_mul(b);
        cond_if_else(
            !overflow,
            val,
            cond_if_else(
                (a < 0) != (b < 0),
                i32::min_value(),
                i32::max_value(),
            ),
        )
    }

The cond_if_else function acts kind of like C's ternary operator, including its constness, but works only for integers.

According to llvm-mca, the second case has a reciprocal throughput that is better by a factor of about 1.5: https://godbolt.org/z/6PnCwB
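
For reference, here is a quick equivalence check of the branchless version against the standard library method (a test sketch added for illustration, not part of the original report; saturating_mul refers to the const fn from case 2):

#[test]
fn branchless_matches_std() {
    let cases = [0, 1, -1, 7, -7, i32::max_value(), i32::min_value()];
    for &a in &cases {
        for &b in &cases {
            assert_eq!(saturating_mul(a, b), a.saturating_mul(b));
        }
    }
}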

For unsigned, the second case can become:

pub const fn saturating_mul(a: u32, b: u32) -> u32 {
    let (val, overflow) = a.overflowing_mul(b);
    cond_if_else(!overflow, val, u32::max_value())
}

In this case, there is no performance improvement and LLVM reduces this to the same IR, so the only advantage here would be constness. https://godbolt.org/z/0Fs0AD

I found some similar improvements for saturating_sub, even though the implementation currently uses intrinsics::saturating_sub; I found it a bit odd that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeE
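
For reference, a branchless signed saturating_sub along the same lines might look like this (my sketch reusing the cond_if_else helper from case 2 above, not necessarily the exact code behind the godbolt link):

pub const fn saturating_sub(a: i32, b: i32) -> i32 {
    let (val, overflow) = a.overflowing_sub(b);
    cond_if_else(
        !overflow,
        val,
        // Overflow requires operands of different signs: a negative minus a
        // non-negative overflows toward MIN, a non-negative minus a negative
        // overflows toward MAX.
        cond_if_else(a < 0, i32::min_value(), i32::max_value()),
    )
}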

My concern is that since the LLVM IR is different, even though llvm-mca indicates the difference is an improvement, there might be some case I don't know about where the performance regresses. Maybe there are other platforms where the saturating intrinsics result in better performance? I think @nikic has a better understanding of these concerns.

@jonas-schievink jonas-schievink added the A-LLVM, C-bug, I-slow, T-compiler labels Oct 11, 2019
@tspiteri
Contributor Author

tspiteri commented Oct 11, 2019

Edit: This comment is wrong.

By the way, the "improved" saturating_sub above can be further improved. LLVM produces

example::saturating_sub:
        xor     eax, eax
        cmp     edi, esi
        setl    al
        add     eax, 2147483647
        sub     edi, esi
        cmovno  eax, edi
        ret

but I think cmp can be changed to sub immediately, as the next time edi is used it is to do the subtraction anyway:

example::saturating_sub:
        xor     eax, eax
        sub     edi, esi
        setl    al
        add     eax, 2147483647
        cmovno  eax, edi
        ret

Edit: Oops, the second subtraction is required for the cmovno instruction.

@CryZe
Contributor

CryZe commented Oct 11, 2019

You can also use

#[inline]
const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
    // cond == true:  1 * a + 0 * b == a
    // cond == false: 0 * a + 1 * b == b
    cond as i32 * a + !cond as i32 * b
}

which compiles to the same thing (actually it seems like it outperforms the bit-mask one at -O1).

@tspiteri
Contributor Author

Interesting. When compiling the cond_if_else function itself at -O0, the masks version performs better (masks 7, mul 9); at -O1, LLVM recognizes the multiplication pattern, so that version performs better (masks 5.2, mul 1); and at -O2, LLVM recognizes both patterns, so both have a reciprocal throughput of 1.

@CryZe
Contributor

CryZe commented Oct 11, 2019

Yeah, it seems the problem is the multiplications and additions, which cause Rust to emit overflow checks at -O0.
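
If the -O0 overflow checks are the concern, one way to avoid them (a sketch along the same lines, not taken from the thread) is to spell out the wrapping operations:

#[inline]
const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
    // Same select-by-multiplication idea, written with wrapping operations so
    // that no overflow checks are emitted in debug builds. Each product is
    // either 0 or the operand itself, so wrapping never actually occurs.
    (cond as i32)
        .wrapping_mul(a)
        .wrapping_add((!cond as i32).wrapping_mul(b))
}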

@CryZe
Contributor

CryZe commented Oct 11, 2019

The branchless versions also vectorize much better: https://godbolt.org/z/URdJAl
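
A harness along these lines (a hypothetical sketch, not necessarily what the godbolt link compiles) is the kind of loop to inspect for auto-vectorization, swapping the body between the std method and the branchless version:

pub fn saturating_mul_slices(out: &mut [i32], a: &[i32], b: &[i32]) {
    // Element-wise saturating multiplication over slices.
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x.saturating_mul(y);
    }
}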

@tspiteri
Contributor Author

The branchless versions also vectorize much better

They seem to vectorize more, but the reciprocal throughput becomes worse. (I basically don't know anything about vectorization, so maybe rthroughput doesn't mean anything here, but maybe it does.)

@ollie27
Member

ollie27 commented Oct 11, 2019

It looks like the improvement to saturating_mul actually comes from replacing (a < 0 && b < 0) || (a > 0 && b > 0), i.e. this line in the current implementation:

if (self < 0 && rhs < 0) || (self > 0 && rhs > 0) {

with (a < 0) == (b < 0): https://godbolt.org/z/dezgpj. Going further and replacing unwrap_or_else with unwrap_or gives exactly the same assembly as the branchless version above: https://godbolt.org/z/B0Ipxw. So I think it would be worth changing that conditional even if we don't go all the way to const.
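
Written as a free function over i32, the suggested rewrite would look roughly like this (a sketch; the real method in libcore is generated by a macro over the signed integer types):

pub fn saturating_mul(a: i32, b: i32) -> i32 {
    a.checked_mul(b).unwrap_or(if (a < 0) == (b < 0) {
        // Overflow with operands of equal sign means the true product is
        // positive, so saturate high; otherwise saturate low.
        i32::max_value()
    } else {
        i32::min_value()
    })
}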

@tspiteri
Contributor Author

Sounds good, I'll open a PR to change the multiplication as suggested.

@tspiteri
Contributor Author

About the subtraction, LLVM is producing the following assembly for its llvm.ssub.sat.i32 intrinsic:

example::saturating_sub:
        xor     eax, eax
        mov     ecx, edi
        sub     ecx, esi
        setns   al
        add     eax, 2147483647
        sub     edi, esi
        cmovno  eax, edi
        ret

I think this is suboptimal, and mov ecx, edi followed by sub ecx, esi should be replaced by cmp edi, esi. I'll open an issue in LLVM.

@tspiteri
Contributor Author

@nagisa
Member

nagisa commented Oct 11, 2019

I believe perf at -O0 or -O1 doesn't matter much for any of these functions, as the stdlib is always compiled with full optimizations and these methods are not generic.

Consider also verifying how the code behaves when inlined in other contexts that do or do not provide sufficient information to LLVM to optimise out interesting parts (branches and expensive operations) of the function.
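
For example (hypothetical callers, not from the thread), contexts like these, where LLVM knows something about the operands, are worth checking for both implementations:

// One operand is a compile-time constant.
pub fn times_three(x: i32) -> i32 {
    x.saturating_mul(3)
}

// Both operands are masked to small ranges, so the multiplication can never
// overflow and the saturation path is dead.
pub fn small_product(a: i32, b: i32) -> i32 {
    (a & 0xFF).saturating_mul(b & 0xFF)
}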

@tspiteri
Contributor Author

@nagisa Since the improvement in throughput I observed can be obtained by simply changing (a < 0 && b < 0) || (a > 0 && b > 0) to (a < 0) == (b < 0), I opened a PR with just that change, which I think is quite safe and doesn't leave much room for adverse effects.

Centril added a commit to Centril/rust that referenced this issue Oct 13, 2019
improve performance of signed saturating_mul

Reciprocal throughput is improved from 2.3 to 1.7. https://godbolt.org/z/ROMiX6

Fixes rust-lang#65309.
@bors bors closed this as completed in f0f5e77 Oct 13, 2019