
@inbounds slowing down a loop #26261

Closed
staticfloat opened this issue Feb 28, 2018 · 9 comments
Labels
performance Must go faster

Comments

@staticfloat
Member

staticfloat commented Feb 28, 2018

Some of the first Julia code I ever wrote was a xorshift1024* implementation, written to compare its speed against our native rand() code. I was looking at it again and decided to sprinkle some of the new Julia magic speed dust on it, such as @inbounds, and was surprised to find that it actually made the whole thing slower.

Function in question:

function xorshift1024(N, s)
    x = Array{UInt64}(undef, N)
    p = 0
    for i = 1:N
        # One-based indexing strikes again
        s0 = s[p + 1]
        p = (p + 1)%16
        s1 = s[p + 1]

        # XOR block
        s1 = xor(s1, s1 << 31)
        s1 = xor(s1, s1 >> 11)
        s0 = xor(s0, s0 >> 30)
        s[p + 1] = xor(s0, s1)

        # Multiply by magic constant
        x[i] = s[p + 1] * 1181783497276652981
    end
    return x
end

I perform the allocation within xorshift1024() so as to better compare against rand(UInt64, N). The comparison itself is done here:

using BenchmarkTools

# Collect some performance data
for N in round.(Int64, 2 .^range(12, 28, step=4))
    println("Testing generation of $N 64-bit numbers")
    s = rand(UInt64, 16)

    # compare xorshift against rand()
    @btime xorshift1024($N, $s)
    @btime rand(UInt64, $N)
end

Running this code gives:

Testing generation of 4096 64-bit numbers
  9.133 μs (2 allocations: 32.08 KiB)
  4.576 μs (4 allocations: 32.23 KiB)
Testing generation of 65536 64-bit numbers
  144.154 μs (2 allocations: 512.08 KiB)
  79.739 μs (4 allocations: 512.23 KiB)
Testing generation of 1048576 64-bit numbers
  2.315 ms (2 allocations: 8.00 MiB)
  2.118 ms (4 allocations: 8.00 MiB)
Testing generation of 16777216 64-bit numbers
  47.408 ms (2 allocations: 128.00 MiB)
  42.937 ms (4 allocations: 128.00 MiB)
Testing generation of 268435456 64-bit numbers
  755.848 ms (2 allocations: 2.00 GiB)
  700.324 ms (4 allocations: 2.00 GiB)

That is, there is ~2x overhead for small N, but as N increases the two slowly converge. However, if I introduce @inbounds on the for loop within the function, I get:

Testing generation of 4096 64-bit numbers
  12.178 μs (2 allocations: 32.08 KiB)
  4.514 μs (4 allocations: 32.23 KiB)
Testing generation of 65536 64-bit numbers
  193.276 μs (2 allocations: 512.08 KiB)
  81.243 μs (4 allocations: 512.23 KiB)
Testing generation of 1048576 64-bit numbers
  3.094 ms (2 allocations: 8.00 MiB)
  2.165 ms (4 allocations: 8.00 MiB)
Testing generation of 16777216 64-bit numbers
  60.044 ms (2 allocations: 128.00 MiB)
  43.552 ms (4 allocations: 128.00 MiB)
Testing generation of 268435456 64-bit numbers
  965.672 ms (2 allocations: 2.00 GiB)
  700.540 ms (4 allocations: 2.00 GiB)

This is a surprisingly large performance drop, in my opinion.
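
For reference, the slower variant is just the same loop with @inbounds applied to the whole body; roughly (shown under a separate, hypothetical name so the two versions can be benchmarked side by side):

function xorshift1024_inbounds(N, s)
    x = Array{UInt64}(undef, N)
    p = 0
    # Same generator as above, but with bounds checks elided inside the hot loop
    @inbounds for i = 1:N
        s0 = s[p + 1]
        p = (p + 1)%16
        s1 = s[p + 1]

        # XOR block
        s1 = xor(s1, s1 << 31)
        s1 = xor(s1, s1 >> 11)
        s0 = xor(s0, s0 >> 30)
        s[p + 1] = xor(s0, s1)

        # Multiply by magic constant
        x[i] = s[p + 1] * 1181783497276652981
    end
    return x
end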

@staticfloat staticfloat added the performance Must go faster label Feb 28, 2018
@tshort
Contributor

tshort commented Mar 1, 2018

This seems to be system dependent. On my Windows laptop, I see the same slowdown you show here. On my Linux workstation, the @inbounds version is 10% faster (-O3 didn't make a difference).

@samoconnor
Contributor

Could it be that, without @inbounds, the bounds check is:
A) optimised away anyway because the function is quite simple, and
B) giving the compiler (LLVM?) some extra information that allows it to do better optimisation?

What does code_native say?
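
For example, one way to compare the generated code for the two versions (a sketch; assumes the xorshift1024 function and seed state from above):

using InteractiveUtils   # provides @code_native on Julia 0.7+

s = rand(UInt64, 16)

# Dump the native code for the benchmarked call signature; diffing the output
# with and without @inbounds in the loop should show where the two versions diverge.
@code_native xorshift1024(4096, s)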

@StefanKarpinski
Member

It would be interesting if we could separate the informative part of bounds checks from the error part: i.e. allow high-level code to communicate guarantees that it believes must hold, even when we turn off the checks that they do in fact hold. That could, of course, lead to undefined, dangerous behavior if the claimed facts are not true, but that's always the case with incorrect usage of @inbounds, so I don't think it's any worse than what we do now.
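
One hypothetical way to express just the informative half today is to emit an llvm.assume directly; a rough sketch (not an existing Base API, and just as dangerous as a wrong @inbounds if the condition is ever false):

# Hypothetical sketch: assert a fact to the optimizer without emitting an error branch.
# If the condition does not actually hold, behavior is undefined.
@inline function assume(cond::Bool)
    Base.llvmcall(
        ("declare void @llvm.assume(i1)",
         """
         %cond = trunc i8 %0 to i1
         call void @llvm.assume(i1 %cond)
         ret void
         """),
        Cvoid, Tuple{Bool}, cond)
end

# e.g. communicate the index invariant to the compiler ahead of the loop body:
# assume(0 <= p < 16)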

@samoconnor
Contributor

If the bounds assertion has the right branch-prediction hints, its cost should be close to nothing (assuming that fetching the bounds is cheap and is not done every time around the loop).
See __builtin_expect branch-prediction hint: http://llvm.org/docs/LangRef.html#llvm-expect-intrinsic
Related: #15495 (comment) #15495 (comment)

@samoconnor
Contributor

I've found (mostly working with C code) that a liberal sprinkling of assert generally makes things faster, and that the assumption that "we'll turn all the asserts off before we ship" is generally misguided. If the asserts are not doing algorithmically complex things inside loops, they usually seem to have negligible overhead (and often help as hints to the compiler).
Of course, if you have an assertion that walks an entire structure looking for broken references, that needs to be turned off in production, so you need a separate debug_assert macro for that.
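
A minimal sketch of what such a debug-only assertion could look like in Julia (hypothetical names; the flag is assumed to be a compile-time constant):

# Hypothetical sketch: assertions that vanish at macro-expansion time unless enabled,
# so expensive structural checks cost nothing in production builds.
const ENABLE_DEBUG_ASSERTS = false

macro debug_assert(cond, msg="debug assertion failed")
    ENABLE_DEBUG_ASSERTS || return nothing      # expands to a no-op when disabled
    return :( $(esc(cond)) || error($(esc(msg))) )
end

# Usage: the walk-the-whole-structure check only runs when the flag is enabled.
# @debug_assert all(r -> r !== nothing, references)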

@nalimilan
Member

The problem with bounds checks is that they generally/currently prevent some optimizations (e.g. #21402).

@vchuravy
Member

vchuravy commented Mar 2, 2018

A cursory glance at the assembly shows that the difference is in the remainder calculation, which LLVM optimises differently in the two versions.

Replacing (p+1) % 16 with either p = ((p % UInt64 + 1) % 16) % Int64 or (p+1) & (16-1) removes the performance difference and speeds up the code overall.
I don't know why @inbounds makes that difference.
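
For illustration, the two suggested rewrites of the index update (hypothetical helper names; both agree with (p + 1) % 16 for p in 0:15):

# Sketch of the two index updates that avoid the slow signed-remainder lowering.
next_index_unsigned(p) = ((p % UInt64 + 1) % 16) % Int64   # unsigned remainder, converted back
next_index_masked(p)   = (p + 1) & (16 - 1)                # power-of-two mask, no division

# Sanity check against the original expression:
@assert all(next_index_unsigned(p) == (p + 1) % 16 for p in 0:15)
@assert all(next_index_masked(p)   == (p + 1) % 16 for p in 0:15)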

@KristofferC
Member

Seems fixed now?

@StefanKarpinski
Member

Closing. Can reopen if still an issue.
