
@inbounds slowing down a loop #26261

Closed
staticfloat opened this issue Feb 28, 2018 · 9 comments
Labels
performance Must go faster

Comments

@staticfloat
Member

staticfloat commented Feb 28, 2018

Some of the first Julia code I ever wrote was a xorshift1024* implementation, written to compare its speed against our native rand() code. I was looking at it again and decided to sprinkle some of the new Julia magic speed dust on it, such as @inbounds, and was surprised to find that it actually made the whole thing slower.

Function in question:

function xorshift1024(N, s)
    x = Array{UInt64}(undef, N)
    p = 0
    for i = 1:N
        # One-based indexing strikes again
        s0 = s[p + 1]
        p = (p + 1)%16
        s1 = s[p + 1]

        # XOR block
        s1 = xor(s1, s1 << 31)
        s1 = xor(s1, s1 >> 11)
        s0 = xor(s0, s0 >> 30)
        s[p + 1] = xor(s0, s1)

        # Multiply by magic constant
        x[i] = s[p + 1] * 1181783497276652981
    end
    return x
end

I perform the allocation within xorshift1024() so as to better compare against rand(UInt64, N). The comparison itself is done here:

using BenchmarkTools

# Collect some performance data
for N in round.(Int64, 2 .^range(12, 28, step=4))
    println("Testing generation of $N 64-bit numbers")
    s = rand(UInt64, 16)

    # compare xorshift against rand()
    @btime xorshift1024($N, $s)
    @btime rand(UInt64, $N)
end

Running this code gives:

Testing generation of 4096 64-bit numbers
  9.133 μs (2 allocations: 32.08 KiB)
  4.576 μs (4 allocations: 32.23 KiB)
Testing generation of 65536 64-bit numbers
  144.154 μs (2 allocations: 512.08 KiB)
  79.739 μs (4 allocations: 512.23 KiB)
Testing generation of 1048576 64-bit numbers
  2.315 ms (2 allocations: 8.00 MiB)
  2.118 ms (4 allocations: 8.00 MiB)
Testing generation of 16777216 64-bit numbers
  47.408 ms (2 allocations: 128.00 MiB)
  42.937 ms (4 allocations: 128.00 MiB)
Testing generation of 268435456 64-bit numbers
  755.848 ms (2 allocations: 2.00 GiB)
  700.324 ms (4 allocations: 2.00 GiB)

That is, there is ~2x overhead for small N, but as N increases the two slowly converge. However, if I introduce @inbounds on the for loop within the function, I get:

Testing generation of 4096 64-bit numbers
  12.178 μs (2 allocations: 32.08 KiB)
  4.514 μs (4 allocations: 32.23 KiB)
Testing generation of 65536 64-bit numbers
  193.276 μs (2 allocations: 512.08 KiB)
  81.243 μs (4 allocations: 512.23 KiB)
Testing generation of 1048576 64-bit numbers
  3.094 ms (2 allocations: 8.00 MiB)
  2.165 ms (4 allocations: 8.00 MiB)
Testing generation of 16777216 64-bit numbers
  60.044 ms (2 allocations: 128.00 MiB)
  43.552 ms (4 allocations: 128.00 MiB)
Testing generation of 268435456 64-bit numbers
  965.672 ms (2 allocations: 2.00 GiB)
  700.540 ms (4 allocations: 2.00 GiB)

This is a surprisingly large performance drop, in my opinion.
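
For reference, the slower variant is just the same loop with @inbounds applied to the whole body; roughly (shown under a separate, hypothetical name so the two versions can be benchmarked side by side):

function xorshift1024_inbounds(N, s)
    x = Array{UInt64}(undef, N)
    p = 0
    # Same generator as above, but with bounds checks elided inside the hot loop
    @inbounds for i = 1:N
        s0 = s[p + 1]
        p = (p + 1)%16
        s1 = s[p + 1]

        # XOR block
        s1 = xor(s1, s1 << 31)
        s1 = xor(s1, s1 >> 11)
        s0 = xor(s0, s0 >> 30)
        s[p + 1] = xor(s0, s1)

        # Multiply by magic constant
        x[i] = s[p + 1] * 1181783497276652981
    end
    return x
end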

@staticfloat staticfloat added the performance Must go faster label Feb 28, 2018
@tshort
Contributor

tshort commented Mar 1, 2018

This seems to be system dependent. On my Windows laptop, I see the same slowdown you show here. On my Linux workstation, the @inbounds version is 10% faster (-O3 didn't make a difference).

@samoconnor
Contributor

Could it be that, without @inbounds, the bounds check is:
A) optimised away anyway because the function is quite simple, and
B) giving the compiler (LLVM?) some extra information that allows it to do better optimisation?

What does code_native say?
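
For example, one way to compare the generated code for the two versions (a sketch; assumes the xorshift1024 function and seed state from above):

using InteractiveUtils   # provides @code_native on Julia 0.7+

s = rand(UInt64, 16)

# Dump the native code for the benchmarked call signature; diffing the output
# with and without @inbounds in the loop should show where the two versions diverge.
@code_native xorshift1024(4096, s)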

@StefanKarpinski
Member

It would be interesting if we could separate the informative part of bounds checks from the error part: i.e. allow high-level code to communicate guarantees that it believes must hold, even when we turn off the checks that they do in fact hold. That could, of course, lead to undefined, dangerous behavior if the claimed facts are not true, but that's always the case with incorrect usage of @inbounds, so I don't think it's any worse than what we do now.
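
One hypothetical way to express just the informative half today is to emit an llvm.assume directly; a rough sketch (not an existing Base API, and just as dangerous as a wrong @inbounds if the condition is ever false):

# Hypothetical sketch: assert a fact to the optimizer without emitting an error branch.
# If the condition does not actually hold, behavior is undefined.
@inline function assume(cond::Bool)
    Base.llvmcall(
        ("declare void @llvm.assume(i1)",
         """
         %cond = trunc i8 %0 to i1
         call void @llvm.assume(i1 %cond)
         ret void
         """),
        Cvoid, Tuple{Bool}, cond)
end

# e.g. communicate the index invariant to the compiler ahead of the loop body:
# assume(0 <= p < 16)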

@samoconnor
Contributor

If the bounds assertion has the right branch-prediction hints, its cost should be close to nothing (assuming that fetching the bounds is cheap and is not done every time around the loop).
See __builtin_expect branch-prediction hint: http://llvm.org/docs/LangRef.html#llvm-expect-intrinsic
Related: #15495 (comment) #15495 (comment)

@samoconnor
Contributor

I've found (mostly working with C code) that a liberal sprinkling of assert generally makes things faster, and that the assumption that "we'll turn all the asserts off before we ship" is generally misguided. If the asserts are not doing algorithmically complex things inside loops, they usually seem to have negligible overhead (and often help as hints to the compiler).
Of course, if you have an assertion that walks an entire structure looking for broken references, that needs to be turned off in production, so you need a separate debug_assert macro for that.
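
A minimal sketch of what such a debug-only assertion could look like in Julia (hypothetical names; the flag is assumed to be a compile-time constant):

# Hypothetical sketch: assertions that vanish at macro-expansion time unless enabled,
# so expensive structural checks cost nothing in production builds.
const ENABLE_DEBUG_ASSERTS = false

macro debug_assert(cond, msg="debug assertion failed")
    ENABLE_DEBUG_ASSERTS || return nothing      # expands to a no-op when disabled
    return :( $(esc(cond)) || error($(esc(msg))) )
end

# Usage: the walk-the-whole-structure check only runs when the flag is enabled.
# @debug_assert all(r -> r !== nothing, references)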

@nalimilan
Member

The problem with bounds checks is that they generally/currently prevent some optimizations (e.g. #21402).

@vchuravy
Member

vchuravy commented Mar 2, 2018

A cursory glance at the assembly shows that the difference is in the remainder calculation, which LLVM optimises differently in the two versions.

Replacing (p+1) % 16 with either p = ((p % UInt64 + 1) % 16) % Int64 or (p+1) & (16-1) removes the performance difference and speeds up the code overall.
I don't know why @inbounds makes that difference.
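
For illustration, the two suggested rewrites of the index update (hypothetical helper names; both agree with (p + 1) % 16 for p in 0:15):

# Sketch of the two index updates that avoid the slow signed-remainder lowering.
next_index_unsigned(p) = ((p % UInt64 + 1) % 16) % Int64   # unsigned remainder, converted back
next_index_masked(p)   = (p + 1) & (16 - 1)                # power-of-two mask, no division

# Sanity check against the original expression:
@assert all(next_index_unsigned(p) == (p + 1) % 16 for p in 0:15)
@assert all(next_index_masked(p)   == (p + 1) % 16 for p in 0:15)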

@KristofferC
Member

Seems fixed now?

@StefanKarpinski
Member

Closing. Can reopen if still an issue.
