Fast-track `@threads` when `nthreads() == 1` (#32181)

Conversation
This check should not happen at macro expansion time.
Hello,

The only way I have found to work around #15276 is to make sure that the code with `@threads` avoids capturing variables. Would this not make it harder to detect that the closure bug is preventing speedup in multi-threaded code?

Cheers!
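As background, the kind of workaround being described can be sketched like this (function names are illustrative, not from this PR): keeping mutable state in a concretely typed container instead of a captured-and-assigned scalar sidesteps the #15276 boxing.

```julia
using Base.Threads

# Slow pattern: `s` is assigned inside the closure that @threads creates,
# so it is boxed (#15276) and every access goes through dynamic dispatch.
# (Also racy for nthreads() > 1; shown only to illustrate the capture.)
function boxed_sum(N)
    s = 0.0
    @threads for i in 1:N
        s += i
    end
    return s
end

# Workaround: per-thread slots in a concretely typed array; nothing is boxed.
function unboxed_sum(N)
    partial = zeros(Float64, nthreads())
    @threads for i in 1:N
        partial[threadid()] += i
    end
    return sum(partial)
end
```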
Force-pushed from 26b7cdc to 3cbd0c7
@raminammour while #15276 is something that can get triggered by code like this (and is in this case), this PR is fixing something independent of that. Yes, the overhead can mask that bug; if you want to test for #15276-style problems, I suggest you do a more direct test than relying on `@threads` timings.

I've updated the original PR message with new performance metrics and a slightly updated benchmark script.
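A more direct test for #15276-style captures, rather than inferring them from `@threads` timings, is to look at the closure itself (the standard tool is `@code_warntype`; the helper below is a sketch, not an official API): a captured variable that is assigned after capture is stored in a `Core.Box`, which defeats type inference.

```julia
# Hypothetical helper: check whether any field of a closure is a Core.Box.
has_boxed_capture(f) = any(T -> T === Core.Box, fieldtypes(typeof(f)))

function make_reader()
    x = 1.0
    return () -> x            # `x` is never assigned: captured as plain Float64
end

function make_writer()
    x = 1.0
    return () -> (x += 1)     # `x` is assigned in the closure: forced into a Core.Box
end

@show has_boxed_capture(make_reader())   # false: concretely typed capture
@show has_boxed_capture(make_writer())   # true: boxed, #15276 territory
```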
Just FYI, the benchmarks on my system are different:

Hope this helps.
It seems most of the slowdown is due to the closure bug; I get:

On `master`:
The

If you look at the

Sorry, my bad, here is a version:
Yes, that version does indeed remove some of the leftover performance gap; I'll include that in my benchmark above. But notice that you're still getting MUCH slower timings for even the serial case, because you're using Julia 1.0; use Julia 1.1 or, even better, `master`.

Before this PR:

After this PR:
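For reference, the `Ref`-based variant being compared here looks roughly like this (a sketch, not the exact code from the thread): because the closure captures the `Ref` itself, which is never reassigned, the capture stays concretely typed instead of becoming a `Core.Box`.

```julia
using Base.Threads

# NOTE: still a data race for nthreads() > 1; only meaningful for the
# single-thread timing comparison discussed in this thread.
function ref_sum(N)
    s = Ref(0.0)              # concretely typed Ref{Float64}
    @threads for i in 1:N
        s[] += i              # mutate the contents, never rebind `s`
    end
    return s[]
end
```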
@JeffBezanson, @yuyichao any further comments? I'm unsure why
Force-pushed from 3cbd0c7 to eb2de62
Pinging @JeffBezanson and @yuyichao again to see if there are any further comments. If not, I think we should merge this, as it's a straight performance win when using a single thread.
How much speedup do you get if you use https://github.com/JuliaLang/julia/pull/21452/files#diff-7198cded2577e0bdeb563f0f2713347bR69 instead?

Also, that branch is somehow changed to use the latest world in #30838, which is presumably to be consistent with the threading branch. If that's the case and is intended, then this should do that as well, which means that it cannot be compiled / inlined ahead of time.

The threading branch is changed to use the latest world in https://github.com/JuliaLang/julia/pull/31398/files#diff-5a6699a5aa7cf07be50461e3c7f68262L693, and in particular the code was added in d9d8d4c. Why is that? That seems to be fairly different from the semantics before and from the semantics of a normal for loop. Is that needed for something in particular? Otherwise, I don't really see the point of making this change.

Another reason calling the function is preferred is to make sure the semantics of the loop are actually the same. This means that the single-threaded case should behave like the multi-threaded one.

Also, is the nested thread loop hack still needed? Isn't the point of the new scheduler to handle nested loops?
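The fast-track idea under discussion can be sketched as follows (illustrative names, not Julia's actual lowering, and using `Threads.@spawn` from newer Julia for the multi-thread arm): `@threads` turns the loop body into a closure over chunk indices, and with one thread that same closure can simply be called inline, which keeps the function-call semantics intact.

```julia
using Base.Threads

# Sketch only: `threadsfor_fun` stands in for the closure @threads generates.
function run_chunks(threadsfor_fun, nchunks)
    if nthreads() == 1
        for c in 1:nchunks
            threadsfor_fun(c)                 # fast path: call the closure inline
        end
    else
        @sync for c in 1:nchunks
            Threads.@spawn threadsfor_fun(c)  # hand each chunk to the scheduler
        end
    end
end

# Usage: sum 1:100 split into 4 chunks of 25, each chunk writing its own slot.
partial = zeros(Int, 4)
body(c) = (partial[c] = sum((25 * (c - 1) + 1):(25 * c)))
run_chunks(body, 4)
@show sum(partial)   # 5050
```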
@vtjnash ^^^
Yeah, the
I agree that we shouldn't performance-hack the `@threads` macro.
I think you're asking what happens if I remove the macro entirely?

The latest benchmarks:
I think this is a very "internals perspective" viewpoint; you are using your deep knowledge of the compiler and its idiosyncrasies to debug code that interacts with these hidden parts of the compiler. From a user perspective, imposing a 40x slowdown (and that is while dodging the Ref/capturing issues; without dodging those it's an even worse slowdown) is pretty unacceptable. This performance slowdown is the single reason why users shy away from `@threads` in code that must also run single-threaded.
I've updated this PR to more closely match the scope semantics of the other logic branches.
Force-pushed from eb2de62 to f047af7
No, I mean if you remove the
It seems that it still doesn't have the same limitation as the other branches. All I'm saying is that you should just use the nested-threading branch for this case.
Force-pushed from f047af7 to 78eb422
That's a good idea @yuyichao; I didn't quite realize that the nested-threading case was essentially the same as my "single-thread" case. Changing this to just use that branch when `nthreads() == 1`.
In that case this looks good enough as-is, and the necessity of the `invokelatest` is just a separate issue.
I think it is exactly the other way around. I, as someone with an internals background and sufficient experience, can guess at why a `@threads` loop is slow; a typical user cannot.
Oh no! A 4000x slow-down. For me, one of the strong suits of Julia is performance predictability, and I feel that by micro-optimizing this case we make the performance model of a threaded loop (that there is a startup cost to pay) more opaque and less user-friendly.
FWIW, this is not a good example, since for nthreads>1 it (a) produces wrong results and (b) is slow. Both are for the same reason: all threads try to read and write the same memory location concurrently. This gives unpredictable (i.e. wrong) results and also makes your poor CPU weep when trying to synchronize caches between cores.

Single thread:

Two threads:
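The race being described can be reproduced with a small sketch (names made up here): a shared counter updated from `@threads` both loses increments and thrashes cache lines, while an `Atomic` is deterministic.

```julia
using Base.Threads

# Racy: every thread read-modify-writes the same slot. With more than one
# thread the final count is unpredictable and the cache line ping-pongs.
function racy_count(N)
    x = Ref(0)
    @threads for i in 1:N
        x[] += 1
    end
    return x[]
end

# Deterministic: atomic read-modify-write always yields N.
function atomic_count(N)
    x = Atomic{Int}(0)
    @threads for i in 1:N
        atomic_add!(x, 1)
    end
    return x[]
end

@show atomic_count(10_000)   # always 10000
@show racy_count(10_000)     # 10000 only when nthreads() == 1
```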
Do we have an actually correct (deterministic) example with significant (not O(1)) slowdown due to `@threads` overhead?
@chethega Sure, let's push the example farther toward reality. I've updated the benchmarks at the top to (a) have a slightly larger workload (1024 items), (b) actually compute something correctly no matter how many threads are assigned to it, and (c) remove the `Ref` workaround.
I think when someone is chasing performance, it's okay to expect them to do a little reading. We should not avoid fast paths just because a slow path exists; ideally we would simply have extremely fast work-division code and the fast path would be unnecessary.
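For illustration, the work division referred to here is essentially the arithmetic below (a sketch with made-up names: split a range into near-equal contiguous chunks, one per thread):

```julia
# Split `r` into `n` contiguous chunks, spreading the remainder over the
# first chunks so that chunk sizes differ by at most one.
function chunk(r::UnitRange, n::Int, c::Int)
    len, rem = divrem(length(r), n)
    lo = first(r) + (c - 1) * len + min(c - 1, rem)
    hi = lo + len - 1 + (c <= rem ? 1 : 0)
    return lo:hi
end

@show chunk(1:10, 3, 1)   # 1:4
@show chunk(1:10, 3, 2)   # 5:7
@show chunk(1:10, 3, 3)   # 8:10
```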
I take your point ;) and I should not have used that kind of comparison when I'm explicitly talking about very small problem sizes. These arise in things like inner kernels of larger computations.
This avoids overhead when threading is disabled. Example benchmark:

```
using BenchmarkTools, Base.Threads, Test

function func(val, N)
    sums = [0*(1 .^ val) for thread_idx in 1:nthreads()]
    for idx in 1:N
        sums[threadid()] += idx.^val
    end
    return sum(sums)
end

function func_threaded(val, N)
    sums = [0*(1 .^ val) for thread_idx in 1:nthreads()]
    @threads for idx in 1:N
        sums[threadid()] += idx.^val
    end
    return sum(sums)
end

@test func(2.0, 1<<10) == func_threaded(2.0, 1<<10)
@show @benchmark func(2.0, 1<<10)
@show @benchmark func_threaded(2.0, 1<<10)
```

Running the benchmarks as:

```
for JULIA in julia-master ./julia; do
    for T in 1 2; do
        echo "$JULIA with $T threads:"
        JULIA_NUM_THREADS=$T $JULIA speedtest.jl
    done
done
```

Before this commit:

```
julia-master with 1 threads:
julia-master with 2 threads:
```

After this commit:

```
./julia with 1 threads:
./julia with 2 threads:
```
Force-pushed from 78eb422 to baa94c8
Try #32477?
Fair enough. Even with #32477, I see no objection to the current variant. I think @yuyichao's comment referred to a previous version that you force-pushed away.
I briefly thought that would be a good idea, but you can't make that judgement as part of the macro, since my problem size of 4 might be as work-intensive as your problem of size
When I originally fixed #24688, it had taken a year and a half from when we originally noticed some weirdness going on to nail down the issue. If we had simply shortcutted the semantic behaviour of `@threads` in the single-threaded case, we would never have noticed.

It looks like #32477 brings the overhead down to 1us instead of 4us? From my perspective, adding a special case makes the cost model harder to reason about.

I can probably live with having a trigger/switch in the macro that enables the old behaviour. Then at least I can tell people: is your code with `@threads` on one thread slower than the same code without it?
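A check of this kind — comparing a function's serial timing against its `@threads` timing when only one thread is available — can be made mechanical; a rough helper (hypothetical name, warm-up included) might look like:

```julia
# Run both variants once to force compilation, then time them. With
# JULIA_NUM_THREADS=1, a ratio far above 1 points at macro overhead or
# #15276-style boxing rather than real parallel work.
function overhead_ratio(serial, threaded, args...)
    serial(args...); threaded(args...)       # warm up / compile
    t_serial   = @elapsed serial(args...)
    t_threaded = @elapsed threaded(args...)
    return t_threaded / t_serial
end
```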
Not sure where to put this, but I wanted to try the last bench functions, and surprisingly do not get a speedup in the multi-threaded case (with 4 threads, on an i7-3770K, on Windows 7). Julia 1.1.1:

I also tried this on the new alpha; allocations are much larger there. Julia 1.3.0-alpha: