scale! is overly slow for arrays of small or moderate size #6951
Comments
Any idea why? Is it possible that this is over-aggressive threading (similar to #6923), i.e. OpenBLAS is trying to use multiple threads for small sizes? Maybe try turning off multi-threading in OpenBLAS? cc: @staticfloat, @xianyi
Even the OpenBLAS copy is slower than you'd expect for small inputs. We should try threading fixes and also other BLAS libraries. For now, certainly update the `scale!` implementation.
This is the sort of thing that I would prefer to fix upstream; we aren't the only ones this would affect.
If feasible, this would of course be great to have fixed upstream. Still, it might be good to make it easier to switch between a pure Julia implementation and the BLAS implementation.
With threading turned off, the BLAS routine is still way too slow for arrays of smaller size; the crossover point has merely moved. Even if the threading problem is fixed upstream, we still need a cutoff so that it switches to the simple for-loop implementation when the size of the array is below some threshold.
I did more tests with BLAS level 1 routines, with and without the threading setting. The numbers in the table are the threshold lengths: when the array length is equal to or above this threshold, OpenBLAS performs better than a native loop. The testing script and detailed results are in this gist: https://gist.github.com/lindahua/7bcb03acca22da4e87b9
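A minimal version of such a crossover measurement might look like the following sketch. This is not the actual script from the gist; the function names and the set of lengths are mine.

```julia
using LinearAlgebra

# Plain-loop scaling, the baseline to compare against BLAS.
function loop_scale!(x::Vector{Float64}, c::Float64)
    @inbounds for i in eachindex(x)
        x[i] *= c
    end
    return x
end

# Time both implementations over many repetitions for several lengths
# and report which side of the crossover each length falls on.
function crossover_report(lengths = (8, 64, 512, 4096); reps = 10_000)
    for n in lengths
        x = rand(n)
        t_loop = @elapsed for _ in 1:reps
            loop_scale!(x, 1.0)
        end
        t_blas = @elapsed for _ in 1:reps
            BLAS.scal!(n, 1.0, x, 1)
        end
        faster = t_loop < t_blas ? "loop" : "blas"
        println("n = $n: loop = $(t_loop)s, blas = $(t_blas)s, faster = $faster")
    end
end
```

Scaling by 1.0 keeps the array unchanged across repetitions, so the timing loop measures only the call overhead and throughput, not a growing magnitude.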
Why does the performance gap shrink again at large sizes?
@IainNZ When the array is large enough, the function is memory bound (i.e. dominated by the cost of memory access, such as page faults and cache misses). In such cases, using SIMD or parallelism does not have a very significant impact.
I have filed an issue at OpenMathLib/OpenBLAS#375
Recently, I studied the performance of some basic functions. To my surprise, I found that `scale!` is two orders of magnitude slower than a for-loop for small arrays. It is still slower than the for-loop for arrays of 1000 elements.

The script for benchmarking is here: https://gist.github.com/lindahua/30fa06f62a6a13c89007

Below are the results I got on a MacBook Pro. Here, `length` is the length of the array. Performance is measured in terms of MPS, that is, million numbers per second. When `length(x) < 16`, the for-loop is over 100x faster than `scale!`, which relies on `BLAS.scal!`.

Seriously, I think we should implement `scale!` as a simple loop, at least for small arrays.
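The proposed code snippet did not survive in this thread; the following is a sketch in the spirit of the proposal, combined with the cutoff idea raised in the comments. The threshold value and function name are placeholders I chose, not measured or canonical.

```julia
using LinearAlgebra

# Hypothetical cutoff; the thread suggests measuring the actual
# crossover per routine rather than hard-coding one value.
const SCALE_CUTOFF = 64

# Scale x in place by c: plain loop below the cutoff, BLAS above it.
function my_scale!(x::Vector{Float64}, c::Float64)
    if length(x) < SCALE_CUTOFF
        @inbounds for i in eachindex(x)
            x[i] *= c
        end
    else
        BLAS.scal!(length(x), c, x, 1)
    end
    return x
end
```

Both branches mutate `x` in place and return it, so callers see the same behavior regardless of which path the length selects.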