increase default threading cutoff #103
Comments
Hello Jeff, there is a parameter that controls this cutoff. It would be helpful if you could run a benchmark similar to this and post the output, once with all cores utilized and once with OPENBLAS_NUM_THREADS=1:
With OpenBLAS r0.1.1 on a Sandy Bridge machine (Core i5 2500K), I observe that a single thread has lower runtime with square matrices of dimension less than roughly 50. --Zaheer |
I haven't looked at the code, but how should this handle tall or fat matrices? Some comprehensive benchmarking may be required, including a comparison to MKL. |
It would be nice if the threshold were set to 50, which to me is a better choice than 4. |
I tried various sizes and the cutoff is about 50 for me too. I imagine this might be different on different systems, but surely 4 is the wrong default. |
Hi all, I think this threshold is a naive implementation. We didn't consider very fat or tall matrices. Moreover, it may be related to the CPU type. I can increase this default value, but that cannot fix this issue on all platforms or for all matrices. According to MKL's documentation, MKL supports automatically tuning the number of threads. I think OpenBLAS needs a similar method to address this issue. For example, we could build a model to estimate the performance of a single thread versus multiple threads. Xianyi |
I ran some experiments with Intel MKL's DGEMM to determine when a single thread is used to execute the matrix multiplication C = A * B (where A: m x k, B: k x n, C: m x n). I varied only one dimension ('m') by powers of two and determined the largest size for the other two dimensions such that only one thread was used:
Several further observations:
--Zaheer |
If these kinds of parameters are CPU-dependent, perhaps a globally optimal threshold doesn't exist. Perhaps part of the OpenBLAS compilation process should be to run a quick suite of tests, determine the optimal thresholds, and record them somewhere? |
It should not be done at compile time, for sure, or else it would be tuned only for the machine it was compiled on. We just need a list of CPU types and tuned parameters. People can report these for incorporation. -viral On 07-May-2012, at 9:28 AM, Elliot [email protected] wrote:
|
Hi @Sunqiao8964 Please follow this issue. Xianyi |
Ok I am on it |
Hello all |
Hi @Sunqiao8964 , Please check out our develop branch. We set a threshold (50) to control the number of threads. Xianyi |
For my experiments concerning the shape of the matrix, that threshold is too high; my measured threshold is about 27. |
This morning I did several sets of experiments, which showed that the threshold for square matrix multiplication is 29. But I think it's a matter of the total number of elements in A and B. The function F(m, n, k) = k * (m + n) is what I am now testing, to see whether it is a useful function for deciding the threshold. |
It would be really nice to see something done about this. I know determining the optimal threshold is hard, but trying to use threads for 10x10 matrices is clearly wrong. We routinely see performance ~2x slower than other environments for small matrix code because of this. This feels like a silly problem to have, in the midst of all the sophisticated code tuning that goes on inside openblas. |
@JeffBezanson, for your case, you can edit the relevant source and rebuild. I want to explain the switch between single- and multi-threading a little bit: if m or n is less than the number of threads * SWITCH_RATIO, it will use a single thread. Xianyi |
On 27.01.2014 05:13, Zhang Xianyi wrote:
I wrote a simple test to show that it's not simple to determine the right threshold. But to create appropriate rules, we will need a lot of statistical data. Best regards |
The perfect is the enemy of the good. I know handling this really well is difficult, but meanwhile we have to put up with a 2x slowdown for people working with tiny matrices. Here we have such an impressively well-tuned BLAS, and yet it is giving up a factor of 2 for 5x5 matrices, a case so easy you can basically keep everything in registers and write out every operation by hand. I guess we will just have to patch |
On 27.01.2014 18:21, Jeff Bezanson wrote:
for so small matrix sizes, OpenBLAS and other solutions like MKL, ACML Best regards |
Possibly, but I still get a 2x speedup just by setting |
This is starting to get embarrassing. Yet another thread about unexpectedly slow OpenBLAS performance: https://groups.google.com/forum/#!forum/julia-users. The solution is to manually set the OpenBLAS threading threshold to something less ridiculous than 4. |
Sorry, I miswrote that – you can't actually set the threshold to something more reasonable. Instead, you have to choose – on a computation-by-computation basis – whether to set
is complete nonsense. You can certainly collect statistical data all you want, but there simply isn't a "right" answer to this question. Whatever you pick is going to be imperfect on most systems. But at least it won't be as completely clearly wrong as 4. Literally anything would be better than 4. Except for 1, 2 and 3. At least make a guess that's not obviously dumb – 4 is an absurd threshold. If you change this to something like 150, then it will at least be close to right on many systems. |
Why was this closed? |
On 04.06.2014 17:21, Jeff Bezanson wrote:
Please excuse me. Sorry, Werner |
OK. I think my original issue title was bad. Carefully "tuning" the threading cutoff is an interesting and difficult problem, but that's not what I want here. I just want the default value to be higher than 4, which is insane. Trying to use threads for 5x5 matrices is essentially a bug. The value should just be raised to 20 or 40 or so, then I'll be much happier. |
Hi, I just ran dgemm tests on very small matrix sizes and found that if M*N*K < 4096 (16*16*16), the simple algorithm used in netlib-refblas is faster than single-threaded OpenBLAS. Best regards |
Rather than writing special kernels, which is a significant task, could you please just increase the threshold to something reasonable first? |
On 13.06.2014 16:45, Stefan Karpinski wrote:
I can simply increase the threshold, but I think that this will not solve the problem completely. I found two important points: The first point is at M*N*K < 4096, where a simple gemm kernel is faster. The second point seems to be at M*N*K < 262144 (64*64*64). Best regards, Werner |
Yes, it won't entirely solve the problem, but it will make it MUCH BETTER in the meantime. Please, increase the threshold to something more reasonable and then work on a more complete solution. |
Based on some statistical tests, I modified interface/gemm.c for small matrix dimensions with |
Ref #103: enhancement for small matrix dimensions. Fixed some bugs. Enable sgemm for SNB and dgemm for NEHALEM
In Julia, we have been using 50 as the threshold which works well. Given that there is a way to address this in the build system, this could be closed. |
I find it worrying that your threshold of 50 is so far from the current default, and if anything the threading behaviour has become more complicated, with individual magic numbers used as thresholds. |
I'm not familiar with the openmp work, but one issue is that openmp is not available on all the platforms/OS combinations. |
I am certainly aware of that. I see OpenMP in this context as a subtopic that got buried quickly the first time it came up, and I would like to get some comment on it before it gets buried again. A speed gain |
My own experiments with the OMP scheduler were a bit inconclusive - which scheduler works best seems to depend very much on the functions and workload, so #1620 makes the choice available at |
Wouldn't it make sense to increase the default value of |
So far there does not appear to be consensus on what the good value is. The last time the threshold was changed, it was reverted quickly (#216), though inconsistent use of the variable in GEMM vs GEMV may have been an issue back then. With new contributors taking a fresh look on the threading behaviour of OpenBLAS now I hope this will eventually be resolved. |
With OPENBLAS_NUM_THREADS=1:
julia> a=rand(10,10); @time for i=1:100000; a*a; end
elapsed time: 0.11344099044799805 seconds
With OPENBLAS_NUM_THREADS=2:
julia> a=rand(10,10); @time for i=1:100000; a*a; end
elapsed time: 0.23067402839660645 seconds
This is OpenBLAS v0.1.1 on a Core i5 650. There seems to be a slowdown for small matrices when threads are enabled. In the same setup I see nice speedups for big matrices, e.g. over 100x100.
Can anything be done about this? Setting OPENBLAS_NUM_THREADS is easy, but we'd rather get the best of both worlds if possible :)