increase default threading cutoff #103

Closed
JeffBezanson opened this issue May 2, 2012 · 40 comments

@JeffBezanson

With OPENBLAS_NUM_THREADS=1:

julia> a=rand(10,10); @time for i=1:100000; a*a; end
elapsed time: 0.11344099044799805 seconds

With OPENBLAS_NUM_THREADS=2:

julia> a=rand(10,10); @time for i=1:100000; a*a; end
elapsed time: 0.23067402839660645 seconds

This is OpenBLAS v0.1.1 on a Core i5 650. There seems to be a slowdown for small matrices with threads enabled. In the same setup I see nice speedups for big matrices, e.g. over 100x100.

Can anything be done about this? Setting OPENBLAS_NUM_THREADS is easy, but we'd rather get the best of both worlds if possible :)
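
One workaround on the caller's side, sketched here in C, is to switch OpenBLAS to one thread around small-matrix calls and restore threading afterwards (this assumes the openblas_set_num_threads extension is available, as declared in OpenBLAS's cblas.h):

    #include <cblas.h>

    /* Drop to a single BLAS thread for the small-matrix hot path, then
     * restore threading for subsequent large problems. */
    void multiply_small(const double *a, double *c, int n, int ncores) {
        openblas_set_num_threads(1);      /* avoid threading overhead for tiny n */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, a, n, 0.0, c, n);
        openblas_set_num_threads(ncores); /* restore the usual core count */
    }

This works, but it pushes a library tuning decision onto every caller, which is why a better default matters.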

@zchothia
Contributor

zchothia commented May 2, 2012

Hello Jeff,

There is a parameter named GEMM_MULTITHREAD_THRESHOLD in Makefile.rule which specifies the dimension below which the matrix multiplication will be performed with a single thread. Currently this is set to 4 by default, although this value may need some tuning.

It would be helpful if you could run a benchmark similar to this and post the output, once with all cores utilized and once with OPENBLAS_NUM_THREADS=1:

for n = 5:5:100
  A = rand(n, n);
  @time for i = 1:10000; A*A; end
end

With OpenBLAS r0.1.1 on a Sandy Bridge machine (Core i5 2500K), I observe that a single thread has lower runtime with square matrices of dimension less than roughly 50.

--Zaheer

@ViralBShah
Contributor

I haven't looked at the code, but how should this handle tall or fat matrices? Maybe some comprehensive benchmarking is required, along with a comparison to MKL.

@ViralBShah
Contributor

It would be nice if the threshold were set to 50, which seems to me a much better choice than 4.

@JeffBezanson
Author

I tried various sizes and the cutoff is about 50 for me too. I imagine this might be different on different systems, but surely 4 is the wrong default.

@xianyi
Collaborator

xianyi commented May 3, 2012

Hi all,

I think this threshold is a naive implementation. We didn't consider very fat or tall matrices. Moreover, the right value may depend on the CPU type.

I can increase this default value, but that alone cannot fix this issue on all platforms or for all matrix shapes.

According to the MKL documentation, MKL supports automatically tuning the number of threads. I think OpenBLAS needs a similar mechanism to address this issue. For example, we could build a model that estimates the performance of single-threaded versus multi-threaded execution.
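
For illustration, a minimal sketch of such a model (not OpenBLAS code; the work-per-thread constant is hypothetical and would need per-CPU calibration), sizing the thread count to the estimated amount of work:

    #include <stddef.h>

    /* Estimate GEMM work as 2*m*n*k flops and only add threads once the
     * work amortizes a per-thread startup cost. */
    static int choose_gemm_threads(size_t m, size_t n, size_t k, int max_threads) {
        const double flops = 2.0 * (double)m * (double)n * (double)k;
        const double min_flops_per_thread = 5.0e5;  /* hypothetical, needs tuning */
        double t = flops / min_flops_per_thread;
        if (t < 1.0) return 1;                      /* too little work: stay serial */
        if (t > (double)max_threads) return max_threads;
        return (int)t;                              /* scale threads with work */
    }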

Xianyi

@zchothia
Contributor

zchothia commented May 3, 2012

I ran some experiments with Intel MKL's DGEMM to determine when a single thread is used to execute the matrix multiplication C = A * B (where A: m x k, B: k x n, C: m x n). I varied only one dimension ('m') by powers of two and determined the largest size for the other two dimensions such that only one thread was used:

   m |  n  |  k
 ----------------
   2 |  79 |  79
   4 |  77 |  77
   8 |  69 |  69
  16 |  49 |  49
  32 |  35 |  35
  64 |  25 |  25
 128 |  17 |  17
 256 |  12 |  12
 512 |   8 |   8
1024 |   4 |   4
2048 |   2 |   2

Several further observations:

  • If any dimension is equal to one (i.e. essentially matrix-vector multiplication) then just one thread is used regardless of the other dimensions.

  • Memory footprint for all three matrices is sizeof(double) * (m*k + k*n + m*n) bytes. In all the cases above this never exceeds 64 KB. For comparison, the size of the L1 cache is 32 KB.

  • For the square case (m = n = k) we can derive the largest problem size which fits entirely into cache:

    sizeof(double) * 3n^2 == L1 cache size
    24n^2 == 32768
    n == sqrt(32768 / 24) ≈ 36.95

    This corresponds with an experiment where MKL did not trigger parallelism for sizes smaller than 33.

--Zaheer

@staticfloat
Contributor

If these kinds of parameters are CPU-dependent, perhaps a global optimal threshold doesn't exist. Perhaps part of the OpenBLAS compilation process should be to run a quick suite of tests, determine the optimal thresholds, and record them somewhere?

@ViralBShah
Contributor

It should not be done at compile time for sure, or else it would be tuned only for the machine it was compiled on. We just need a list of CPU types and tuned parameters. People can report these for incorporation.
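
A minimal sketch of what such a reported-parameters table could look like (the structure is hypothetical and the values are just numbers quoted in this thread, not measured defaults):

    /* Per-CPU tuned thresholds, extendable as users report measurements. */
    typedef struct {
        const char *cpu_name;  /* as identified by CPU detection */
        int gemm_threshold;    /* min square dimension worth threading */
    } tuned_params;

    static const tuned_params tuning_table[] = {
        { "Core i5 650",   50 },  /* reported above */
        { "Core i5 2500K", 50 },  /* reported above */
        { "Core i7",       29 },  /* reported later in this thread */
    };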

-viral


@ghost ghost assigned Sunqiao8964 May 10, 2012
@xianyi
Collaborator

xianyi commented May 10, 2012

Hi @Sunqiao8964,

Please follow up on this issue.

Xianyi

@Sunqiao8964

OK, I am on it.

@Sunqiao8964

Hello all,

I am a novice here, and I have a question. According to my observation, when m < 16 or n < 16, OpenBLAS only uses a single thread no matter what OMP_NUM_THREADS is set to. Is this by design, or is it platform dependent? My platform is a Core i7.

@xianyi
Collaborator

xianyi commented May 18, 2012

Hi @Sunqiao8964,

Please check out our develop branch. We set a threshold (50) to control the number of threads.

Xianyi

@Sunqiao8964

In my experiments concerning matrix shape, that threshold is too high; my measured threshold is about 27.

@Sunqiao8964

This morning I did several sets of experiments, which showed that the threshold for square matrix multiplication is 29. But I think it is really a matter of the total number of elements in A and B: A (m x k) and B (k x n) together hold k * (m + n) elements. The function F(m, n, k) = k * (m + n) is what I am now testing, to see if it is a useful criterion for deciding the threshold.
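
A minimal sketch of the criterion under test (hypothetical code, not a proposed patch; the cutoff 1682 = 29 * (29 + 29) is the square-case crossover measured above):

    /* Thread only when A (m x k) and B (k x n) together exceed a
     * calibrated element count. */
    static int gemm_should_thread(int m, int n, int k) {
        const long cutoff = 29L * (29 + 29);  /* = 1682 elements */
        return (long)k * (m + n) > cutoff;
    }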

@JeffBezanson
Author

It would be really nice to see something done about this. I know determining the optimal threshold is hard, but trying to use threads for 10x10 matrices is clearly wrong. We routinely see performance ~2x slower than other environments for small matrix code because of this. This feels like a silly problem to have, in the midst of all the sophisticated code tuning that goes on inside OpenBLAS.

@xianyi
Collaborator

xianyi commented Jan 27, 2014

@JeffBezanson , for your case, you can edit GEMM_MULTITHREAD_THRESHOLD = 10 in Makefile.rule.

I want to explain the switching between single- and multi-threading a little bit.
In GotoBLAS/OpenBLAS, there is a parameter SWITCH_RATIO in param.h. The default value is 4 for Intel Sandybridge, Nehalem, etc.
In driver/level3/level3_thread.c,

  if ((args -> m < nthreads * SWITCH_RATIO) || (args -> n < nthreads * SWITCH_RATIO)) {
    GEMM_LOCAL(args, range_m, range_n, sa, sb, 0);
    return 0;
  }

Thus, if m or n is less than the number of threads * SWITCH_RATIO, it will use a single thread. For example, with 4 threads and SWITCH_RATIO = 4, any GEMM with m or n below 16 runs single-threaded, which is consistent with the m < 16 behaviour @Sunqiao8964 observed on a quad-core i7.

Xianyi

@wernsaar
Contributor

Hi,

I wrote a simple test to show that it's not simple to determine the best number of cores with small problem sizes. Attached is a result for BULLDOZER.
There are two interesting points:

  1. Sizes between 320 and 384: DGEMM_UNROLL_P is 384 for BULLDOZER, and it seems that using a single core is better there.
  2. Sizes between 640 and 704: there is a low point for 4 and 8 cores. But if I divide 672 by 4, the result is 168, and this is the value of DGEMM_UNROLL_Q.

But to create appropriate rules, we will need a lot of statistical data.

Best regards
Werner

@JeffBezanson
Author

The perfect is the enemy of the good. I know handling this really well is difficult, but meanwhile we have to put up with a 2x slowdown for people working with tiny matrices. Here we have such an impressively well-tuned BLAS, and yet it is giving up a factor of 2 for 5x5 matrices, a case so easy you can basically keep everything in registers and write out every operation by hand. I guess we will just have to patch Makefile.rule.

@wernsaar
Contributor

Hi,

for such small matrix sizes, OpenBLAS and other solutions like MKL, ACML, and ATLAS simply have too much overhead, because the matrices A and B first have to be copied and/or transformed. I think that even the reference BLAS from Netlib would be faster.

Best regards
Werner

@JeffBezanson
Author

Possibly, but I still get a 2x speedup just by setting OPENBLAS_NUM_THREADS=1. The resulting performance is reasonable and in line with what people expect, even if not totally optimal.

@StefanKarpinski
Contributor

This is starting to get embarrassing. Yet another thread about unexpectedly slow OpenBLAS performance: https://groups.google.com/forum/#!forum/julia-users. The solution is to manually set the OpenBLAS threading threshold to something less ridiculous than 4.

@StefanKarpinski
Contributor

Sorry, I miswrote that – you can't actually set the threshold to something more reasonable. Instead, you have to choose – on a computation-by-computation basis – whether to set OPENBLAS_NUM_THREADS=1 or not. This is a crazy and completely unreasonable interface. Excusing this terrible performance by saying

But to create appropriate rules, we will need a lot of statistical data.

is complete nonsense. You can certainly collect statistical data all you want, but there simply isn't a "right" answer to this question. Whatever you pick is going to be imperfect on most systems. But at least it won't be as completely clearly wrong as 4. Literally anything would be better than 4. Except for 1, 2 and 3. At least make a guess that's not obviously dumb – 4 is an absurd threshold. If you change this to something like 150, then it will at least be close to right on many systems.

@JeffBezanson
Author

Why was this closed?

@wernsaar wernsaar reopened this Jun 4, 2014
@wernsaar
Contributor

wernsaar commented Jun 4, 2014

Hi,

Please excuse me.
I am reviewing all open issues and I accidentally closed this issue.
I reopened it as a feature request.

Sorry

Werner

@JeffBezanson JeffBezanson changed the title tune threading cutoff increase default threading cutoff Jun 4, 2014
@JeffBezanson
Author

OK.

I think my original issue title was bad. Carefully "tuning" the threading cutoff is an interesting and difficult problem, but that's not what I want here. I just want the default value to be higher than 4, which is insane. Trying to use threads for 5x5 matrices is essentially a bug. The value should just be raised to 20 or 40 or so, then I'll be much happier.

@wernsaar
Contributor

Hi,

I just ran dgemm tests on very small matrix sizes and found that if M*N*K < 4096 (16*16*16), the simple algorithm used in the Netlib reference BLAS is faster than single-threaded OpenBLAS. The reason is that OpenBLAS has the overhead of first copying/transposing the input matrices, which costs a lot of read/write operations. I am thinking about writing special gemm kernels for such small sizes.
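
For concreteness, the kind of "simple algorithm" meant here is a plain triple loop over column-major data with no packing or transposition; a minimal sketch (illustrative only, not the Netlib source):

    /* Naive column-major C = alpha*A*B + beta*C for tiny matrices. */
    static void dgemm_tiny(int m, int n, int k, double alpha,
                           const double *A, int lda,
                           const double *B, int ldb,
                           double beta, double *C, int ldc) {
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                double acc = 0.0;
                for (int p = 0; p < k; p++)
                    acc += A[i + p * lda] * B[p + j * ldb];
                C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
            }
        }
    }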

Best regards
Werner

@StefanKarpinski
Contributor

Rather than writing special kernels, which is a significant task, could you please just increase the threshold to something reasonable first?

@wernsaar
Contributor

Hi,

I can simply increase the threshold, but I think that this will not solve the problem. We need a better estimate of how many threads should be used, depending on the sizes of M, N, and K.

I found two important points:

The first point is at M*N*K < 4096, where a simple gemm kernel is faster than the OpenBLAS kernel.

The second point seems to be at M*N*K < 262144 (64*64*64), where a single-threaded kernel is faster than a multithreaded kernel. But at this time I don't have enough statistical data for non-square matrix sizes.
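
Combining the two crossover points into one decision rule would look roughly like this (a sketch with the constants measured above; real cutoffs would need per-CPU calibration and handling of non-square shapes):

    enum gemm_path { GEMM_SIMPLE, GEMM_SINGLE, GEMM_THREADED };

    /* Tiny problems: naive kernel. Small problems: optimized kernel on
     * one thread. Everything else: threaded. */
    static enum gemm_path pick_gemm_path(long m, long n, long k) {
        long mnk = m * n * k;
        if (mnk < 4096L)   return GEMM_SIMPLE;   /* 16*16*16 crossover */
        if (mnk < 262144L) return GEMM_SINGLE;   /* 64*64*64 crossover */
        return GEMM_THREADED;
    }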

Best regards

Werner

@StefanKarpinski
Contributor

Yes, it won't entirely solve the problem, but it will make it MUCH BETTER in the meantime. Please, increase the threshold to something more reasonable and then work on a more complete solution.

@wernsaar
Contributor

Based on some statistical tests, I modified interface/gemm.c to handle small matrix dimensions with GEMM_MULTITHREAD_THRESHOLD=4. I think that reasonable values for GEMM_MULTITHREAD_THRESHOLD are in the range from 2 to 8.
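
For reference, the small-size handling in interface/gemm.c took roughly this shape (a paraphrase, with SMP_THRESHOLD_MIN standing in for OpenBLAS's internal constant; exact names and values vary between versions): the threshold now scales a volume cutoff instead of acting on a single dimension.

    #define SMP_THRESHOLD_MIN 65536.0       /* stand-in for the internal constant */
    #define GEMM_MULTITHREAD_THRESHOLD 4

    /* Skip threading when the total GEMM volume m*n*k is below the
     * scaled threshold. */
    static int gemm_nthreads(double m, double n, double k, int max_threads) {
        if (m * n * k <= SMP_THRESHOLD_MIN * GEMM_MULTITHREAD_THRESHOLD)
            return 1;
        return max_threads;
    }

Note that with the default threshold of 4 this places the single-thread cutoff at 4 * 65536 = 262144 = 64*64*64 elements, matching the second crossover point measured above.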

@ViralBShah
Contributor

In Julia, we have been using 50 as the threshold, which works well. Given that there is a way to address this in the build system, this could be closed.

@martin-frbg
Collaborator

I find it worrying that your threshold of 50 is so far from the current default, and if anything the threading behaviour has become more complicated, with individual magic numbers used as threshold values in specific BLAS functions. So I am hesitant to close this; at the very least, the description of this parameter in Makefile.rule should be amended to make it clear that the sane range of values extends to (or, as it seems some think, begins around) 50.
On reviewing this thread, there was also a comment by jiahao (#103 (comment)) about improvements seen from choosing a different OpenMP loop scheduler in blas_server_omp, which sadly elicited no response. Thoughts?

@ViralBShah
Contributor

I'm not familiar with the OpenMP work, but one issue is that OpenMP is not available on all platform/OS combinations.

@martin-frbg
Collaborator

one issue is that OpenMP is not available on all the platforms

I am certainly aware of that. I see OpenMP in this context as a subtopic that got buried quickly the first time it came up, and I would like to get some comment on it before it gets buried again. A speed gain from switching away from the "static" scheduler would appear to suggest an unbalanced workload where some threads take longer than others. (Though the example given in JuliaLang/julia#3239 (comment) was a bit vague, in that it picked the auto scheduler without determining what the compiler chose.)
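
For context, a minimal self-contained illustration of the scheduler choice under discussion (not blas_server_omp.c itself): schedule(runtime) defers the policy to the OMP_SCHEDULE environment variable, so static, dynamic, and guided can be compared without rebuilding.

    #include <omp.h>

    /* The runtime schedule reads OMP_SCHEDULE at program start, e.g.
     * OMP_SCHEDULE="dynamic,64", making scheduler A/B tests cheap. */
    void scaled_add(int n, double alpha, const double *x, double *y) {
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }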

@martin-frbg
Collaborator

martin-frbg commented Jun 15, 2018

My own experiments with the OMP scheduler were a bit inconclusive - which scheduler works best seems to depend very much on the function and workload - so #1620 makes the choice available at build time (or, by picking the runtime scheduler, even at runtime through setting OMP_SCHEDULE). I have also documented the changed meaning of GEMM_MULTITHREAD_THRESHOLD in Makefile.rule, along with Julia's use of the value 50 there.

@nalimilan

Wouldn't it make sense to increase the default value of GEMM_MULTITHREAD_THRESHOLD? Given how far it is from Julia's choice, either OpenBLAS or Julia uses a bad value.

@martin-frbg
Collaborator

So far there does not appear to be consensus on what the good value is. The last time the threshold was changed, it was reverted quickly (#216), though inconsistent use of the variable in GEMM vs. GEMV may have been an issue back then. With new contributors taking a fresh look at the threading behaviour of OpenBLAS, I hope this will eventually be resolved.

martin-frbg added a commit that referenced this issue Oct 19, 2020