increase default threading cutoff #103

Closed
JeffBezanson opened this issue May 2, 2012 · 40 comments

@JeffBezanson

With OPENBLAS_NUM_THREADS=1:

julia> a=rand(10,10); @time for i=1:100000; a*a; end
elapsed time: 0.11344099044799805 seconds

With OPENBLAS_NUM_THREADS=2:

julia> a=rand(10,10); @time for i=1:100000; a*a; end
elapsed time: 0.23067402839660645 seconds

This is OpenBLAS v0.1.1 on a Core i5 650. There seems to be a slowdown for small matrices with threads enabled. In the same setup I see nice speedups for big matrices, e.g. over 100x100.

Can anything be done about this? Setting OPENBLAS_NUM_THREADS is easy, but we'd rather get the best of both worlds if possible :)
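
One workaround on the caller's side, sketched here in C, is to switch OpenBLAS to one thread around small-matrix calls and restore threading afterwards (this assumes the openblas_set_num_threads extension is available, as declared in OpenBLAS's cblas.h):

    #include <cblas.h>

    /* Drop to a single BLAS thread for the small-matrix hot path, then
     * restore threading for subsequent large problems. */
    void multiply_small(const double *a, double *c, int n, int ncores) {
        openblas_set_num_threads(1);      /* avoid threading overhead for tiny n */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, a, n, 0.0, c, n);
        openblas_set_num_threads(ncores); /* restore the usual core count */
    }

This works, but it pushes a library tuning decision onto every caller, which is why a better default matters.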

@zchothia
Contributor

zchothia commented May 2, 2012

Hello Jeff,

There is a parameter named GEMM_MULTITHREAD_THRESHOLD in Makefile.rule which specifies the dimension below which the matrix multiplication will be performed with a single thread. Currently this is set to 4 by default, although this value may need some tuning.

It would be helpful if you could run a benchmark similar to this and post the output, once with all cores utilized and once with OPENBLAS_NUM_THREADS=1:

for n = 5:5:100
  A = rand(n, n);
  @time for i = 1:10000; A*A; end
end

With OpenBLAS r0.1.1 on a Sandy Bridge machine (Core i5 2500K), I observe that a single thread has lower runtime with square matrices of dimension less than roughly 50.

--Zaheer

@ViralBShah
Contributor

I haven't looked at the code, but how should this handle tall or fat matrices? Maybe some comprehensive benchmarking is required, along with a comparison to MKL.

@ViralBShah
Contributor

It would be nice if the threshold were set to 50, which seems to me a much better choice than 4.

@JeffBezanson
Author

I tried various sizes and the cutoff is about 50 for me too. I imagine this might be different on different systems, but surely 4 is the wrong default.

@xianyi
Collaborator

xianyi commented May 3, 2012

Hi all,

I think this threshold is a naive implementation. We didn't consider very fat or tall matrices. Moreover, the right value may depend on the CPU type.

I can increase this default value, but that alone cannot fix this issue on all platforms or for all matrix shapes.

According to the MKL documentation, MKL supports automatically tuning the number of threads. I think OpenBLAS needs a similar mechanism to address this issue. For example, we could build a model that estimates the performance of single-threaded versus multi-threaded execution.
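
For illustration, a minimal sketch of such a model (not OpenBLAS code; the work-per-thread constant is hypothetical and would need per-CPU calibration), sizing the thread count to the estimated amount of work:

    #include <stddef.h>

    /* Estimate GEMM work as 2*m*n*k flops and only add threads once the
     * work amortizes a per-thread startup cost. */
    static int choose_gemm_threads(size_t m, size_t n, size_t k, int max_threads) {
        const double flops = 2.0 * (double)m * (double)n * (double)k;
        const double min_flops_per_thread = 5.0e5;  /* hypothetical, needs tuning */
        double t = flops / min_flops_per_thread;
        if (t < 1.0) return 1;                      /* too little work: stay serial */
        if (t > (double)max_threads) return max_threads;
        return (int)t;                              /* scale threads with work */
    }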

Xianyi

@zchothia
Contributor

zchothia commented May 3, 2012

I ran some experiments with Intel MKL's DGEMM to determine when a single thread is used to execute the matrix multiplication C = A * B (where A: m x k, B: k x n, C: m x n). I varied only one dimension ('m') by powers of two and determined the largest size for the other two dimensions such that only one thread was used:

   m |  n  |  k
 ----------------
   2 |  79 |  79
   4 |  77 |  77
   8 |  69 |  69
  16 |  49 |  49
  32 |  35 |  35
  64 |  25 |  25
 128 |  17 |  17
 256 |  12 |  12
 512 |   8 |   8
1024 |   4 |   4
2048 |   2 |   2

Several further observations:

  • If any dimension is equal to one (i.e. essentially matrix-vector multiplication) then just one thread is used regardless of the other dimensions.

  • Memory footprint for all three matrices is sizeof(double) * (m*k + k*n + m*n) bytes. In all the cases above this never exceeds 64 KB. For comparison, the size of the L1 cache is 32 KB.

  • For the square case (m = n = k) we can derive the largest problem size which fits entirely into cache:

    sizeof(double) * 3n^2 == L1 cache size
    24n^2 == 32768
    n == sqrt(32768 / 24) ≈ 36.95

    This corresponds with an experiment where MKL did not trigger parallelism for sizes smaller than 33.

--Zaheer

@staticfloat
Contributor

If these kinds of parameters are CPU-dependent, perhaps a global optimal threshold doesn't exist. Perhaps part of the OpenBLAS compilation process should be to run a quick suite of tests, determine the optimal thresholds, and record them somewhere?

@ViralBShah
Contributor

It should not be done at compile time for sure, or else it would be tuned only for the machine it was compiled on. We just need a list of CPU types and tuned parameters. People can report these for incorporation.
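
A minimal sketch of what such a reported-parameters table could look like (the structure is hypothetical and the values are just numbers quoted in this thread, not measured defaults):

    /* Per-CPU tuned thresholds, extendable as users report measurements. */
    typedef struct {
        const char *cpu_name;  /* as identified by CPU detection */
        int gemm_threshold;    /* min square dimension worth threading */
    } tuned_params;

    static const tuned_params tuning_table[] = {
        { "Core i5 650",   50 },  /* reported above */
        { "Core i5 2500K", 50 },  /* reported above */
        { "Core i7",       29 },  /* reported later in this thread */
    };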

-viral


@ghost ghost assigned Sunqiao8964 May 10, 2012
@xianyi
Collaborator

xianyi commented May 10, 2012

Hi @Sunqiao8964,

Please follow up on this issue.

Xianyi

@Sunqiao8964

OK, I am on it.

@Sunqiao8964

Hello all,

I am a novice here, and I have a question. According to my observation, when m < 16 or n < 16, OpenBLAS only uses a single thread no matter what OMP_NUM_THREADS is set to. Is this by design, or is it platform dependent? My platform is a Core i7.

@xianyi
Collaborator

xianyi commented May 18, 2012

Hi @Sunqiao8964,

Please check out our develop branch. We set a threshold (50) to control the number of threads.

Xianyi

@Sunqiao8964

In my experiments concerning matrix shape, that threshold is too high; my measured threshold is about 27.

@Sunqiao8964

This morning I did several sets of experiments, which showed that the threshold for square matrix multiplication is 29. But I think it is really a matter of the total number of elements in A and B: A (m x k) and B (k x n) together hold k * (m + n) elements. The function F(m, n, k) = k * (m + n) is what I am now testing, to see if it is a useful criterion for deciding the threshold.
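
A minimal sketch of the criterion under test (hypothetical code, not a proposed patch; the cutoff 1682 = 29 * (29 + 29) is the square-case crossover measured above):

    /* Thread only when A (m x k) and B (k x n) together exceed a
     * calibrated element count. */
    static int gemm_should_thread(int m, int n, int k) {
        const long cutoff = 29L * (29 + 29);  /* = 1682 elements */
        return (long)k * (m + n) > cutoff;
    }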

@JeffBezanson
Author

It would be really nice to see something done about this. I know determining the optimal threshold is hard, but trying to use threads for 10x10 matrices is clearly wrong. We routinely see performance ~2x slower than other environments for small matrix code because of this. This feels like a silly problem to have, in the midst of all the sophisticated code tuning that goes on inside OpenBLAS.

@xianyi
Collaborator

xianyi commented Jan 27, 2014

@JeffBezanson , for your case, you can edit GEMM_MULTITHREAD_THRESHOLD = 10 in Makefile.rule.

I want to explain the switching between single- and multi-threading a little bit.
In GotoBLAS/OpenBLAS, there is a parameter SWITCH_RATIO in param.h. The default value is 4 for Intel Sandybridge, Nehalem, etc.
In driver/level3/level3_thread.c,

  if ((args -> m < nthreads * SWITCH_RATIO) || (args -> n < nthreads * SWITCH_RATIO)) {
    GEMM_LOCAL(args, range_m, range_n, sa, sb, 0);
    return 0;
  }

Thus, if m or n is less than the number of threads * SWITCH_RATIO, it will use a single thread. For example, with 4 threads and SWITCH_RATIO = 4, any GEMM with m or n below 16 runs single-threaded, which is consistent with the m < 16 behaviour @Sunqiao8964 observed on a quad-core i7.

Xianyi

@wernsaar
Contributor

Hi,

I wrote a simple test to show that it's not simple to determine the best number of cores with small problem sizes. Attached is a result for BULLDOZER.
There are two interesting points:

  1. Sizes between 320 and 384: DGEMM_UNROLL_P is 384 for BULLDOZER, and it seems that using a single core is better there.
  2. Sizes between 640 and 704: there is a low point for 4 and 8 cores. But if I divide 672 by 4, the result is 168, and this is the value of DGEMM_UNROLL_Q.

But to create appropriate rules, we will need a lot of statistical data.

Best regards
Werner

@JeffBezanson
Author

The perfect is the enemy of the good. I know handling this really well is difficult, but meanwhile we have to put up with a 2x slowdown for people working with tiny matrices. Here we have such an impressively well-tuned BLAS, and yet it is giving up a factor of 2 for 5x5 matrices, a case so easy you can basically keep everything in registers and write out every operation by hand. I guess we will just have to patch Makefile.rule.

@wernsaar
Contributor

Hi,

for such small matrix sizes, OpenBLAS and other solutions like MKL, ACML, and ATLAS simply have too much overhead, because the matrices A and B first have to be copied and/or transformed. I think that even the reference BLAS from Netlib would be faster.

Best regards
Werner

@JeffBezanson
Author

Possibly, but I still get a 2x speedup just by setting OPENBLAS_NUM_THREADS=1. The resulting performance is reasonable and in line with what people expect, even if not totally optimal.

@StefanKarpinski
Contributor

This is starting to get embarrassing. Yet another thread about unexpectedly slow OpenBLAS performance: https://groups.google.com/forum/#!forum/julia-users. The solution is to manually set the OpenBLAS threading threshold to something less ridiculous than 4.

@StefanKarpinski
Contributor

Sorry, I miswrote that – you can't actually set the threshold to something more reasonable. Instead, you have to choose – on a computation-by-computation basis – whether to set OPENBLAS_NUM_THREADS=1 or not. This is a crazy and completely unreasonable interface. Excusing this terrible performance by saying

But to create appropriate rules, we will need a lot of statistical data.

is complete nonsense. You can certainly collect statistical data all you want, but there simply isn't a "right" answer to this question. Whatever you pick is going to be imperfect on most systems. But at least it won't be as completely clearly wrong as 4. Literally anything would be better than 4. Except for 1, 2 and 3. At least make a guess that's not obviously dumb – 4 is an absurd threshold. If you change this to something like 150, then it will at least be close to right on many systems.

@JeffBezanson
Author

Why was this closed?

@wernsaar wernsaar reopened this Jun 4, 2014
@wernsaar
Contributor

wernsaar commented Jun 4, 2014

Hi,

Please excuse me.
I am reviewing all open issues and I accidentally closed this issue.
I reopened it as a feature request.

Sorry

Werner

@JeffBezanson JeffBezanson changed the title tune threading cutoff increase default threading cutoff Jun 4, 2014
@JeffBezanson
Author

OK.

I think my original issue title was bad. Carefully "tuning" the threading cutoff is an interesting and difficult problem, but that's not what I want here. I just want the default value to be higher than 4, which is insane. Trying to use threads for 5x5 matrices is essentially a bug. The value should just be raised to 20 or 40 or so, then I'll be much happier.

@wernsaar
Contributor

Hi,

I just ran dgemm tests on very small matrix sizes and found that if M*N*K < 4096 (16*16*16), the simple algorithm used in the Netlib reference BLAS is faster than single-threaded OpenBLAS. The reason is that OpenBLAS has the overhead of first copying/transposing the input matrices, which costs a lot of read/write operations. I am thinking about writing special gemm kernels for such small sizes.
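
For concreteness, the kind of "simple algorithm" meant here is a plain triple loop over column-major data with no packing or transposition; a minimal sketch (illustrative only, not the Netlib source):

    /* Naive column-major C = alpha*A*B + beta*C for tiny matrices. */
    static void dgemm_tiny(int m, int n, int k, double alpha,
                           const double *A, int lda,
                           const double *B, int ldb,
                           double beta, double *C, int ldc) {
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                double acc = 0.0;
                for (int p = 0; p < k; p++)
                    acc += A[i + p * lda] * B[p + j * ldb];
                C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
            }
        }
    }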

Best regards
Werner

@StefanKarpinski
Contributor

Rather than writing special kernels, which is a significant task, could you please just increase the threshold to something reasonable first?

@wernsaar
Contributor

Hi,

I can simply increase the threshold, but I think that this will not solve the problem. We need a better estimate of how many threads should be used, depending on the sizes of M, N, and K.

I found two important points:

The first point is at M*N*K < 4096, where a simple gemm kernel is faster than the OpenBLAS kernel.

The second point seems to be at M*N*K < 262144 (64*64*64), where a single-threaded kernel is faster than a multithreaded kernel. But at this time I don't have enough statistical data for non-square matrix sizes.
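
Combining the two crossover points into one decision rule would look roughly like this (a sketch with the constants measured above; real cutoffs would need per-CPU calibration and handling of non-square shapes):

    enum gemm_path { GEMM_SIMPLE, GEMM_SINGLE, GEMM_THREADED };

    /* Tiny problems: naive kernel. Small problems: optimized kernel on
     * one thread. Everything else: threaded. */
    static enum gemm_path pick_gemm_path(long m, long n, long k) {
        long mnk = m * n * k;
        if (mnk < 4096L)   return GEMM_SIMPLE;   /* 16*16*16 crossover */
        if (mnk < 262144L) return GEMM_SINGLE;   /* 64*64*64 crossover */
        return GEMM_THREADED;
    }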

Best regards

Werner

@StefanKarpinski
Contributor

Yes, it won't entirely solve the problem, but it will make it MUCH BETTER in the meantime. Please, increase the threshold to something more reasonable and then work on a more complete solution.

@wernsaar
Contributor

Based on some statistical tests, I modified interface/gemm.c to handle small matrix dimensions with GEMM_MULTITHREAD_THRESHOLD=4. I think that reasonable values for GEMM_MULTITHREAD_THRESHOLD are in the range from 2 to 8.
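
For reference, the small-size handling in interface/gemm.c took roughly this shape (a paraphrase, with SMP_THRESHOLD_MIN standing in for OpenBLAS's internal constant; exact names and values vary between versions): the threshold now scales a volume cutoff instead of acting on a single dimension.

    #define SMP_THRESHOLD_MIN 65536.0       /* stand-in for the internal constant */
    #define GEMM_MULTITHREAD_THRESHOLD 4

    /* Skip threading when the total GEMM volume m*n*k is below the
     * scaled threshold. */
    static int gemm_nthreads(double m, double n, double k, int max_threads) {
        if (m * n * k <= SMP_THRESHOLD_MIN * GEMM_MULTITHREAD_THRESHOLD)
            return 1;
        return max_threads;
    }

Note that with the default threshold of 4 this places the single-thread cutoff at 4 * 65536 = 262144 = 64*64*64 elements, matching the second crossover point measured above.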

@ViralBShah
Contributor

In Julia, we have been using 50 as the threshold, which works well. Given that there is a way to address this in the build system, this could be closed.

@martin-frbg
Collaborator

I find it worrying that your threshold of 50 is so far from the current default, and if anything the threading behaviour has become more complicated, with individual magic numbers used as threshold values in specific BLAS functions. So I am hesitant to close this; at the very least, the description of this parameter in Makefile.rule should be amended to make it clear that the sane range of values extends to (or, as it seems some think, begins around) 50.
On reviewing this thread, there was also a comment by jiahao (#103 (comment)) about improvements seen from choosing a different OpenMP loop scheduler in blas_server_omp, which sadly elicited no response. Thoughts?

@ViralBShah
Contributor

I'm not familiar with the OpenMP work, but one issue is that OpenMP is not available on all platform/OS combinations.

@martin-frbg
Collaborator

one issue is that OpenMP is not available on all the platforms

I am certainly aware of that. I see OpenMP in this context as a subtopic that got buried quickly the first time it came up, and I would like to get some comment on it before it gets buried again. A speed gain from switching away from the "static" scheduler would appear to suggest an unbalanced workload where some threads take longer than others. (Though the example given in JuliaLang/julia#3239 (comment) was a bit vague, in that it picked the auto scheduler without determining what the compiler chose.)
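
For context, a minimal self-contained illustration of the scheduler choice under discussion (not blas_server_omp.c itself): schedule(runtime) defers the policy to the OMP_SCHEDULE environment variable, so static, dynamic, and guided can be compared without rebuilding.

    #include <omp.h>

    /* The runtime schedule reads OMP_SCHEDULE at program start, e.g.
     * OMP_SCHEDULE="dynamic,64", making scheduler A/B tests cheap. */
    void scaled_add(int n, double alpha, const double *x, double *y) {
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }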

@martin-frbg
Collaborator

martin-frbg commented Jun 15, 2018

My own experiments with the OMP scheduler were a bit inconclusive - which scheduler works best seems to depend very much on the function and workload - so #1620 makes the choice available at build time (or, by picking the runtime scheduler, even at runtime through setting OMP_SCHEDULE). I have also documented the changed meaning of GEMM_MULTITHREAD_THRESHOLD in Makefile.rule, along with Julia's use of the value 50 there.

@nalimilan

Wouldn't it make sense to increase the default value of GEMM_MULTITHREAD_THRESHOLD? Given how far it is from Julia's choice, either OpenBLAS or Julia uses a bad value.

@martin-frbg
Collaborator

So far there does not appear to be consensus on what the good value is. The last time the threshold was changed, it was reverted quickly (#216), though inconsistent use of the variable in GEMM vs. GEMV may have been an issue back then. With new contributors taking a fresh look at the threading behaviour of OpenBLAS, I hope this will eventually be resolved.

martin-frbg added a commit that referenced this issue Oct 19, 2020