
Lapack tests slower with OPENBLAS_NUM_THREADS>1 #192

Closed
bashseb opened this issue Jan 29, 2013 · 9 comments



bashseb commented Jan 29, 2013

Hello,
thank you for providing the OpenBLAS lib. My interest is in using your optimized library. I'm not sure whether this is the correct place to ask such questions; if not, please let me know.

 OpenBLAS build complete.

  OS               ... Linux
  Architecture     ... x86_64
  BINARY           ... 64bit
  C compiler       ... GCC  (command line : gcc)
  Fortran compiler ... GFORTRAN  (command line : gfortran)
  Library Name     ... libopenblas_sandybridgep-r0.2.5.a (Multi threaded; Max num-threads is 32)

I'm trying to use OpenBLAS (git) and, as a first test, tried to compile lapack-3.4.2. I added BLASLIB = /path/to/libopenblas.a -lpthread to make.inc.
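
For reference, this is the make.inc fragment I mean (the path is just a placeholder for wherever the OpenBLAS build put the archive):

    # make.inc of lapack-3.4.2: point the build and tests at OpenBLAS
    BLASLIB = /path/to/libopenblas.a -lpthread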

I then ran the LAPACK tests with different values of export OPENBLAS_NUM_THREADS=i, doing make cleantesting and make each time. I can observe that i cores are utilized, but the test results for i>1 are slower than for i=1, and also slower than with the system BLAS (which beats OpenBLAS on most tests at i=1). Furthermore, a value of i=32 is interpreted as i=16.

I must obviously have done something wrong, but I don't know exactly how to narrow it down. I'd appreciate any hints.
I've put the results for i=32 on a pastebin: http://pastebin.com/BQATuymz


xianyi commented Jan 30, 2013

Hi @bashseb ,

Thank you for the feedback.

What's your CPU? I think it has hyper-threading enabled, which would explain why the performance at i=32 is the same as at i=16.

How do you run the LAPACK tests?

What's your system BLAS? Is it Intel MKL?

Xianyi


bashseb commented Jan 30, 2013

thanks @xianyi for your assistance. My CPU is a "Xeon(R) CPU E5-2690 0 @ 2.90GHz". It has 32 threads, 16 physical cores and 2 sockets. I just looked at htop and saw that 16 cores are fully utilized. Now I know that this is the expected behaviour.

I ran the LAPACK test suite that is included in lapack-3.4.2. I guess it's not really a benchmark, just a verification of whether the library works, but it does output runtime numbers (see the pastebin). I'm trying to run the HPL benchmark (http://www.netlib.org/benchmark/hpl/) and will report the numbers - is this a good idea?

My system BLAS is the default CentOS release 6.3 (Final) package, blas-devel-3.2.1-4.el6.x86_64. gcc is version 4.4.6 20120305 (Red Hat 4.4.6-4) (same for gfortran). Here is the log of the LAPACK tests with the system BLAS: http://pastebin.com/ATaUNti4.

EDIT: I looked at HPL and found it a bit confusing. Also, it requires MPI. I'd prefer a more basic benchmark. What's the easiest way to assess the speed of OpenBLAS vs. the system BLAS?

EDIT2: I guess this CPU counts as 'sandybridge'. I think in that case my compiler is too old and I need AVX support, right?

thanks a load!

bashseb closed this as completed Jan 30, 2013
bashseb reopened this Jan 30, 2013

xianyi commented Jan 30, 2013

Hi @bashseb ,

I just checked the outputs of the LAPACK test. Because OpenBLAS uses multithreading and the input matrices are very small, the performance is lower than with the default BLAS. This issue is the same as #103.

HPL is a very good benchmark. Yes, it needs MPI. I don't know of another basic benchmark for BLAS.
@zchothia, @ViralBShah, any comments?

I think Red Hat already applied the Sandy Bridge patch to this gcc version, so you don't need to update your compiler.

Xianyi


bashseb commented Jan 30, 2013

@xianyi thanks a lot. The HPL benchmarks clearly show a runtime advantage for OpenBLAS. I multiplied the default Ns by 100 and the NBs by 10, and tested with 1, 2 and 4 OpenBLAS threads and 4 OpenMPI jobs. I see a speed-up of more than a factor of 5. i=1 yields the best performance; this is probably because the problem is still quite small.
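
The concrete change was to the Ns/NBs lines of HPL.dat. Assuming the stock file shipped with HPL (values quoted from memory, so treat them as illustrative), the edit looked roughly like this:

    4                    # of problems sizes (N)
    2900 3000 3400 3500  Ns    (stock values 29 30 34 35, times 100)
    4                    # of NBs
    10 20 30 40          NBs   (stock values 1 2 3 4, times 10)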

bashseb closed this as completed Jan 30, 2013
ViralBShah commented

HPL is perhaps the best benchmark, but the MPI layer makes it difficult to isolate BLAS performance.

concretevitamin commented

Hi @xianyi -- can you elaborate on OpenBLAS's behavior when it encounters hyper-threading on a virtual machine (e.g. Amazon EC2 instances)?

A concrete scenario: let's say we have 4 physical cores and with HT, 8 threads (2 logical cores per physical core). Is it then valid to set OPENBLAS_NUM_THREADS from 1 up to 4?

What if I used export OPENBLAS_MAIN_FREE=1 -- is it then expected that OPENBLAS_NUM_THREADS=8 will be more performant than setting it to 4?


xianyi commented Aug 4, 2014

Hi @concretevitamin ,

It depends on your benchmark. For example, DGEMM and the other BLAS3 functions are compute-intensive, so they cannot benefit from HT. The performance with OPENBLAS_NUM_THREADS=8 may be slower than with 4.

Xianyi

ViralBShah commented

Yes, HPL is a good benchmark for a distributed machine, but it requires MPI and a whole host of other tuning to do seriously. If the goal is just to measure BLAS performance, I would simply time the DGEMMs, or run peakflops if you are using Julia, for example.
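
For example, a minimal C sketch along those lines (the matrix size and repeat count are arbitrary; build it against OpenBLAS and vary OPENBLAS_NUM_THREADS between runs):

    /* dgemm_bench.c -- time repeated DGEMM calls and report GFLOPS.
     * Build (paths/flags may differ on your system):
     *   gcc -O2 dgemm_bench.c -o dgemm_bench -lopenblas -lpthread -lm
     * Run with OPENBLAS_NUM_THREADS=1, 2, 4, ... and compare. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void) {
        const int n = 2000, reps = 5;
        double *A = malloc(sizeof(double) * n * n);
        double *B = malloc(sizeof(double) * n * n);
        double *C = malloc(sizeof(double) * n * n);
        for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        /* DGEMM performs 2*n^3 floating-point operations per call. */
        printf("%.2f GFLOPS\n", 2.0 * n * n * n * reps / sec / 1e9);
        free(A); free(B); free(C);
        return 0;
    }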

concretevitamin commented

Thanks for your response. In my case, I am wondering if GEQRF could benefit from HT.
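
One way to check empirically would be to time DGEQRF at different thread counts. A sketch, assuming OpenBLAS was built with its bundled LAPACKE interface (openblas_set_num_threads and LAPACKE_dgeqrf are the entry points I believe it exports):

    /* geqrf_bench.c -- time DGEQRF at 1, 2, 4 and 8 threads to see whether
     * the hyper-threaded range (5..8 on a 4-core/8-thread box) helps.
     * Build (flags may differ): gcc -O2 geqrf_bench.c -lopenblas -lpthread -lm */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <lapacke.h>

    void openblas_set_num_threads(int);  /* exported by OpenBLAS */

    int main(void) {
        const lapack_int n = 3000;
        double *A = malloc(sizeof(double) * n * n);
        double *tau = malloc(sizeof(double) * n);

        for (int threads = 1; threads <= 8; threads *= 2) {
            openblas_set_num_threads(threads);
            for (long i = 0; i < (long)n * n; i++)
                A[i] = rand() / (double)RAND_MAX;  /* fresh random input */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            LAPACKE_dgeqrf(LAPACK_COL_MAJOR, n, n, A, n, tau);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("%d thread(s): %.3f s\n", threads,
                   (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
        }
        free(A); free(tau);
        return 0;
    }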

