
Eig is slow compared to mathematica and octave #72

Closed

timholy opened this issue Feb 8, 2014 · 33 comments
Labels
performance: Must go faster
upstream: The issue is with an upstream dependency, e.g. LLVM

Comments

@timholy
Member

timholy commented Feb 8, 2014

On StackOverflow a user pointed out that eig seems slow compared to Mathematica: http://stackoverflow.com/questions/21641621/eigendecompositions-are-5-times-slower-in-julia-than-in-mathematica

I compared against Matlab (I don't have Mathematica), and it's about a factor of 2 slower. I profiled it, and it indicates that all the time is in linalg/lapack.jl; geevx!; line: 1209. I briefly read the docs for dgeevx, and I confess I don't understand why that ccall is being made twice.

Moreover, eig doesn't accept keywords, but ?eig suggests it should take a balance keyword.

@andreasnoack
Member

On my Mac, Julia is the faster of the two. A guess could be that he uses the reference BLAS and LAPACK for Julia, but I haven't found a way to make Mathematica do the same.

The first ccall determines the correct size for the workspace arrays and the second does the actual calculation.
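
For readers unfamiliar with the LAPACK convention, here is a minimal, self-contained sketch of that two-call workspace-query pattern. The routine below is a made-up stand-in (not dgeevx or the actual geevx! wrapper); it only illustrates why the first call uses lwork = -1 and the second does the work:

# Hypothetical stand-in mimicking the LAPACK workspace-query convention.
# The first call with lwork = -1 only reports the optimal workspace size
# in work[1]; the second call performs the actual computation.
function fake_lapack!(A, work, lwork)
    n = size(A, 1)
    if lwork == -1
        work[1] = 4n          # report the "optimal" workspace size
        return A
    end
    @assert length(work) >= 4n
    # ... a real routine would compute the decomposition here ...
    return A
end

A = rand(5, 5)
work = zeros(1)
fake_lapack!(A, work, -1)               # first call: workspace-size query
resize!(work, Int(work[1]))             # allocate the recommended workspace
fake_lapack!(A, work, length(work))     # second call: actual computation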

Regarding the keywords, that is my mistake. I am preparing a pull request. I also forgot eigvals.

@timholy
Member Author

timholy commented Feb 8, 2014

Thanks for explaining re the ccall, and for the upcoming doc fixes.

@DumpsterDoofus

So I asked on MathematicaSE, and apparently Mathematica on Windows uses Intel MKL for its BLAS. I am using Julia 0.2.0 with OpenBLAS, which seems 3 to 6 times slower than Mathematica on my machine. Is this reasonable?

If I change Julia to use the Intel MKL, will it speed up? Or are more complicated things at play?

@andreasnoack
Member

The performance of OpenBLAS depends on the architecture, but it seems that something is not right on Windows. I have just tried your example on a Windows server, and the result depends on the number of threads used, unfortunately in the wrong direction. I get:

julia> blas_set_num_threads(1)
julia> @time (E,F)=eig(D);
elapsed time: 3.576688676 seconds (80089088 bytes allocated)
julia> blas_set_num_threads(2)
julia> @time (E,F)=eig(D);
elapsed time: 9.650217604 seconds (80089088 bytes allocated)
julia> blas_set_num_threads(3)
julia> @time (E,F)=eig(D);
elapsed time: 10.071440396 seconds (80089088 bytes allocated)

On the same server, Mathematica takes 5-6 seconds.

If you use MKL when compiling Julia, I think you should expect timings comparable to Mathematica.

My info is

Julia Version 0.3.0-prerelease+1400
Commit 6f3a4b6* (2014-02-05 19:14 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

cc @xianyi

@DumpsterDoofus

@andreasnoack: That's a strange issue, and I can reproduce it on my machine:

D=1000*(rand(1000,1000)-0.5);
blas_set_num_threads(1)
@time (E,F)=eig(D);
blas_set_num_threads(1)
@time (E,F)=eig(D);
blas_set_num_threads(2)
@time (E,F)=eig(D);
blas_set_num_threads(2)
@time (E,F)=eig(D);
blas_set_num_threads(3)
@time (E,F)=eig(D);
blas_set_num_threads(3)
@time (E,F)=eig(D);

Output:

elapsed time: 1.770069354 seconds (79719608 bytes allocated)
elapsed time: 1.705695403 seconds (79719608 bytes allocated)
elapsed time: 6.403318095 seconds (79719608 bytes allocated)
elapsed time: 6.3963143 seconds (79719608 bytes allocated)
elapsed time: 6.997508242 seconds (79719608 bytes allocated)
elapsed time: 7.00552684 seconds (79719608 bytes allocated)

Is this some kind of bug? Shouldn't it be faster when more threads are dedicated? Also, the timings above are really strange: before I changed the number of threads, it was routinely taking 7.5 seconds to compute! Let me play around a little more and see what I can find.

@DumpsterDoofus

Okay, I restarted the kernel and once again ran

D=1000*(rand(1000,1000)-0.5);
@time (E,F)=eig(D);

and it took 7.4 seconds on average. After setting blas_set_num_threads(1) and executing, it took 1.7 seconds on average, which is roughly what I get on Mathematica. So something is definitely wrong with the default settings for BLAS, or at least there's a problem on my and andreasnoackjensen's builds of Julia.

So I don't know much about Julia, but is there a way to determine the default number of BLAS threads used by the kernel?
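
(A sketch of how to inspect and pin the thread count on a more recent Julia, assuming version 1.6 or later where LinearAlgebra.BLAS exposes a query; in the 0.2/0.3 era only blas_set_num_threads was available, and the default came from OPENBLAS_NUM_THREADS or the number of cores detected.)

using LinearAlgebra
println("default BLAS threads: ", BLAS.get_num_threads())
BLAS.set_num_threads(1)                 # pin to one thread, as in the timings above
println("now using: ", BLAS.get_num_threads())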

@andreasnoack
Member

Would someone with Windows access please try and see if this is still an issue?

@tkelman

tkelman commented Nov 19, 2015

Worse, actually; now it segfaults:

julia> versioninfo()
Julia Version 0.4.2-pre+12
Commit 9e598b6* (2015-11-19 02:34 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> D=1000*(rand(1000,1000)-0.5);

julia> blas_set_num_threads(1)

julia> @time (E,F)=eig(D);

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x4fbcec0 -- openblas_get_num_threads64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
openblas_get_num_threads64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
dgemv64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
dlahr264_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
DGEHRD64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
DGEEVX64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
geevx! at linalg/lapack.jl:1726
eigfact! at linalg/eigen.jl:32
julia_eigfact!_1551 at  (unknown line)
eigfact at linalg/eigen.jl:57
eig at linalg/eigen.jl:66
julia_eig_1546 at  (unknown line)
jl_apply_generic at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_expr at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_thunk_with at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_thunk_with at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_expr_in at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_thunk_with at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_eval_with_compiler_p at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_f_tuple at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
eval_user_input at REPL.jl:62
jlcall_eval_user_input_1308 at  (unknown line)
jl_apply_generic at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
anonymous at REPL.jl:92
jl_switchto at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)

@tkelman

tkelman commented Nov 19, 2015

The backtrace might be a bit wrong there; in gdb it looks like this:

julia> @time (E,F)=eig(D);

Program received signal SIGSEGV, Segmentation fault.
0x0000000004a4cec0 in blas_memory_free () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
(gdb) bt
#0  0x0000000004a4cec0 in blas_memory_free () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#1  0x0000000004800738 in dgemv_ () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#2  0x000000000622957b in dlahr2_ () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#3  0x00000000061e96e9 in dgehrd_ () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#4  0x00000000061e477a in dgeevx_ () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#5  0x000000000e57cb9f in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

@andreasnoack
Member

Do you only get the segfault when running single threaded?

@tkelman

tkelman commented Nov 19, 2015

No, it segfaults with blas_set_num_threads(4) too.

@andreasnoack
Member

It's a bit surprising that no users have complained about this.

@tkelman

tkelman commented Nov 19, 2015

And that the tests didn't catch this as soon as we upgraded to 0.2.15. I also tried an 0.4 RC that was using 0.2.14, and the performance looked similar to what was reported above. For @xianyi's benefit, can we come up with standalone test cases in C or Fortran that would call dgeevx64_ equivalently? Here's how the OpenBLAS library was built:

make -C openblas CC=x86_64-w64-mingw32-gcc -march=x86-64 -m64 FC=x86_64-w64-mingw32-gfortran -march=x86-64 -m64 RANLIB=x86_64-w64-mingw32-ranlib FFLAGS= -O2  TARGET= BINARY=64 USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=50 NUM_THREADS=16 NO_AFFINITY=1 DYNAMIC_ARCH=1 INTERFACE64=1 SYMBOLSUFFIX=64_ LIBPREFIX=libopenblas64_ OSNAME=WINNT CROSS=1 HOSTCC=gcc

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... WINNT
  Architecture     ... x86_64
  BINARY           ... 64bit
  Use 64 bits int    (equivalent to "-i8" in Fortran)
  C compiler       ... GCC  (command line : x86_64-w64-mingw32-gcc -march=x86-64 -m64)
  Fortran compiler ... GFORTRAN  (command line : x86_64-w64-mingw32-gfortran -march=x86-64 -m64)
  Library Name     ... libopenblas64_p-r0.2.15.a (Multi threaded; Max num-threads is 16)

@andreasnoack
Member

> For @xianyi's benefit, can we come up with standalone test cases in C or Fortran that would call dgeevx64_ equivalently?

Yes, we should do that, but it's a bit of a pain because it's an "expert driver routine" and therefore takes 23 arguments.

@tkelman

tkelman commented Nov 19, 2015

We should totally write an automatic test-case-generator that converts a given ccall invocation into standalone C. https://xkcd.com/974

@tkelman

tkelman commented Nov 23, 2015

Reported the segfault upstream at OpenMathLib/OpenBLAS#697. The simplified dgeev call was enough to make this happen (64-bit Cygwin's Octave hits the same segfault, and it uses dgeev rather than dgeevx for eig).
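
The same simplified reproduction can be driven from Julia's own wrappers; a sketch, assuming the stdlib LAPACK bindings (LinearAlgebra.LAPACK on 1.x, Base.LinAlg.LAPACK in the 0.4 era), which expose geev! for dgeev:

using LinearAlgebra
D = 1000 .* (rand(1000, 1000) .- 0.5)
# 'N': skip the left eigenvectors, 'V': compute the right eigenvectors,
# i.e. call dgeev directly instead of the dgeevx expert driver.
out = LinearAlgebra.LAPACK.geev!('N', 'V', copy(D))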

@ViralBShah
Member

On my Mac, eig(rand(1500,1500)) takes 15 seconds with one BLAS thread, and 9 seconds if I use the default multi-threading in OpenBLAS. Octave's [d,v] = eig(rand(1500,1500)) takes 5 seconds, where I am using Octave from Homebrew.

@ViralBShah ViralBShah added the performance Must go faster label Feb 17, 2016
@ViralBShah
Member

Removing the Windows label since this is the case on Mac as well. @andreasnoack Can you verify once more?

@ViralBShah ViralBShah removed the system:windows Affects only Windows label Feb 17, 2016
@ViralBShah ViralBShah changed the title openblas thread performance on windows Eig is slow compared to mathematica and octave Feb 17, 2016
@tkelman

tkelman commented Feb 18, 2016

I have a theory. This might not be an OpenBLAS-vs-MKL problem; it might be a gfortran-vs-ifort problem. Can anyone on Linux or OS X who has access to the Intel compilers try building OpenBLAS using icc and ifort, and compare that to MKL/Accelerate?

@ViralBShah
Member

Cc @ranjanan

@ViralBShah
Member

I have been thinking the same. It's hard to believe that OpenBLAS's dgemm is good for a few LAPACK routines and not good for others.

@ViralBShah
Member

For the same a = rand(1500,1500), if I use SYSTEM_BLAS and SYSTEM_LAPACK on the Mac, I get the same time as Octave, 5 seconds, as opposed to 9 seconds with multi-threaded OpenBLAS and 14 seconds with single-threaded OpenBLAS:

julia> @time eig(a);
  4.824313 seconds (23.77 k allocations: 171.501 MB, 1.51% gc time)

All those allocations are perhaps due to type instability, for which I believe there are other open issues, but I don't think that is the reason for the poorer performance with OpenBLAS.
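
One quick way to check the type-instability suspicion (a sketch; any modest size works, and on Julia 1.x eig has become eigen):

# Non-concrete (Any) entries in the output would indicate type instability.
A = rand(100, 100)
@code_warntype eig(A)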

@andreasnoack
Member

I think this is OpenBLAS. The Schur factorization is not GEMM-heavy; it has a lot of GEMV, GER, and some TRMM (as well as a lot of special bulge-chasing operations that are not BLAS-heavy). Furthermore, many of the operations are on smaller pieces of the matrix, and there the difference between vecLib and OpenBLAS can be significant. See e.g.:

julia> size(A)
(50,50)

# OpenBLAS
julia> @time for i = 1:10000; BLAS.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  0.190648 seconds (10.00 k allocations: 5.188 MB)

# VecLib
julia> @time for i = 1:10000; TestGees.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  0.006144 seconds (10.00 k allocations: 5.188 MB)

julia> 0.190648/0.006144
31.029947916666668

julia> norm(TestGees.gemv!('N', 1.0, A, v, 0.0, similar(v)) - BLAS.gemv!('N', 1.0, A, v, 0.0, similar(v)))
7.25785747276179e-15

@KristofferC
Member

Your OpenBLAS seems awfully slow. I get:

julia> @time for i = 1:10000; BLAS.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  0.005647 seconds (10.00 k allocations: 5.188 MB)

@andreasnoack
Member

@KristofferC What is your versioninfo()? Mine is

julia> versioninfo()
Julia Version 0.5.0-dev+2702
Commit 7c9c69f* (2016-02-17 00:38 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.3.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY HASWELL)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1

@KristofferC
Member

julia> versioninfo()
Julia Version 0.5.0-dev+2561
Commit 9091855 (2016-02-09 02:29 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1

@andreasnoack
Member

There is something weird happening on OS X (and maybe Windows; I don't have a machine handy). On a Linux Haswell machine I get timings similar to yours, but we have reproduced the slow timings on two different Macs in the office.

I don't have MKL on the Linux Haswell machine, but I do have it on a Linux Westmere machine, and there I don't see a difference between OpenBLAS and MKL for eig. @KristofferC, do you have MKL on the Haswell machine? If so, do you see a difference between MKL and OpenBLAS for eig?
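
A small, repeatable harness one could run on both builds for this comparison (a sketch; assumes a current Julia where eig has become eigen and blas_set_num_threads is BLAS.set_num_threads, but the same idea works with the old names):

using LinearAlgebra, Random
Random.seed!(1234)                       # same matrix on every build
A = 1000 .* (rand(1000, 1000) .- 0.5)
eigen(copy(A))                           # warm-up run to exclude compilation
for nthreads in (1, 2, 4)
    BLAS.set_num_threads(nthreads)
    t = @elapsed eigen(copy(A))
    println("threads = $nthreads: $(round(t, digits = 3)) s")
end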

@xianyi

xianyi commented Feb 19, 2016

@andreasnoack

The similar(v) in the loop means that a new vector is created on every iteration. Am I right?

@andreasnoack
Member

@xianyi Yes. It creates a new vector to avoid overwriting v.
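
For what it's worth, here is a variant of the benchmark that preallocates the output, so the loop times gemv! alone rather than allocation and GC (a sketch with the same 50x50 size; A and v are defined here to keep it self-contained):

using LinearAlgebra                      # BLAS.gemv! (Base.LinAlg.BLAS on 0.x)
A = rand(50, 50); v = rand(50); y = similar(v)
@time for i = 1:10_000
    BLAS.gemv!('N', 1.0, A, v, 0.0, y)   # y is overwritten in place
end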

xianyi referenced this issue in OpenMathLib/OpenBLAS Feb 19, 2016
On Mac OS X, it should use .align 4 (equal to .align 16 on Linux).
I didn't get the performance benefit from .align. Thus, I deleted it.
@xianyi

xianyi commented Feb 19, 2016

I think I fixed this performance bug on the OpenBLAS develop branch.

The .align 16 used on Linux should be .align 4 on Mac OS X (the OS X assembler interprets the operand as a power-of-two exponent, so .align 4 means 16-byte alignment).

@andreasnoack
Member

@xianyi Great. I see a big improvement for OpenBLAS, but there is still a 3.5x difference between OpenBLAS and vecLib. Do you think it is possible to reduce that difference?

julia> @time for i = 1:1000000; BLAS.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  1.989242 seconds (1000.00 k allocations: 518.799 MB, 0.89% gc time)

julia> @time for i = 1:1000000; Tmp1.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  0.585461 seconds (1000.00 k allocations: 518.799 MB, 2.82% gc time)

@ViralBShah
Member

@andreasnoack Do we still have this issue with OpenBLAS 0.3.0?

@ViralBShah
Member

ViralBShah commented May 30, 2018

This is now faster for me than Octave from Homebrew: Julia takes about 2.5 seconds and Octave takes 3 seconds.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024