
Eig is slow compared to mathematica and octave #72

Closed

timholy opened this issue Feb 8, 2014 · 33 comments
Labels
performance: Must go faster
upstream: The issue is with an upstream dependency, e.g. LLVM

Comments

@timholy
Member

timholy commented Feb 8, 2014

On StackOverflow a user pointed out that eig seems slow compared to Mathematica: http://stackoverflow.com/questions/21641621/eigendecompositions-are-5-times-slower-in-julia-than-in-mathematica

I compared against Matlab (I don't have Mathematica), and it's about a factor of 2 slower. I profiled it, and it indicates that all the time is in linalg/lapack.jl; geevx!; line: 1209. I briefly read the docs for dgeevx, and I confess I don't understand why that ccall is being made twice.

Moreover, eig doesn't accept keywords, but ?eig suggests it should take a balance keyword.

@andreasnoack
Member

On my Mac, Julia is the faster of the two. A guess could be that he uses the reference BLAS and LAPACK for Julia, but I haven't found a way to make Mathematica do the same.

The first ccall determines the correct size for the workspace arrays and the second does the actual calculation.
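
For readers unfamiliar with the LAPACK convention, here is a minimal, self-contained sketch of that two-call workspace-query pattern. The routine below is a made-up stand-in (not dgeevx or the actual geevx! wrapper); it only illustrates why the first call uses lwork = -1 and the second does the work:

# Hypothetical stand-in mimicking the LAPACK workspace-query convention.
# The first call with lwork = -1 only reports the optimal workspace size
# in work[1]; the second call performs the actual computation.
function fake_lapack!(A, work, lwork)
    n = size(A, 1)
    if lwork == -1
        work[1] = 4n          # report the "optimal" workspace size
        return A
    end
    @assert length(work) >= 4n
    # ... a real routine would compute the decomposition here ...
    return A
end

A = rand(5, 5)
work = zeros(1)
fake_lapack!(A, work, -1)               # first call: workspace-size query
resize!(work, Int(work[1]))             # allocate the recommended workspace
fake_lapack!(A, work, length(work))     # second call: actual computation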

Regarding the keywords, that is my mistake. I am preparing a pull request. I also forgot eigvals.

@timholy
Member Author

timholy commented Feb 8, 2014

Thanks for explaining re the ccall, and for the upcoming doc fixes.

@DumpsterDoofus

So I asked on MathematicaSE, and apparently Mathematica on Windows uses Intel MKL for its BLAS. I am using Julia 0.2.0 with OpenBLAS, which seems 3 to 6 times slower than Mathematica on my machine. Is this reasonable?

If I change Julia to use the Intel MKL, will it speed up? Or are more complicated things at play?

@andreasnoack
Member

The performance of OpenBLAS depends on the architecture, but it seems that something is not right on Windows. I have just tried your example on a Windows server, and the result depends on the number of threads used, unfortunately in the wrong direction. I get:

julia> blas_set_num_threads(1)
julia> @time (E,F)=eig(D);
elapsed time: 3.576688676 seconds (80089088 bytes allocated)
julia> blas_set_num_threads(2)
julia> @time (E,F)=eig(D);
elapsed time: 9.650217604 seconds (80089088 bytes allocated)
julia> blas_set_num_threads(3)
julia> @time (E,F)=eig(D);
elapsed time: 10.071440396 seconds (80089088 bytes allocated)

On the same server, Mathematica takes 5-6 seconds.

If you use MKL when compiling Julia, I think you should expect timings comparable to Mathematica.

My info is

Julia Version 0.3.0-prerelease+1400
Commit 6f3a4b6* (2014-02-05 19:14 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

cc @xianyi

@DumpsterDoofus

@andreasnoack: That's a strange issue, and I can reproduce it on my machine:

D=1000*(rand(1000,1000)-0.5);
blas_set_num_threads(1)
@time (E,F)=eig(D);
blas_set_num_threads(1)
@time (E,F)=eig(D);
blas_set_num_threads(2)
@time (E,F)=eig(D);
blas_set_num_threads(2)
@time (E,F)=eig(D);
blas_set_num_threads(3)
@time (E,F)=eig(D);
blas_set_num_threads(3)
@time (E,F)=eig(D);

Output:

elapsed time: 1.770069354 seconds (79719608 bytes allocated)
elapsed time: 1.705695403 seconds (79719608 bytes allocated)
elapsed time: 6.403318095 seconds (79719608 bytes allocated)
elapsed time: 6.3963143 seconds (79719608 bytes allocated)
elapsed time: 6.997508242 seconds (79719608 bytes allocated)
elapsed time: 7.00552684 seconds (79719608 bytes allocated)

Is this some kind of bug? Shouldn't it be faster when more threads are dedicated? Also, the timings above are really strange: before I changed the number of threads, it was routinely taking 7.5 seconds to compute! Let me play around a little more and see what I can find.

@DumpsterDoofus

Okay, I restarted the kernel and once again ran

D=1000*(rand(1000,1000)-0.5);
@time (E,F)=eig(D);

and it took 7.4 seconds on average. After setting blas_set_num_threads(1) and executing, it took 1.7 seconds on average, which is roughly what I get on Mathematica. So something is definitely wrong with the default settings for BLAS, or at least there's a problem on my and andreasnoackjensen's builds of Julia.

So I don't know much about Julia, but is there a way to determine the default number of BLAS threads used by the kernel?
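
(A sketch of how to inspect and pin the thread count on a more recent Julia, assuming version 1.6 or later where LinearAlgebra.BLAS exposes a query; in the 0.2/0.3 era only blas_set_num_threads was available, and the default came from OPENBLAS_NUM_THREADS or the number of cores detected.)

using LinearAlgebra
println("default BLAS threads: ", BLAS.get_num_threads())
BLAS.set_num_threads(1)                 # pin to one thread, as in the timings above
println("now using: ", BLAS.get_num_threads())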

@andreasnoack
Member

Would someone with Windows access please try and see if this is still an issue?

@tkelman

tkelman commented Nov 19, 2015

Worse, actually; now it segfaults:

julia> versioninfo()
Julia Version 0.4.2-pre+12
Commit 9e598b6* (2015-11-19 02:34 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> D=1000*(rand(1000,1000)-0.5);

julia> blas_set_num_threads(1)

julia> @time (E,F)=eig(D);

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x4fbcec0 -- openblas_get_num_threads64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
openblas_get_num_threads64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
dgemv64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
dlahr264_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
DGEHRD64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
DGEEVX64_ at D:\cygwin64\home\Tony\julia\usr\bin\libopenblas64_.DLL (unknown line)
geevx! at linalg/lapack.jl:1726
eigfact! at linalg/eigen.jl:32
julia_eigfact!_1551 at  (unknown line)
eigfact at linalg/eigen.jl:57
eig at linalg/eigen.jl:66
julia_eig_1546 at  (unknown line)
jl_apply_generic at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_expr at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_thunk_with at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_thunk_with at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_expr_in at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_thunk_with at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_eval_with_compiler_p at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
jl_f_tuple at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
eval_user_input at REPL.jl:62
jlcall_eval_user_input_1308 at  (unknown line)
jl_apply_generic at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)
anonymous at REPL.jl:92
jl_switchto at D:\cygwin64\home\Tony\julia\usr\bin\libjulia.dll (unknown line)

@tkelman

tkelman commented Nov 19, 2015

The backtrace might be a bit wrong there; in gdb it looks like this:

julia> @time (E,F)=eig(D);

Program received signal SIGSEGV, Segmentation fault.
0x0000000004a4cec0 in blas_memory_free () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
(gdb) bt
#0  0x0000000004a4cec0 in blas_memory_free () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#1  0x0000000004800738 in dgemv_ () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#2  0x000000000622957b in dlahr2_ () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#3  0x00000000061e96e9 in dgehrd_ () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#4  0x00000000061e477a in dgeevx_ () from /home/Tony/julia/usr/bin/libopenblas64_.DLL
#5  0x000000000e57cb9f in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

@andreasnoack
Member

Do you only get the segfault when running single threaded?

@tkelman

tkelman commented Nov 19, 2015

No, it segfaults with blas_set_num_threads(4) too.

@andreasnoack
Member

It's a bit surprising that no users have complained about this.

@tkelman

tkelman commented Nov 19, 2015

And that the tests didn't catch this as soon as we upgraded to 0.2.15. I also tried an 0.4 RC that was using 0.2.14, and the performance looked similar to what was reported above. For @xianyi's benefit, can we come up with standalone test cases in C or Fortran that would call dgeevx64_ equivalently? Here's how the OpenBLAS library was built:

make -C openblas CC=x86_64-w64-mingw32-gcc -march=x86-64 -m64 FC=x86_64-w64-mingw32-gfortran -march=x86-64 -m64 RANLIB=x86_64-w64-mingw32-ranlib FFLAGS= -O2  TARGET= BINARY=64 USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=50 NUM_THREADS=16 NO_AFFINITY=1 DYNAMIC_ARCH=1 INTERFACE64=1 SYMBOLSUFFIX=64_ LIBPREFIX=libopenblas64_ OSNAME=WINNT CROSS=1 HOSTCC=gcc

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... WINNT
  Architecture     ... x86_64
  BINARY           ... 64bit
  Use 64 bits int    (equivalent to "-i8" in Fortran)
  C compiler       ... GCC  (command line : x86_64-w64-mingw32-gcc -march=x86-64 -m64)
  Fortran compiler ... GFORTRAN  (command line : x86_64-w64-mingw32-gfortran -march=x86-64 -m64)
  Library Name     ... libopenblas64_p-r0.2.15.a (Multi threaded; Max num-threads is 16)

@andreasnoack
Member

> For @xianyi's benefit, can we come up with standalone test cases in C or Fortran that would call dgeevx64_ equivalently?

Yes, we should do that, but it's a bit of a pain because it's an "expert driver routine" and therefore takes 23 arguments.

@tkelman

tkelman commented Nov 19, 2015

We should totally write an automatic test-case-generator that converts a given ccall invocation into standalone C. https://xkcd.com/974

@tkelman

tkelman commented Nov 23, 2015

Reported the segfault upstream at OpenMathLib/OpenBLAS#697. The simplified dgeev call was enough to make this happen (64-bit Cygwin's Octave hits the same segfault, and it uses dgeev rather than dgeevx for eig).
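
The same simplified reproduction can be driven from Julia's own wrappers; a sketch, assuming the stdlib LAPACK bindings (LinearAlgebra.LAPACK on 1.x, Base.LinAlg.LAPACK in the 0.4 era), which expose geev! for dgeev:

using LinearAlgebra
D = 1000 .* (rand(1000, 1000) .- 0.5)
# 'N': skip the left eigenvectors, 'V': compute the right eigenvectors,
# i.e. call dgeev directly instead of the dgeevx expert driver.
out = LinearAlgebra.LAPACK.geev!('N', 'V', copy(D))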

@ViralBShah
Member

On my Mac, eig(rand(1500,1500)) takes 15 seconds with one BLAS thread, and 9 seconds if I use the default multi-threading in OpenBLAS. Octave's [d,v] = eig(rand(1500,1500)) takes 5 seconds, where I am using Octave from Homebrew.

@ViralBShah ViralBShah added the performance Must go faster label Feb 17, 2016
@ViralBShah
Member

Removing the Windows label since this is the case on Mac as well. @andreasnoack Can you verify once more?

@ViralBShah ViralBShah removed the system:windows Affects only Windows label Feb 17, 2016
@ViralBShah ViralBShah changed the title openblas thread performance on windows Eig is slow compared to mathematica and octave Feb 17, 2016
@tkelman

tkelman commented Feb 18, 2016

I have a theory. This might not be an OpenBLAS-vs-MKL problem; it might be a gfortran-vs-ifort problem. Can anyone on Linux or OS X who has access to the Intel compilers try building OpenBLAS using icc and ifort, and compare that to MKL/Accelerate?

@ViralBShah
Member

Cc @ranjanan

@ViralBShah
Member

I have been thinking the same. It's hard to believe that OpenBLAS's dgemm is good for a few LAPACK routines and not good for others.

@ViralBShah
Member

For the same a = rand(1500,1500), if I use SYSTEM_BLAS and SYSTEM_LAPACK on the Mac, I get the same time as Octave, 5 seconds, as opposed to 9 seconds with multi-threaded OpenBLAS and 14 seconds with single-threaded OpenBLAS:

julia> @time eig(a);
  4.824313 seconds (23.77 k allocations: 171.501 MB, 1.51% gc time)

All those allocations are perhaps due to type instability, for which I believe there are other open issues, but I don't think that is the reason for the poorer performance with OpenBLAS.
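
One quick way to check the type-instability suspicion (a sketch; any modest size works, and on Julia 1.x eig has become eigen):

# Non-concrete (Any) entries in the output would indicate type instability.
A = rand(100, 100)
@code_warntype eig(A)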

@andreasnoack
Member

I think this is OpenBLAS. The Schur factorization is not GEMM-heavy; it has a lot of GEMV, GER, and some TRMM (as well as a lot of special bulge-chasing operations that are not BLAS-heavy). Furthermore, many of the operations are on smaller pieces of the matrix, and there the difference between vecLib and OpenBLAS can be significant. See e.g.:

julia> size(A)
(50,50)

# OpenBLAS
julia> @time for i = 1:10000; BLAS.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  0.190648 seconds (10.00 k allocations: 5.188 MB)

# VecLib
julia> @time for i = 1:10000; TestGees.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  0.006144 seconds (10.00 k allocations: 5.188 MB)

julia> 0.190648/0.006144
31.029947916666668

julia> norm(TestGees.gemv!('N', 1.0, A, v, 0.0, similar(v)) - BLAS.gemv!('N', 1.0, A, v, 0.0, similar(v)))
7.25785747276179e-15

@KristofferC
Member

Your OpenBLAS seems awfully slow. I get:

julia> @time for i = 1:10000; BLAS.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  0.005647 seconds (10.00 k allocations: 5.188 MB)

@andreasnoack
Member

@KristofferC What is your versioninfo()? Mine is

julia> versioninfo()
Julia Version 0.5.0-dev+2702
Commit 7c9c69f* (2016-02-17 00:38 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.3.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY HASWELL)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1

@KristofferC
Member

julia> versioninfo()
Julia Version 0.5.0-dev+2561
Commit 9091855 (2016-02-09 02:29 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1

@andreasnoack
Member

There is something weird happening on OS X (and maybe Windows; I don't have a machine handy). On a Linux Haswell machine I get timings similar to yours, but we have reproduced the slow timings on two different Macs in the office.

I don't have MKL on the Linux Haswell machine, but I do have it on a Linux Westmere machine, and there I don't see a difference between OpenBLAS and MKL for eig. @KristofferC, do you have MKL on the Haswell machine? If so, do you see a difference between MKL and OpenBLAS for eig?
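
A small, repeatable harness one could run on both builds for this comparison (a sketch; assumes a current Julia where eig has become eigen and blas_set_num_threads is BLAS.set_num_threads, but the same idea works with the old names):

using LinearAlgebra, Random
Random.seed!(1234)                       # same matrix on every build
A = 1000 .* (rand(1000, 1000) .- 0.5)
eigen(copy(A))                           # warm-up run to exclude compilation
for nthreads in (1, 2, 4)
    BLAS.set_num_threads(nthreads)
    t = @elapsed eigen(copy(A))
    println("threads = $nthreads: $(round(t, digits = 3)) s")
end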

@xianyi

xianyi commented Feb 19, 2016

@andreasnoack

The similar(v) in the loop means that a new vector is created on every iteration. Am I right?

@andreasnoack
Member

@xianyi Yes. It creates a new vector to avoid overwriting v.
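
For what it's worth, here is a variant of the benchmark that preallocates the output, so the loop times gemv! alone rather than allocation and GC (a sketch with the same 50x50 size; A and v are defined here to keep it self-contained):

using LinearAlgebra                      # BLAS.gemv! (Base.LinAlg.BLAS on 0.x)
A = rand(50, 50); v = rand(50); y = similar(v)
@time for i = 1:10_000
    BLAS.gemv!('N', 1.0, A, v, 0.0, y)   # y is overwritten in place
end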

xianyi referenced this issue in OpenMathLib/OpenBLAS Feb 19, 2016
On Mac OS X, it should use .align 4 (equal to .align 16 on Linux).
I didn't get the performance benefit from .align. Thus, I deleted it.
@xianyi

xianyi commented Feb 19, 2016

I think I fixed this performance bug on the OpenBLAS develop branch.

The .align 16 used on Linux should be .align 4 on Mac OS X (the OS X assembler interprets the operand as a power-of-two exponent, so .align 4 means 16-byte alignment).

@andreasnoack
Member

@xianyi Great. I see a big improvement for OpenBLAS, but there is still a 3.5x difference between OpenBLAS and vecLib. Do you think it is possible to reduce that difference?

julia> @time for i = 1:1000000; BLAS.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  1.989242 seconds (1000.00 k allocations: 518.799 MB, 0.89% gc time)

julia> @time for i = 1:1000000; Tmp1.gemv!('N', 1.0, A, v, 0.0, similar(v)); end
  0.585461 seconds (1000.00 k allocations: 518.799 MB, 2.82% gc time)

@ViralBShah
Member

@andreasnoack Do we still have this issue with OpenBLAS 0.3.0?

@ViralBShah
Member

ViralBShah commented May 30, 2018

This is now faster for me than Octave from Homebrew: Julia takes about 2.5 seconds and Octave takes 3 seconds.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024