
LU and eigen routines slower than MKL #2795

Open
youyu3 opened this issue Aug 26, 2020 · 9 comments
Comments

@youyu3

youyu3 commented Aug 26, 2020

Hi, I'm running some performance comparisons between OpenBLAS and MKL for LU and eigen routines. I'm seeing that the OpenBLAS tests with, for example, dgetrf and dsyevd are about 3 times slower than MKL. These are multi-threaded tests run on a Skylake machine.

I wonder if you have any benchmark results vs. MKL and, if so, what they look like.

Thanks.
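A minimal sketch of the kind of comparison described, timing dgetrf and dsyevd through SciPy's LAPACK bindings (this is an illustrative harness with example sizes, not the author's actual benchmark):

```python
# Sketch: time dgetrf and dsyevd via SciPy's LAPACK bindings on an n x n
# matrix. Sizes and repeat counts are example values, not the thread's.
import time
import numpy as np
from scipy.linalg import lapack

def _time(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

def bench(n=1000, repeats=3):
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    sym = a + a.T  # dsyevd requires a symmetric matrix

    best_getrf = min(_time(lambda: lapack.dgetrf(a)) for _ in range(repeats))
    best_syevd = min(_time(lambda: lapack.dsyevd(sym)) for _ in range(repeats))
    return best_getrf, best_syevd

if __name__ == "__main__":
    getrf_s, syevd_s = bench()
    print(f"dgetrf: {getrf_s:.4f} s   dsyevd: {syevd_s:.4f} s")
```

Running the same script against a NumPy/SciPy stack linked to OpenBLAS and one linked to MKL gives the head-to-head numbers.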

@brada4
Contributor

brada4 commented Aug 27, 2020

Which Skylake? What size of inputs? Which OpenBLAS version? Any virtualisation?
Is DGEMM slower too? (i.e. to check whether the latest release improves/fixes anything)

@martin-frbg
Collaborator

Skylake (Haswell refresh without AVX512) or SkylakeX (with AVX512) ? DGEMM performance for the latter should be about on par with MKL if you use a very recent 0.3.x release or the current develop branch. At small problem sizes MKL will probably be faster as OpenBLAS only has a single threshold for switching from using a single thread to using all available cores, while MKL seems to increase the thread count more gradually to match the workload. (Also MKL may be using a more efficient LAPACK than the "netlib" reference implementation)
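The single-threshold behaviour mentioned above can be probed by pinning the OpenBLAS thread count before the BLAS library loads. A sketch, assuming NumPy is linked against OpenBLAS (the matrix size and thread count are example values):

```python
# Sketch: pin the OpenBLAS thread count. The environment variable must be
# set before NumPy (and its BLAS) is imported; "1" is an example value.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import time
import numpy as np

a = np.random.default_rng(0).standard_normal((500, 500))
b = np.random.default_rng(1).standard_normal((500, 500))

t0 = time.perf_counter()
for _ in range(20):
    a @ b  # dispatches to dgemm under the hood
elapsed = time.perf_counter() - t0
print(f"20 x dgemm(500) with 1 thread: {elapsed:.3f} s")
```

Comparing runs with different pinned thread counts at small problem sizes shows where the single-thread/all-threads switch hurts relative to MKL's gradual ramp-up.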

@youyu3
Author

youyu3 commented Aug 27, 2020

Thanks. It is SkylakeX with AVX-512. I tried input sizes from m=n=5000 to m=n=15000, with OpenBLAS 0.3.7 - should that have the latest improvements? I can create a plot and profile DGEMM.

@martin-frbg
Collaborator

0.3.7 (from a year ago) had all parts of the initial AVX512 DGEMM implementation disabled, as it had turned out to be incorrect. AVX512 DGEMM reappeared in 0.3.8 and was further improved in 0.3.10, so ideally you should be trying that (or the git develop branch).
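One way to confirm which OpenBLAS build is actually loaded is to call its C function `openblas_get_config()`, which reports the version and the kernel target. A sketch via ctypes (how the library is located varies by system, so the lookup here is an assumption):

```python
# Sketch: query the loaded OpenBLAS for its build configuration via
# openblas_get_config(). Library lookup details vary by system.
import ctypes
import ctypes.util

path = ctypes.util.find_library("openblas")
if path is None:
    print("libopenblas not found on the default search path")
else:
    lib = ctypes.CDLL(path)
    lib.openblas_get_config.restype = ctypes.c_char_p
    # Typical output includes the version and target, e.g. "... SkylakeX"
    print(lib.openblas_get_config().decode())
```

This avoids guessing from package metadata when several BLAS builds are installed side by side.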

@youyu3
Author

youyu3 commented Aug 27, 2020

Good to know, thanks! I'll try 0.3.10 and report back.

@martin-frbg
Collaborator

There is some data in PR #2646 (and test code linked in #2286).

@youyu3
Author

youyu3 commented Aug 28, 2020

Well, it turns out that I was using 0.3.10 after all. But I have some more observations, shown in the plots below. The dgetrf tests (left panel) were run with a 5000x5000 matrix, and the dgemm tests (right panel) were run with two 5000x5000 matrices. Plotted is the inverse of elapsed time, in inverse seconds, to better show scalability.

  1. For dgemm, OpenBLAS and MKL have about the same performance at any given number of threads;
  2. For dgetrf, OpenBLAS is only slightly (8%) slower in the single-threaded case, but it scales much worse as the number of threads increases.

[Plot: dgetrf (left panel) and dgemm (right panel), inverse elapsed time vs. number of threads, OpenBLAS vs. MKL]

I can run a profiler on dgetrf and find out which part doesn't scale.
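Before (or alongside) profiling, the scaling curve itself can be reproduced by timing dgetrf in a fresh interpreter per thread count, since OPENBLAS_NUM_THREADS must be set before the BLAS library loads. A sketch assuming SciPy is linked against the BLAS under test (matrix size and thread counts are example values):

```python
# Sketch: time dgetrf at several OpenBLAS thread counts. Each measurement
# runs in a subprocess so OPENBLAS_NUM_THREADS takes effect before the
# BLAS library is loaded.
import os
import subprocess
import sys

CHILD = """\
import time
import numpy as np
from scipy.linalg import lapack
a = np.random.default_rng(0).standard_normal((1000, 1000))
t0 = time.perf_counter()
lapack.dgetrf(a)
print(f"{time.perf_counter() - t0:.4f}")
"""

def time_at(nthreads):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(nthreads))
    out = subprocess.run([sys.executable, "-c", CHILD], env=env,
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"threads={n}: {time_at(n):.4f} s")
```

If the per-thread-count times flatten out early, that matches the poor dgetrf scaling seen in the plots.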

@martin-frbg
Collaborator

martin-frbg commented Aug 28, 2020

Interesting, thanks - GETRF is one of the few LAPACK functions that are reimplemented (lapack/getrf/getrf_parallel.c, already in the original GotoBLAS) rather than copied from the reference implementation. There were some fixes in February to my previous heavy-handed approach to making it thread-safe, but perhaps there is more wrong with it. Also, the DTRSM it calls is not optimized for SkylakeX (and neither is LASWP, another reimplemented function).

@youyu3
Author

youyu3 commented Aug 28, 2020

Yes, I noticed that OpenBLAS DGETRF is much faster than the netlib implementation, but as we see here it is still not as fast as MKL.
