
LU and eigen routines slower than MKL #2795

Open
youyu3 opened this issue Aug 26, 2020 · 9 comments
Comments

@youyu3

youyu3 commented Aug 26, 2020

Hi, I'm running some performance comparisons between OpenBLAS and MKL for LU and eigen routines. I'm seeing that the OpenBLAS tests with, for example, dgetrf and dsyevd are about 3 times slower than MKL. These are multi-threaded tests run on a Skylake machine.

I wonder if you have any benchmark results vs. MKL and, if so, what they look like.

Thanks.
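A minimal sketch of the kind of comparison described, timing dgetrf and dsyevd through SciPy's LAPACK bindings (this is an illustrative harness with example sizes, not the author's actual benchmark):

```python
# Sketch: time dgetrf and dsyevd via SciPy's LAPACK bindings on an n x n
# matrix. Sizes and repeat counts are example values, not the thread's.
import time
import numpy as np
from scipy.linalg import lapack

def _time(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

def bench(n=1000, repeats=3):
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    sym = a + a.T  # dsyevd requires a symmetric matrix

    best_getrf = min(_time(lambda: lapack.dgetrf(a)) for _ in range(repeats))
    best_syevd = min(_time(lambda: lapack.dsyevd(sym)) for _ in range(repeats))
    return best_getrf, best_syevd

if __name__ == "__main__":
    getrf_s, syevd_s = bench()
    print(f"dgetrf: {getrf_s:.4f} s   dsyevd: {syevd_s:.4f} s")
```

Running the same script against a NumPy/SciPy stack linked to OpenBLAS and one linked to MKL gives the head-to-head numbers.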

@brada4
Contributor

brada4 commented Aug 27, 2020

Which Skylake? What size of inputs? Which OpenBLAS version? Any virtualisation?
Is DGEMM slower too? (i.e. to check whether the latest release improves/fixes anything)

@martin-frbg
Collaborator

Skylake (Haswell refresh without AVX512) or SkylakeX (with AVX512) ? DGEMM performance for the latter should be about on par with MKL if you use a very recent 0.3.x release or the current develop branch. At small problem sizes MKL will probably be faster as OpenBLAS only has a single threshold for switching from using a single thread to using all available cores, while MKL seems to increase the thread count more gradually to match the workload. (Also MKL may be using a more efficient LAPACK than the "netlib" reference implementation)
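The single-threshold behaviour mentioned above can be probed by pinning the OpenBLAS thread count before the BLAS library loads. A sketch, assuming NumPy is linked against OpenBLAS (the matrix size and thread count are example values):

```python
# Sketch: pin the OpenBLAS thread count. The environment variable must be
# set before NumPy (and its BLAS) is imported; "1" is an example value.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import time
import numpy as np

a = np.random.default_rng(0).standard_normal((500, 500))
b = np.random.default_rng(1).standard_normal((500, 500))

t0 = time.perf_counter()
for _ in range(20):
    a @ b  # dispatches to dgemm under the hood
elapsed = time.perf_counter() - t0
print(f"20 x dgemm(500) with 1 thread: {elapsed:.3f} s")
```

Comparing runs with different pinned thread counts at small problem sizes shows where the single-thread/all-threads switch hurts relative to MKL's gradual ramp-up.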

@youyu3
Author

youyu3 commented Aug 27, 2020

Thanks. It is SkylakeX with AVX-512. I tried input sizes from m=n=5000 to m=n=15000, with OpenBLAS 0.3.7 - should that have the latest improvements? I can create a plot and profile DGEMM.

@martin-frbg
Collaborator

0.3.7 (from a year ago) had all parts of the initial AVX512 DGEMM implementation disabled, as it had turned out to be incorrect. AVX512 DGEMM reappeared in 0.3.8 and was further improved in 0.3.10, so ideally you should be trying that (or the git develop branch).
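One way to confirm which OpenBLAS build is actually loaded is to call its C function `openblas_get_config()`, which reports the version and the kernel target. A sketch via ctypes (how the library is located varies by system, so the lookup here is an assumption):

```python
# Sketch: query the loaded OpenBLAS for its build configuration via
# openblas_get_config(). Library lookup details vary by system.
import ctypes
import ctypes.util

path = ctypes.util.find_library("openblas")
if path is None:
    print("libopenblas not found on the default search path")
else:
    lib = ctypes.CDLL(path)
    lib.openblas_get_config.restype = ctypes.c_char_p
    # Typical output includes the version and target, e.g. "... SkylakeX"
    print(lib.openblas_get_config().decode())
```

This avoids guessing from package metadata when several BLAS builds are installed side by side.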

@youyu3
Author

youyu3 commented Aug 27, 2020

Good to know, thanks! I'll try 0.3.10 and report back.

@martin-frbg
Collaborator

There is some data in PR #2646 (and test code linked in #2286).

@youyu3
Author

youyu3 commented Aug 28, 2020

Well, it turns out that I was using 0.3.10 after all. But I have some more observations, shown in the plots below. The dgetrf tests (left panel) were run with a 5000x5000 matrix, and the dgemm tests (right panel) were run with two 5000x5000 matrices. Plotted is the inverse of elapsed time, in inverse seconds, to better show scalability.

  1. For dgemm, OpenBLAS and MKL have about the same performance at any given number of threads;
  2. For dgetrf, OpenBLAS is only slightly (8%) slower in the single-threaded case, but it scales much worse as the number of threads increases.

[Plot: dgetrf (left panel) and dgemm (right panel), inverse elapsed time vs. number of threads, OpenBLAS vs. MKL]

I can run a profiler on dgetrf and find out which part doesn't scale.
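Before (or alongside) profiling, the scaling curve itself can be reproduced by timing dgetrf in a fresh interpreter per thread count, since OPENBLAS_NUM_THREADS must be set before the BLAS library loads. A sketch assuming SciPy is linked against the BLAS under test (matrix size and thread counts are example values):

```python
# Sketch: time dgetrf at several OpenBLAS thread counts. Each measurement
# runs in a subprocess so OPENBLAS_NUM_THREADS takes effect before the
# BLAS library is loaded.
import os
import subprocess
import sys

CHILD = """\
import time
import numpy as np
from scipy.linalg import lapack
a = np.random.default_rng(0).standard_normal((1000, 1000))
t0 = time.perf_counter()
lapack.dgetrf(a)
print(f"{time.perf_counter() - t0:.4f}")
"""

def time_at(nthreads):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(nthreads))
    out = subprocess.run([sys.executable, "-c", CHILD], env=env,
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"threads={n}: {time_at(n):.4f} s")
```

If the per-thread-count times flatten out early, that matches the poor dgetrf scaling seen in the plots.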

@martin-frbg
Collaborator

martin-frbg commented Aug 28, 2020

Interesting, thanks - GETRF is one of the few LAPACK functions that are reimplemented (lapack/getrf/getrf_parallel.c, already in the original GotoBLAS) rather than copied from the reference implementation. There were some fixes in February to my previous heavy-handed approach to making it thread-safe, but perhaps there is more wrong with it. Also, the DTRSM it calls is not optimized for SkylakeX (and neither is LASWP, another reimplemented function).

@youyu3
Author

youyu3 commented Aug 28, 2020

Yes, I noticed that OpenBLAS DGETRF is much faster than the netlib implementation, but as we see here it is still not as fast as MKL.
