LAPACKE_dsytrd performance degradation with multiple threads on Windows #3801
Could you try building OpenBLAS with …? Another thing to try is looking at the call graph. In the meantime I am trying to put together Julia with a predictable BLAS version included.
syr2 with small n is already transformed to a single-threaded axpy loop, but syr2k does indeed lack a lower threshold.
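A lower threshold of this kind amounts to a size-based dispatch in front of the threaded path: below some cutoff, run the sequential kernel directly and skip thread startup and locking altogether. A minimal sketch of the idea, with placeholder names and an untuned cutoff rather than the actual interface/syr2k.c code:

```c
/* Hypothetical sketch of the kind of small-n cutoff being discussed:
 * below a threshold, call the sequential kernel directly instead of
 * paying thread startup and locking costs.  All names and the cutoff
 * value are placeholders, not the actual interface/syr2k.c code. */
#include <stddef.h>

#define SYR2K_SMALL_N 64   /* placeholder cutoff; would need tuning */

typedef void (*syr2k_kernel)(size_t n, size_t k,
                             const double *a, const double *b, double *c);

void syr2k_dispatch(size_t n, size_t k, int nthreads,
                    const double *a, const double *b, double *c,
                    syr2k_kernel sequential, syr2k_kernel threaded)
{
    /* For small n the O(n^2 * k) work cannot amortize the cost of
     * waking worker threads, so stay on the single-threaded path. */
    if (nthreads <= 1 || n <= SYR2K_SMALL_N)
        sequential(n, k, a, b, c);
    else
        threaded(n, k, a, b, c);
}
```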
I took the first test from the linked Julia ticket; the 2-CPU version is 20% slower, time shows significant time in syscalls, and the perf report is dominated by pthread_mutex_(un)lock, with __sched_yield appearing somewhat further down.
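A profile dominated by pthread_mutex_(un)lock with __sched_yield further down is the usual signature of idle worker threads polling a shared flag under a lock rather than blocking until signalled. A minimal sketch of the two wait patterns, using plain pthreads and hypothetical names rather than the actual OpenBLAS server code:

```c
/* Illustrative only, not the actual OpenBLAS thread-server code.
 * A profile full of pthread_mutex_(un)lock plus __sched_yield is the
 * usual signature of idle workers polling a flag under a lock. */
#include <pthread.h>
#include <sched.h>

struct work_queue {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             has_work;
};

/* Polling wait: every iteration locks, unlocks and yields, so an idle
 * worker burns syscalls and shows up prominently in perf. */
void wait_polling(struct work_queue *q)
{
    for (;;) {
        pthread_mutex_lock(&q->lock);
        int ready = q->has_work;
        pthread_mutex_unlock(&q->lock);
        if (ready)
            break;
        sched_yield();
    }
}

/* Blocking wait: the worker sleeps in the kernel until signalled and
 * costs essentially nothing while idle. */
void wait_blocking(struct work_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (!q->has_work)
        pthread_cond_wait(&q->ready, &q->lock);
    pthread_mutex_unlock(&q->lock);
}
```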
I've recompiled OpenBLAS in MSYS2 using GCC 12.2.0 and …

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 50 1000
sytrd:
Average of 1000 runs for a 50x50 matrix = 77.4 us
zhetrd:
Average of 1000 runs for a 50x50 matrix = 140.0 us

$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 50 1000
sytrd:
Average of 1000 runs for a 50x50 matrix = 1178.5 us
zhetrd:
Average of 1000 runs for a 50x50 matrix = 195.2 us

For 500x500 matrices:

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 500 100
sytrd:
Average of 100 runs for a 500x500 matrix = 7139.4 us
zhetrd:
Average of 100 runs for a 500x500 matrix = 31313.9 us

$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 500 100
sytrd:
Average of 100 runs for a 500x500 matrix = 28348.2 us
Segmentation fault

I also now get a segfault in …
Not certainly compiler-related. What I suspected in the other issue is that … For segfault debugging you can use x64dbg (or tame mingw gdb) to see the call stack at the moment of the crash, and whether the faulty function comes from openblas.dll or your main() … Could you change interface/syr2k.c to …? Please compare the same build to the same build with one file changed. As far as interface/ files are concerned, you can re-run make at the toplevel to update the few functions derived from one file without recompiling the rest.
Added …
The original Julia ticket seems to stress that the slowdown is a Windows-only issue, not observed on either OSX or Linux. So it is more likely to be a poor implementation of threading (in particular locking?) in blas_server_win32.c, or our old friend the YIELDING macro in common.h.
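For context, YIELDING is what OpenBLAS worker threads execute in their spin loops while waiting for work; its exact definition differs across platforms and versions, so the following is only an illustrative sketch of the kind of choices involved, not the real common.h contents:

```c
/* Illustrative sketch only, not the actual common.h definition (which
 * varies by platform and OpenBLAS version).  Worker threads execute
 * YIELDING in a tight loop while waiting for work, so whatever it
 * expands to dominates idle-time behaviour. */
#if defined(_WIN32)
  #include <windows.h>
  #define YIELDING  SwitchToThread()   /* alternatives: Sleep(0), Sleep(1) */
#else
  #include <sched.h>
  #define YIELDING  sched_yield()      /* alternatives: usleep(1), a nop spin */
#endif
```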
Regarding the crash, did you build 0.3.21 or current git?
I built the released 0.3.21 version (commit …).
Could you build with …?
Sure! Should I set …?
Setting it in COMMON_OPT should be safest.
Maybe the Windows issue is more prominent, but there is a slight slowdown from excess CPUs spinning on Linux pthreads too, within the range of lost turbo. But I see more gold seeping out once you start digging.
Rebuilt the release version with …

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 500 100
dsytrd:
Average of 100 runs for a 500x500 matrix = 7386.9 us
zhetrd:
Average of 100 runs for a 500x500 matrix = 32198.4 us

$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 500 100
dsytrd:
Average of 100 runs for a 500x500 matrix = 29271.0 us
zhetrd:
Average of 100 runs for a 500x500 matrix = 61499.5 us
You should help with a C reproducer. We are out of surface guesses.
The C reproducer is here (I've mentioned it in the first comment), but let me know if you mean something else.
That must be busy poll(s)? Just that it is so short that …
Did a quick test run yesterday, replacing the implementation of YIELDING with …
I tried various sleeps. They are being called in a tight loop. usleep(1) adds slightly to the total wall time but reduces system time to nearly invisible.
That's back to the old discussion of whether it should be a busy loop or not, but it has no particular effect on this issue (why small-size dsytrd suffers from multithreading more than zhetrd)?
The profile comes up with blanks, so I dare not guess.
I tried the mentioned … on Windows 10 (64-bit OS).
@carlkl did you build with …?
No, but compiling the test program with …
Or …
Neither of these should "lead to" segfaults, but the tree vectorizer is/was at least a known source of problems with gcc12.
You need to manipulate the "global" setting in COMMON_OPT in Makefile.rule, and run …
(This follows from JuliaLang/LinearAlgebra.jl#960)
On Windows, LAPACKE_dsytrd seems to slow down considerably in multi-threaded mode compared to single-threaded. The test code is here. For a 50x50 matrix, the results are (Windows 10 machine with an i7-6700K, 4 cores / 8 threads): …
I've also included LAPACKE_zhetrd for comparison; it is apparent that in the 4-threaded case dsytrd is even slower than zhetrd, which seems counterintuitive. For larger matrices, dsytrd becomes faster than zhetrd, but multithreading still leads to a slowdown: …
I could reproduce these results on four different Windows 10 machines (brief system info here). I used OpenBLAS 0.3.21, which I compiled using GCC 7.2.0 and cmake (with the default options).
For comparison, on an Intel Mac (macOS 10.14.6, i5-5250U, 2 cores / 4 threads) there is only a small performance penalty (if at all) when manipulating small matrices in parallel, while for larger matrices multithreading boosts performance: …
This uses OpenBLAS 0.3.21 compiled with clang from Apple LLVM 10.0.1 and GNU Fortran (GCC) 8.2.0 (for building LAPACK), using cmake with the default options.
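The linked test code is not reproduced in this thread; the following is a hypothetical sketch of a reproducer along the same lines (matrix size and run count on the command line, wall-clock average per LAPACKE_dsytrd call), assuming the standard LAPACKE interface from lapacke.h rather than the exact linked program:

```c
/* Hypothetical reproducer along the lines described above (not the
 * exact linked test program): times LAPACKE_dsytrd on an n-by-n
 * symmetric matrix, averaged over a number of runs.
 *   gcc lapack-test.c -o lapack-test -lopenblas
 *   OPENBLAS_NUM_THREADS=4 ./lapack-test 50 1000                    */
#include <lapacke.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double wall_us(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);               /* C11 wall-clock timer */
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(int argc, char **argv)
{
    lapack_int n = (argc > 1) ? atoi(argv[1]) : 50;
    int runs     = (argc > 2) ? atoi(argv[2]) : 1000;
    double *a    = malloc((size_t)n * n * sizeof *a);
    double *d    = malloc((size_t)n * sizeof *d);
    double *e    = malloc((size_t)n * sizeof *e);   /* needs n-1; n is fine */
    double *tau  = malloc((size_t)n * sizeof *tau);
    double total = 0.0;

    srand(42);
    for (int r = 0; r < runs; r++) {
        /* refill a symmetric matrix each run; dsytrd overwrites it */
        for (lapack_int j = 0; j < n; j++)
            for (lapack_int i = 0; i <= j; i++)
                a[i + j * n] = a[j + i * n] = (double)rand() / RAND_MAX;

        double t0 = wall_us();
        lapack_int info = LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'U', n, a, n, d, e, tau);
        total += wall_us() - t0;
        if (info != 0) {
            fprintf(stderr, "dsytrd failed with info = %d\n", (int)info);
            return 1;
        }
    }
    printf("Average of %d runs for a %dx%d matrix = %.1f us\n",
           runs, (int)n, (int)n, total / runs);
    free(a); free(d); free(e); free(tau);
    return 0;
}
```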