LAPACKE_dsytrd performance degradation with multiple threads on Windows #3801

Closed
yakovbraver opened this issue Oct 30, 2022 · 28 comments · Fixed by #3953

@yakovbraver

(This follows from JuliaLang/LinearAlgebra.jl#960)
On Windows, LAPACKE_dsytrd seems to slow down considerably in multi-threaded mode compared to single-threaded. The test code is here. For a 50x50 matrix, the results are (Windows 10 machine with i7-6700K (4 cores, 8 threads)):

> set OPENBLAS_NUM_THREADS=1
> lapack-test.exe 50 1000
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 124.9 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 203.1 us
> set OPENBLAS_NUM_THREADS=4
> lapack-test.exe 50 1000
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 1296.5 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 249.9 us

I've also included LAPACKE_zhetrd for comparison; it is apparent that in the 4-threaded case dsytrd is even slower than zhetrd, which seems counterintuitive.
For larger matrices, dsytrd becomes faster than zhetrd, but multithreading still leads to a slowdown:

> set OPENBLAS_NUM_THREADS=1
> lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 10831.1 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 32492.4 us
> set OPENBLAS_NUM_THREADS=4
> lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 30951.7 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 59309.2 us

I could reproduce these results on four different Windows 10 machines (brief system info here). I used OpenBLAS 0.3.21 which I compiled using GCC 7.2.0 and cmake (with the default options).

For comparison, on an Intel Mac (macOS 10.14.6, i5-5250U (2 cores, 4 threads)) there is only a small performance penalty (if any) when manipulating small matrices in parallel, while for larger matrices multithreading boosts performance:

$ export OPENBLAS_NUM_THREADS=4    
$ ./lapack-test 50 1000                  
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 229.2 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 315.5 us
$ export OPENBLAS_NUM_THREADS=1    
$ ./lapack-test 50 1000                  
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 180.8 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 288.8 us

$ export OPENBLAS_NUM_THREADS=4    
$ ./lapack-test 500 100                  
dsytrd:
 Average of 100 runs for a 500x500 matrix = 14451.2 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 42615.3 us
$ export OPENBLAS_NUM_THREADS=1    
$ ./lapack-test 500 100                                                            
dsytrd:
 Average of 100 runs for a 500x500 matrix = 19221.5 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 55341.5 us

These results use OpenBLAS 0.3.21 compiled with clang from Apple LLVM 10.0.1 and GNU Fortran (GCC) 8.2.0 (for building LAPACK), using cmake with the default options.
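In essence, the test program just times a loop of LAPACKE calls; a stripped-down sketch is below (the real lapack-test.c also times LAPACKE_zhetrd and differs in the details of allocation and initialization):

```c
/* Stripped-down sketch of the timing harness (illustrative only).
 * Build against OpenBLAS, e.g.: gcc lapack-test.c -o lapack-test -lopenblas */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <lapacke.h>

int main(int argc, char **argv)
{
    int N     = atoi(argv[1]);          /* matrix dimension, e.g. 50 or 500 */
    int loops = atoi(argv[2]);          /* number of repetitions to average */

    double *A   = malloc((size_t)N * N * sizeof(double));
    double *d   = malloc((size_t)N * sizeof(double));
    double *e   = malloc((size_t)N * sizeof(double));
    double *tau = malloc((size_t)N * sizeof(double));
    for (int i = 0; i < N * N; i++) A[i] = (double)rand() / RAND_MAX;

    struct timeval start, stop;
    gettimeofday(&start, NULL);
    for (int j = 0; j < loops; j++)
        LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'U', N, A, N, d, e, tau);
    gettimeofday(&stop, NULL);

    double us = (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_usec - start.tv_usec);
    printf("dsytrd:\n Average of %d runs for a %dx%d matrix = %.1f us\n",
           loops, N, N, us / loops);
    return 0;
}
```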

@brada4 (Contributor) commented Oct 30, 2022

Could you try building OpenBLAS with make? There is some suspicion that the -O3 implied by the cmake build optimized out the threading threshold in JuliaLang/julia#3787, at least at first sight from the loose backtraces.

Another approach is looking at the call graph:
syr2 and syr2k do not have a threading threshold.
Try undef-ing SMP in those files in interface/ if that is not too tough an experiment.

In the meantime I am trying to put together Julia with a predictable BLAS version included.

@martin-frbg (Collaborator)

syr2 with small n is already transformed to a single-threaded axpy loop, but syr2k does indeed lack a lower threshold.
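For illustration, the kind of lower threshold in question would look roughly like the standalone sketch below (the cutoff value, flop estimate and helper name are made up for illustration, not the actual OpenBLAS code):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cutoff below which threading costs more than it saves;
   the real value would have to be tuned, this one is only illustrative. */
#define SMALL_SYR2K_THRESHOLD (64ULL * 64ULL * 64ULL)

/* A rank-2k update (syr2k) does roughly 2*n*n*k flops; below the cutoff the
   thread wake-up and busy-wait synchronization dominate, so stay serial.   */
static int syr2k_threads(uint64_t n, uint64_t k, int max_threads)
{
    if (2 * n * n * k < SMALL_SYR2K_THRESHOLD)
        return 1;
    return max_threads;
}

int main(void)
{
    printf("n=50,  k=50  -> %d thread(s)\n", syr2k_threads(50, 50, 4));    /* 1 */
    printf("n=500, k=500 -> %d thread(s)\n", syr2k_threads(500, 500, 4));  /* 4 */
    return 0;
}
```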

@brada4 (Contributor) commented Oct 30, 2022

I took the first test from the linked Julia ticket; the 2-CPU version is 20% slower, time shows significant time spent in syscalls, and the perf report is dominated by pthread_mutex_(un)lock, with __sched_yield appearing somewhat further down.
Even with -g everywhere, the backtraces still do not make enough sense to attribute the fault to a particular BLAS function.

@yakovbraver (Author)

I've recompiled OpenBLAS in MSYS2 using GCC 12.2.0 and make. Execution time is better now (perhaps also because of the newer compiler), but the problem persists. On the same Windows 10, i7-6700K machine I now get the following for 50x50 matrices:

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 50 1000
sytrd:
 Average of 1000 runs for a 50x50 matrix = 77.4 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 140.0 us
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 50 1000
sytrd:
 Average of 1000 runs for a 50x50 matrix = 1178.5 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 195.2 us

For 500x500 matrices:

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 500 100
sytrd:
 Average of 100 runs for a 500x500 matrix = 7139.4 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 31313.9 us
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 500 100
sytrd:
 Average of 100 runs for a 500x500 matrix = 28348.2 us
Segmentation fault

I also now get a segfault in zhetrd for matrix size N >= 256, although that might be a problem with my test code.

@brada4 (Contributor) commented Oct 31, 2022

Not necessarily compiler-related. What I suspected in the other issue is that the threshold logic ("if input > 1 MB then nthreads = max, else nthreads = 1") is optimized out by the -O3 implied by the cmake defaults, so you unconditionally get the SMP code path.

For segfault debugging you can use x64dbg (or a tamed MinGW gdb) to see the call stack at the moment of the crash and whether the faulty function comes from openblas.dll or from your main().

Could you change interface/syr2k.c to #undef SMP early, to skip all SMP code in that file so that unnecessary parallelism does not kick in for small matrices? That would effectively reduce the test case to a single source file, if that small change regains the 50x50 performance.

Please compare the same build against the same build with the one-file change. As far as the interface/ files are concerned, you can re-run make at the top level to update the few functions derived from that one file without recompiling the rest.
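The suggested experiment is roughly this one-line change near the top of interface/syr2k.c (the exact placement may differ between versions):

```c
/* near the top of interface/syr2k.c, after the usual includes */
#include "common.h"

#undef SMP   /* skip every threaded code path in this translation unit */
```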

@yakovbraver (Author)

Added #undef SMP at line 45 of interface/syr2k.c, re-ran make at the top level and recompiled my test code. However, the results are identical to those in my previous post, including the segfault (which I will try to debug).

@yakovbraver (Author)

I've attached MinGW gdb to the program, and the stack trace seems to point to zgemv causing the segfault:
[screenshot of the gdb backtrace pointing at zgemv]

@martin-frbg (Collaborator)

The original Julia ticket seems to stress that the slowdown is a Windows-only issue, not observed on either OSX or Linux. So it is more likely to be a poor implementation of threading (in particular locking?) in blas_server_win32.c, or our old friend the YIELDING macro in common.h.

@martin-frbg (Collaborator)

Regarding the crash, did you build 0.3.21 or the current git develop?

@yakovbraver (Author)

I built the released 0.3.21 version (commit b89fb70).

@martin-frbg (Collaborator)

Could you build with -fno-tree-vectorize then, please? (PR JuliaLang/julia#3745; see the currently open Milestone where I track the non-trivial changes since the last release.)

@yakovbraver (Author)

Sure! Should I set COMMON_OPT = -O2 -fno-tree-vectorize in Makefile.rule or should I set this flag elsewhere?
https://github.com/xianyi/OpenBLAS/blob/65338a9493a1dd4860ac021db47d97468180f46c/Makefile.rule#L228-L231

@martin-frbg (Collaborator)

Setting it in COMMON_OPT should be safest.

@brada4 (Contributor) commented Oct 31, 2022

Maybe the Windows issue is more prominent, but there is a slight slowdown from excess CPU spinning on Linux pthreads too, within the range of the lost turbo. And I suspect there is more to uncover once you start digging.

@yakovbraver (Author)

Rebuilt the release version with -fno-tree-vectorize and the segfault is gone. For the record, the timing for 500x500 matrices is now:

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 7386.9 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 32198.4 us
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 29271.0 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 61499.5 us

@brada4 (Contributor) commented Oct 31, 2022

Could you help with a C reproducer? We are out of surface-level guesses.

@yakovbraver (Author)

The C reproducer is here — I've mentioned it in the first comment but let me know if you mean something else.

@brada4 (Contributor) commented Nov 4, 2022

That must be the busy poll(s)? They are just so short that perf does not catch them; Julia or C does not matter, neither gets down to the pthread_yield-calling hotspot:
`while (job[mypos].working[i][CACHE_LINE_SIZE * bufferside]) {YIELDING;};`
One quick approach would be pushing the threading threshold much further out than it is now, so that the inevitable busy loop does not add to the execution time.
It is not a quick few-liner with pthread_barrier_wait, as the number of barriers to break through is not known while iterating through the quickdivide. I.e. all those places need to be ordered: first set all thread init conditions, then init the barrier, then schedule the work pieces that each end in a _wait, then drop the barrier.
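As a standalone illustration of that ordering (nothing like the real level3_thread.c, where the number of synchronization points depends on how the work is subdivided; compile with -pthread):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    /* ... compute this thread's piece of the matrix update here ... */
    printf("thread %d finished its piece\n", id);
    pthread_barrier_wait(&barrier);   /* block instead of busy-spinning on a flag */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    /* the ordering described above: set up the work, init the barrier,
       schedule the pieces (each ending in a _wait), then drop the barrier */
    pthread_barrier_init(&barrier, NULL, NTHREADS + 1);   /* workers + main */

    for (intptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    pthread_barrier_wait(&barrier);   /* main waits until every piece is done */

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```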

@martin-frbg (Collaborator)

Did a quick test run yesterday, replacing the implementation of YIELDING with sleep(0) (as suggested in some older threads on the net) but did not observe any marked difference. One other possibility could be that my use of a CRITICAL_SECTION instead of a mutex in level3_thread.c (from PR JuliaLang/julia#1875 four years ago) puts Windows multithreading at a disadvantage. Still, neither would explain why it affects dsytrd more than zhetrd.
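The experiment looked roughly like this (illustrative only; the real YIELDING macro is defined per platform in common.h, and this is not a patch against it):

```c
#ifdef _WIN32
#include <windows.h>
#undef  YIELDING
#define YIELDING  Sleep(0)   /* yield the rest of the time slice instead of spinning */
#endif
```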

@brada4 (Contributor) commented Nov 5, 2022

I tried various sleeps. They are being called in a tight loop. usleep(1) adds slightly to the total wall time but reduces the system time to nearly invisible.

@martin-frbg (Collaborator) commented Nov 5, 2022

That's back to the old discussion of whether it should be a busy loop or not, but it has no particular effect on this issue (why small-size dsytrd suffers from multithreading more than zhetrd)?

@brada4 (Contributor) commented Nov 5, 2022

The profile has blanks in it, so I dare not guess.

@carlkl commented Nov 8, 2022

I tried the mentioned lapack-test.c on Windows with a locally compiled trunk version of OpenBLAS and got this:

$ catchsegv.exe ./lapack-test.exe 500 100
lapack-test.exe caused an Access Violation at location 00007FF6998D2379 in module lapack-test.exe Reading from location 00000916ACE56FE0.

AddrPC           Params
00007FF6998D2379 0000000000000055 00007FF6998D5FF4 0000000000200000  lapack-test.exe!LAPACKE_dtr_nancheck+0x199
00007FF6998D2184 0000027EACB560B0 00007FFEE4940800 0000000000000001  lapack-test.exe!LAPACKE_dsy_nancheck+0x24
00007FF6998D18C4 00000000000001F3 0000000000000004 00007FFFFFFEFFFF  lapack-test.exe!LAPACKE_dsytrd+0xc4
00007FF6998D16A1 0000000000000010 00007FF69A6053B8 00007FF69A6053F0  lapack-test.exe!main+0x171  [C:/dev/home/tmp/lapack_test/lapack-test.c @ 97]
    95: gettimeofday(&start, NULL);
    96:
>   97: for (int j = 0; j < loops; j++) {
    98: LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'U', N, A_double, N, d, e, tau_double);
    99: }
00007FF6998D13AE 0000000000000000 0000000000000000 0000000000000000  lapack-test.exe!__tmainCRTStartup+0x22e  [C:/M/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c @ 329]
00007FF6998D14E6 0000000000000000 0000000000000000 0000000000000000  lapack-test.exe!mainCRTStartup+0x16  [C:/M/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c @ 206]
00007FFEE4197034 0000000000000000 0000000000000000 0000000000000000  KERNEL32.DLL!BaseThreadInitThunk+0x14
00007FFEE49426A1 0000000000000000 0000000000000000 0000000000000000  ntdll.dll!RtlUserThreadStart+0x21

Segmentation fault

Windows 10 OS 64-bit
MSYS2 gcc-12
OpenBLAS-5d02f2e
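One possible way to narrow this down, given that the trace points at LAPACKE's input NaN scan rather than the factorization itself (assuming the LAPACKE in use is recent enough, LAPACK >= 3.8, to provide LAPACKE_set_nancheck):

```c
#include <lapacke.h>

int main(void)
{
    /* disable LAPACKE's input NaN checks (LAPACKE_dsy_nancheck and friends);
       if the fault then moves into dsytrd itself, the check was only the
       messenger for already-corrupted data or dimensions                   */
    LAPACKE_set_nancheck(0);

    /* ... re-run the same LAPACKE_dsytrd timing loop from lapack-test.c ... */
    return 0;
}
```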

@martin-frbg (Collaborator)

@carlkl did you build with -fno-tree-vectorize?

@carlkl commented Nov 8, 2022

No, but compiling the test program with -fno-tree-vectorize does not help.
I did not compile OpenBLAS with this flag; I was under the impression that this is no longer necessary in trunk?
However, I can compile OpenBLAS again and set this flag globally, but this will take some time.
What about -fno-common?

@carlkl commented Nov 8, 2022

Or -ffp-contract=off?

@martin-frbg (Collaborator)

Neither of these should "lead to" segfaults, but the tree vectorizer is (or at least was) a known source of problems with gcc 12.
I assume the segfault is somewhere inside dsytrd, but the backtrace only shows the invocation, as that is the last bit of instrumented code.

@brada4 (Contributor) commented Nov 9, 2022

You need to change the "global" setting in COMMON_OPT in Makefile.rule, and run make clean between builds.
