LAPACKE_dsytrd performance degradation with multiple threads on Windows #3801

Closed
yakovbraver opened this issue Oct 30, 2022 · 28 comments · Fixed by #3953

@yakovbraver

(This follows from JuliaLang/LinearAlgebra.jl#960)
On Windows, LAPACKE_dsytrd seems to slow down considerably in multi-threaded mode compared to single-threaded. The test code is here. For a 50x50 matrix, the results are (Windows 10 machine with i7-6700K (4 cores, 8 threads)):

> set OPENBLAS_NUM_THREADS=1
> lapack-test.exe 50 1000
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 124.9 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 203.1 us
> set OPENBLAS_NUM_THREADS=4
> lapack-test.exe 50 1000
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 1296.5 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 249.9 us

I've also included LAPACKE_zhetrd for comparison; it is apparent that in the 4-threaded case dsytrd is even slower than zhetrd, which seems counterintuitive.
For larger matrices, dsytrd becomes faster than zhetrd, but multithreading still leads to a slowdown:

> set OPENBLAS_NUM_THREADS=1
> lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 10831.1 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 32492.4 us
> set OPENBLAS_NUM_THREADS=4
> lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 30951.7 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 59309.2 us

I could reproduce these results on four different Windows 10 machines (brief system info here). I used OpenBLAS 0.3.21 which I compiled using GCC 7.2.0 and cmake (with the default options).

For comparison, on an Intel Mac (macOS 10.14.6, i5-5250U (2 cores, 4 threads)) there is only a small performance penalty (if any) when manipulating small matrices in parallel, while for larger matrices multithreading boosts performance:

$ export OPENBLAS_NUM_THREADS=4    
$ ./lapack-test 50 1000                  
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 229.2 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 315.5 us
$ export OPENBLAS_NUM_THREADS=1    
$ ./lapack-test 50 1000                  
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 180.8 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 288.8 us

$ export OPENBLAS_NUM_THREADS=4    
$ ./lapack-test 500 100                  
dsytrd:
 Average of 100 runs for a 500x500 matrix = 14451.2 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 42615.3 us
$ export OPENBLAS_NUM_THREADS=1    
$ ./lapack-test 500 100                                                            
dsytrd:
 Average of 100 runs for a 500x500 matrix = 19221.5 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 55341.5 us

These results use OpenBLAS 0.3.21 compiled with clang from Apple LLVM 10.0.1 and GNU Fortran (GCC) 8.2.0 (for building LAPACK), using cmake with the default options.
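In essence, the test program just times a loop of LAPACKE calls; a stripped-down sketch is below (the real lapack-test.c also times LAPACKE_zhetrd and differs in the details of allocation and initialization):

```c
/* Stripped-down sketch of the timing harness (illustrative only).
 * Build against OpenBLAS, e.g.: gcc lapack-test.c -o lapack-test -lopenblas */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <lapacke.h>

int main(int argc, char **argv)
{
    int N     = atoi(argv[1]);          /* matrix dimension, e.g. 50 or 500 */
    int loops = atoi(argv[2]);          /* number of repetitions to average */

    double *A   = malloc((size_t)N * N * sizeof(double));
    double *d   = malloc((size_t)N * sizeof(double));
    double *e   = malloc((size_t)N * sizeof(double));
    double *tau = malloc((size_t)N * sizeof(double));
    for (int i = 0; i < N * N; i++) A[i] = (double)rand() / RAND_MAX;

    struct timeval start, stop;
    gettimeofday(&start, NULL);
    for (int j = 0; j < loops; j++)
        LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'U', N, A, N, d, e, tau);
    gettimeofday(&stop, NULL);

    double us = (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_usec - start.tv_usec);
    printf("dsytrd:\n Average of %d runs for a %dx%d matrix = %.1f us\n",
           loops, N, N, us / loops);
    return 0;
}
```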

@brada4 (Contributor) commented Oct 30, 2022

Could you try building OpenBLAS with make? There is some suspicion that the -O3 implied by the cmake build optimized out the threading threshold in JuliaLang/julia#3787, at least at first sight from the loose backtraces.

Another approach is looking at the call graph:
syr2 and syr2k do not have a threading threshold.
Try undef-ing SMP in those files in interface/ if that is not too tough an experiment.

In the meantime I am trying to put together Julia with a predictable BLAS version included.

@martin-frbg (Collaborator)

syr2 with small n is already transformed to a single-threaded axpy loop, but syr2k does indeed lack a lower threshold.
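For illustration, the kind of lower threshold in question would look roughly like the standalone sketch below (the cutoff value, flop estimate and helper name are made up for illustration, not the actual OpenBLAS code):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cutoff below which threading costs more than it saves;
   the real value would have to be tuned, this one is only illustrative. */
#define SMALL_SYR2K_THRESHOLD (64ULL * 64ULL * 64ULL)

/* A rank-2k update (syr2k) does roughly 2*n*n*k flops; below the cutoff the
   thread wake-up and busy-wait synchronization dominate, so stay serial.   */
static int syr2k_threads(uint64_t n, uint64_t k, int max_threads)
{
    if (2 * n * n * k < SMALL_SYR2K_THRESHOLD)
        return 1;
    return max_threads;
}

int main(void)
{
    printf("n=50,  k=50  -> %d thread(s)\n", syr2k_threads(50, 50, 4));    /* 1 */
    printf("n=500, k=500 -> %d thread(s)\n", syr2k_threads(500, 500, 4));  /* 4 */
    return 0;
}
```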

@brada4 (Contributor) commented Oct 30, 2022

I took the first test from the linked Julia ticket; the 2-CPU version is 20% slower, time shows significant time spent in syscalls, and the perf report is dominated by pthread_mutex_(un)lock, with __sched_yield appearing somewhat further down.
Even with -g everywhere, the backtraces still do not make enough sense to attribute the fault to a particular BLAS function.

@yakovbraver (Author)

I've recompiled OpenBLAS in MSYS2 using GCC 12.2.0 and make. Execution time is better now (perhaps also because of the newer compiler), but the problem persists. On the same Windows 10, i7-6700K machine I now get the following for 50x50 matrices:

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 50 1000
sytrd:
 Average of 1000 runs for a 50x50 matrix = 77.4 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 140.0 us
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 50 1000
sytrd:
 Average of 1000 runs for a 50x50 matrix = 1178.5 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 195.2 us

For 500x500 matrices:

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 500 100
sytrd:
 Average of 100 runs for a 500x500 matrix = 7139.4 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 31313.9 us
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 500 100
sytrd:
 Average of 100 runs for a 500x500 matrix = 28348.2 us
Segmentation fault

I also now get a segfault in zhetrd for matrix size N >= 256, although that might be a problem with my test code.

@brada4 (Contributor) commented Oct 31, 2022

Not necessarily compiler-related. What I suspected in the other issue is that the threshold logic ("if input > 1 MB then nthreads = max, else nthreads = 1") is optimized out by the -O3 implied by the cmake defaults, so you unconditionally get the SMP code path.

For segfault debugging you can use x64dbg (or a tamed MinGW gdb) to see the call stack at the moment of the crash and whether the faulty function comes from openblas.dll or from your main().

Could you change interface/syr2k.c to #undef SMP early, to skip all SMP code in that file so that unnecessary parallelism does not kick in for small matrices? That would effectively reduce the test case to a single source file, if that small change regains the 50x50 performance.

Please compare the same build against the same build with the one-file change. As far as the interface/ files are concerned, you can re-run make at the top level to update the few functions derived from that one file without recompiling the rest.
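The suggested experiment is roughly this one-line change near the top of interface/syr2k.c (the exact placement may differ between versions):

```c
/* near the top of interface/syr2k.c, after the usual includes */
#include "common.h"

#undef SMP   /* skip every threaded code path in this translation unit */
```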

@yakovbraver (Author)

Added #undef SMP at line 45 of interface/syr2k.c, re-ran make at the top level and recompiled my test code. However, the results are identical to those in my previous post, including the segfault (which I will try to debug).

@yakovbraver (Author)

I've attached MinGW gdb to the program, and the stack trace seems to point to zgemv causing the segfault:
[screenshot of the gdb backtrace pointing at zgemv]

@martin-frbg (Collaborator)

The original Julia ticket seems to stress that the slowdown is a Windows-only issue, not observed on either OSX or Linux. So it is more likely to be a poor implementation of threading (in particular locking?) in blas_server_win32.c, or our old friend the YIELDING macro in common.h.

@martin-frbg (Collaborator)

Regarding the crash, did you build 0.3.21 or the current git develop?

@yakovbraver (Author)

I built the released 0.3.21 version (commit b89fb70).

@martin-frbg (Collaborator)

Could you build with -fno-tree-vectorize then, please? (PR JuliaLang/julia#3745; see the currently open Milestone where I track the non-trivial changes since the last release.)

@yakovbraver (Author)

Sure! Should I set COMMON_OPT = -O2 -fno-tree-vectorize in Makefile.rule or should I set this flag elsewhere?
https://github.com/xianyi/OpenBLAS/blob/65338a9493a1dd4860ac021db47d97468180f46c/Makefile.rule#L228-L231

@martin-frbg (Collaborator)

Setting it in COMMON_OPT should be safest.

@brada4 (Contributor) commented Oct 31, 2022

Maybe the Windows issue is more prominent, but there is a slight slowdown from excess CPU spinning on Linux pthreads too, within the range of the lost turbo. And I suspect there is more to uncover once you start digging.

@yakovbraver (Author)

Rebuilt the release version with -fno-tree-vectorize and the segfault is gone. For the record, the timing for 500x500 matrices is now:

$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 7386.9 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 32198.4 us
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 29271.0 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 61499.5 us

@brada4 (Contributor) commented Oct 31, 2022

Could you help with a C reproducer? We are out of surface-level guesses.

@yakovbraver (Author)

The C reproducer is here — I've mentioned it in the first comment but let me know if you mean something else.

@brada4 (Contributor) commented Nov 4, 2022

That must be the busy poll(s)? They are just so short that perf does not catch them; Julia or C does not matter, neither gets down to the pthread_yield-calling hotspot:
`while (job[mypos].working[i][CACHE_LINE_SIZE * bufferside]) {YIELDING;};`
One quick approach would be pushing the threading threshold much further out than it is now, so that the inevitable busy loop does not add to the execution time.
It is not a quick few-liner with pthread_barrier_wait, as the number of barriers to break through is not known while iterating through the quickdivide. I.e. all those places need to be ordered: first set all thread init conditions, then init the barrier, then schedule the work pieces that each end in a _wait, then drop the barrier.
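As a standalone illustration of that ordering (nothing like the real level3_thread.c, where the number of synchronization points depends on how the work is subdivided; compile with -pthread):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    /* ... compute this thread's piece of the matrix update here ... */
    printf("thread %d finished its piece\n", id);
    pthread_barrier_wait(&barrier);   /* block instead of busy-spinning on a flag */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    /* the ordering described above: set up the work, init the barrier,
       schedule the pieces (each ending in a _wait), then drop the barrier */
    pthread_barrier_init(&barrier, NULL, NTHREADS + 1);   /* workers + main */

    for (intptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    pthread_barrier_wait(&barrier);   /* main waits until every piece is done */

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```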

@martin-frbg (Collaborator)

Did a quick test run yesterday, replacing the implementation of YIELDING with sleep(0) (as suggested in some older threads on the net) but did not observe any marked difference. One other possibility could be that my use of a CRITICAL_SECTION instead of a mutex in level3_thread.c (from PR JuliaLang/julia#1875 four years ago) puts Windows multithreading at a disadvantage. Still, neither would explain why it affects dsytrd more than zhetrd.
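The experiment looked roughly like this (illustrative only; the real YIELDING macro is defined per platform in common.h, and this is not a patch against it):

```c
#ifdef _WIN32
#include <windows.h>
#undef  YIELDING
#define YIELDING  Sleep(0)   /* yield the rest of the time slice instead of spinning */
#endif
```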

@brada4 (Contributor) commented Nov 5, 2022

I tried various sleeps. They are being called in a tight loop. usleep(1) adds slightly to the total wall time but reduces the system time to nearly invisible.

@martin-frbg (Collaborator) commented Nov 5, 2022

That's back to the old discussion of whether it should be a busy loop or not, but it has no particular effect on this issue (why small-size dsytrd suffers from multithreading more than zhetrd)?

@brada4 (Contributor) commented Nov 5, 2022

The profile has blanks in it, so I dare not guess.

@carlkl commented Nov 8, 2022

I tried the mentioned lapack-test.c on Windows with a locally compiled trunk version of OpenBLAS and got this:

$ catchsegv.exe ./lapack-test.exe 500 100
lapack-test.exe caused an Access Violation at location 00007FF6998D2379 in module lapack-test.exe Reading from location 00000916ACE56FE0.

AddrPC           Params
00007FF6998D2379 0000000000000055 00007FF6998D5FF4 0000000000200000  lapack-test.exe!LAPACKE_dtr_nancheck+0x199
00007FF6998D2184 0000027EACB560B0 00007FFEE4940800 0000000000000001  lapack-test.exe!LAPACKE_dsy_nancheck+0x24
00007FF6998D18C4 00000000000001F3 0000000000000004 00007FFFFFFEFFFF  lapack-test.exe!LAPACKE_dsytrd+0xc4
00007FF6998D16A1 0000000000000010 00007FF69A6053B8 00007FF69A6053F0  lapack-test.exe!main+0x171  [C:/dev/home/tmp/lapack_test/lapack-test.c @ 97]
    95: gettimeofday(&start, NULL);
    96:
>   97: for (int j = 0; j < loops; j++) {
    98: LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'U', N, A_double, N, d, e, tau_double);
    99: }
00007FF6998D13AE 0000000000000000 0000000000000000 0000000000000000  lapack-test.exe!__tmainCRTStartup+0x22e  [C:/M/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c @ 329]
00007FF6998D14E6 0000000000000000 0000000000000000 0000000000000000  lapack-test.exe!mainCRTStartup+0x16  [C:/M/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c @ 206]
00007FFEE4197034 0000000000000000 0000000000000000 0000000000000000  KERNEL32.DLL!BaseThreadInitThunk+0x14
00007FFEE49426A1 0000000000000000 0000000000000000 0000000000000000  ntdll.dll!RtlUserThreadStart+0x21

Segmentation fault

Windows 10 OS 64-bit
MSYS2 gcc-12
OpenBLAS-5d02f2e
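One possible way to narrow this down, given that the trace points at LAPACKE's input NaN scan rather than the factorization itself (assuming the LAPACKE in use is recent enough, LAPACK >= 3.8, to provide LAPACKE_set_nancheck):

```c
#include <lapacke.h>

int main(void)
{
    /* disable LAPACKE's input NaN checks (LAPACKE_dsy_nancheck and friends);
       if the fault then moves into dsytrd itself, the check was only the
       messenger for already-corrupted data or dimensions                   */
    LAPACKE_set_nancheck(0);

    /* ... re-run the same LAPACKE_dsytrd timing loop from lapack-test.c ... */
    return 0;
}
```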

@martin-frbg (Collaborator)

@carlkl did you build with -fno-tree-vectorize?

@carlkl commented Nov 8, 2022

No, but compiling the test program with -fno-tree-vectorize does not help.
I did not compile OpenBLAS with this flag; I was under the impression that this is no longer necessary in trunk?
However, I can compile OpenBLAS again and set this flag globally, but this will take some time.
What about -fno-common?

@carlkl commented Nov 8, 2022

Or -ffp-contract=off?

@martin-frbg (Collaborator)

Neither of these should "lead to" segfaults, but the tree vectorizer is (or at least was) a known source of problems with gcc 12.
I assume the segfault is somewhere inside dsytrd, but the backtrace only shows the invocation, as that is the last bit of instrumented code.

@brada4 (Contributor) commented Nov 9, 2022

You need to change the "global" setting in COMMON_OPT in Makefile.rule, and run make clean between builds.
