
thread safety in openblas 0.2.20 #1425

Closed
tkswe88 opened this issue Jan 19, 2018 · 26 comments · Fixed by #1460

@tkswe88

tkswe88 commented Jan 19, 2018

In one of my simulation codes, written in Fortran 2008, there is a loop over a number of signal frequencies, and for each frequency a system of linear equations is solved with calls to ZGBTRF and ZGBTRS in the loop body. Since some effort is required to generate the system matrix and RHS vector for each frequency, I decided to parallelize the loop using OpenMP, meaning that only a single thread is used to solve the system of linear equations at a given signal frequency. This used to work fine with openblas 0.2.19 and the Intel Math Kernel Library.
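To make the setup concrete, here is a minimal sketch of the loop structure described above, in Python/NumPy rather than the actual Fortran/OpenMP code; `build_system` is a hypothetical stand-in for the real matrix assembly, and a dense solve stands in for ZGBTRF/ZGBTRS. The frequencies are processed in parallel, each iteration solving its own system single-threaded, and with a thread-safe solver the parallel results must match a sequential run exactly:

```python
# NumPy analogue (not the author's Fortran) of the parallel-over-frequencies
# loop: each iteration assembles and solves its own linear system.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def build_system(freq, n=50):
    # Hypothetical stand-in for assembling the system matrix and RHS vector.
    rng = np.random.default_rng(int(freq * 1000))
    a = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant
    b = rng.standard_normal(n)
    return a, b

def solve_one(freq):
    # Stands in for the single-threaded ZGBTRF/ZGBTRS calls in the loop body.
    a, b = build_system(freq)
    return np.linalg.solve(a, b)

freqs = [0.1, 0.5, 1.0, 2.0]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(solve_one, freqs))
sequential = [solve_one(f) for f in freqs]

# With a thread-safe solver the two runs must agree exactly.
assert all(np.allclose(p, s) for p, s in zip(parallel, sequential))
```

A disagreement between the parallel and sequential results is exactly the symptom reported here.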

After compiling openblas 0.2.20 with gcc and gfortran versions 6.3 or 7.2 on Ubuntu 17.10, using make TARGET=HASWELL USE_OPENMP=1 BINARY=64 FC=gfortran, I have trouble obtaining correct matrix inversion results from the parallelized loop (with its calls to ZGBTRF and ZGBTRS in the loop body) whenever I set the number of threads working on the loop to more than one via the OMP_NUM_THREADS variable. It seems that this is a thread-safety problem in openblas 0.2.20. A workaround I tested is to set the number of OpenBLAS threads to 1 just before the start of the loop using "call openblas_set_num_threads(1)" in the Fortran source code. However, this sets OMP_NUM_THREADS to 1 too, so even though the results of the workaround are correct, it is not satisfactory, because the loop is then executed sequentially. From the description in the current README.md, this behaviour is somewhat unexpected.

I hope that this description helps you to localise the problem. I would be more than happy to assist in checking updates using my code.

By the way, do you have any plans to write parallelised versions of the Cholesky decomposition (DPPTRF and DPPTRS)?

Keep up the good work!

@brada4
Contributor

brada4 commented Jan 19, 2018

I trust you are doing make clean between builds with different parameters...

There is no OpenMP hierarchy support, and you will oversubscribe cores by parallelizing over already-parallel OpenBLAS; that means slow but accurate.

OPENBLAS_NUM_THREADS has no effect on an OpenMP build.

Can you try the development version (if you are not friends with git or svn, go to the "Code" tab of the project and download a source zip of the repository)? It would be good to know whether it is already fixed...

Can you make a synthetic sample with small random/checkerboard/ones/zeroes input matrices? The latter three, turned into a graphics bitmap, reveal things like off-by-one loops etc.

An accuracy problem is a fairly certain pointer that something is not thread safe (or that you did not run "make clean").

@tkswe88
Author

tkswe88 commented Jan 19, 2018

Many thanks for the swift response.

I issued the command make clean before compiling openblas v. 0.2.20 and also before compiling my own code. I have redone it to be sure, and the results are still wrong when OMP_NUM_THREADS is larger than one. With the development version 0.3.0 (just installed), it is the same.

I should have been more explicit. The results are not just wrong: they also differ between program runs, using both openblas 0.2.20 and 0.3.0.

@tkswe88 tkswe88 closed this as completed Jan 19, 2018
@tkswe88
Author

tkswe88 commented Jan 19, 2018

Sorry, I closed this by mistake

@tkswe88 tkswe88 reopened this Jan 19, 2018
@brada4
Contributor

brada4 commented Jan 20, 2018

All those functions come from reference LAPACK. Just saying they don't work does not help. At least a*a' should serve as an illustration of how wrong the result is; also include your CPU type.

@tkswe88
Author

tkswe88 commented Jan 20, 2018

Sorry for not being specific enough.
I got the problem on two CPUs, an i7-4770 and an i5-4210M, so both are of the Haswell architecture.
Sending the actual code where the problem appears is not an option, but I will try to produce a much simpler example next week.

@brada4
Contributor

brada4 commented Jan 20, 2018

A simple profiler like perf record + perf report may give some idea of the call chain (you can redact your own function names).
While LAPACK is single-threaded, under the hood it will eventually call multi-threaded BLAS functions, which may, in some hard-to-reproduce race condition, overwrite each other's temporary allocations.

@martin-frbg
Collaborator

You could try compiling OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 set in case the problem comes from threaded GEMM. Changes between 0.2.19 and 0.2.20 included a complete LAPACK update and several thread safety fixes that may have introduced new problems. If providing a code sample is not an easy option but you still have a strong interest in getting this solved, perhaps you could try using git bisect to identify the problematic commit:
1. Use git checkout with the revision hash of a version that you expect still worked, and verify that your code actually works with that version.
2. Check out a newer version that you assume to be broken and run your code with it again. If it still works, pick an even newer version until you find one that really is broken.
3. Enter git bisect start, followed by git bisect good <hashnumber> and git bisect bad <hashnumber> with the revision hashes of the two stages you just identified.
4. This will check out an intermediate revision for you to test; enter either git bisect good or git bisect bad depending on the result. This narrows down the search until at some point git responds with <hash> is the first bad commit, followed by the archived commit message.
5. At that point you can use git bisect reset to end the test and git checkout to return to the current development version.

@tkswe88
Author

tkswe88 commented Feb 1, 2018

Hi,
I recompiled (after issuing make clean) with USE_SIMPLE_THREADED_LEVEL3=1, but this did not fix the problem. Following the hash history is definitely an option. Would you mind getting me started on this and helping me with the hash numbers of versions 0.2.19 and 0.2.20?

@martin-frbg
Collaborator

0.2.20 is 5dde4e6, 0.2.19 is 85636ff (you can find this information by clicking on the "(35) releases" tab at the top of the "Code" view).


@tkswe88
Author

tkswe88 commented Feb 12, 2018

Hi,
sorry for the long delay!
I followed the advice on bisecting, doing a complete recompilation (including make clean) and a new test run with my example at each step. This led to the following output from git:

1153e3a is the first bad commit
commit 1153e3a
Author: Werner Saar [email protected]
Date: Sat Jan 7 08:41:42 2017 +0100

filtered out -fopenmp and fix for mingw

:040000 040000 f95854d7e6259ea7f4f0a379816254750fb34bef 9ad1e3dec1776d590f5bb07639ecfc27ea333405 M lapack-netlib

I hope this will help you in finding the problem. Let me know, if I can do further tests.

@martin-frbg
Collaborator

Thank you. The only functional change from that commit appears to be the unconditional removal of the -fopenmp compiler flag when building LAPACK. According to comments in Makefile.system (and the corresponding cmake file derived from it), this is/was somehow necessary for building with gcc on Windows, but I do not see why it should suddenly have become universally required. Possibly this was a peculiarity of wernsaar's mingw installation at that time (not getting detected as OS_WINDOWS and failing to build with the flag still set). Could you make sure that your code runs correctly when you drop that single change from lapack-netlib/SRC/Makefile? (Easiest would be to just change the two occurrences of "$(OPTS1)" at the very end of the Makefile back to "$(OPTS)".)

@martin-frbg
Collaborator

Hmm. Seems I had stumbled on this oddity in #1269 already, but unfortunately without realizing its significance.

@martin-frbg
Collaborator

Even more deja vu - I was bitten by (an earlier version of) this myself in #329, and the only rationale for disabling OpenMP on Windows seems to have been a LAPACK test failure when compiling 0.2.8 with the then-current version of TDM-GCC (#286) - and haven't we seen oddities with this particular port of gcc since then?

@brada4
Contributor

brada4 commented Feb 13, 2018

I think it is worth mentioning somewhere in the FAQ that one can cross-build with mingw for a Win32 host, e.g. from Linux to Win32. I find the latter option easier. Errm... what about just changing the /quickbuild.win scripts to instruct cross-builds?

@martin-frbg
Collaborator

The drawback is that it would require one to have a Linux system available, or to be capable of and willing to set one up for this purpose. It could be mentioned on the wiki page, I guess, but my feeling is that the page has become confusing enough with the recent addition of experimental cmake/flang builds.
Removing that easter egg from the LAPACK Makefile introduces a minimum requirement of gcc 4.9, but that appears to be due to OpenMP 4.0 directives (TASK DEPEND) in a few files that were new in LAPACK 3.7.0. (I wonder if those may have prompted the controversial change?)

@brada4
Contributor

brada4 commented Feb 13, 2018

In this case I doubt that the mingw cross-compiler (essentially a renamed gcc.exe) would help.

@tkswe88
Author

tkswe88 commented Feb 13, 2018

I followed the instructions to modify the Makefile and, indeed, this has resolved the thread-safety problem in openblas. I have tested this on the platforms for which I originally reported the problem (i7-4770 and i5-4210M, gcc and gfortran v. 7.2) and additionally on an AMD TR 1950X that I got only last week. For both v. 0.2.20 and the current development version, the modification of the Makefile solves the problem.

In my original post there was another problem (very unclearly described, though) regarding the performance of the Cholesky decomposition. I had entirely forgotten to mention that I used DPPSV for packed format, which did not perform well at all using openblas (seemingly no parallelisation) but is a factor of 9 or so faster using sequential MKL, and also scales well with the number of cores using MKL. Replacing the call to DPPSV by a call to DTPTTR (conversion to regular format), followed by a call to DPOSV (Cholesky decomposition), solved this problem. Using openblas and just a single thread, the sequence of DTPTTR and DPOSV is 8-10 times faster than DPPSV, and when increasing the number of threads the sequence of DTPTTR and DPOSV seems to scale very well. With this modification, my code (which calls various BLAS and LAPACK routines) linked to openblas is almost on par with linking it to MKL (roughly 10% slower, depending on problem size). I think this is very good. Many thanks for the work you put into openblas!
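As a rough illustration of the reshuffle described above (a NumPy sketch of the idea, not the actual LAPACK calls), the workaround amounts to expanding the packed triangle to regular storage, which is what DTPTTR does, and then factoring and solving in full storage, which is what DPOSV does:

```python
# NumPy sketch: lower-packed storage -> full storage (DTPTTR's job),
# then a Cholesky factor-and-solve in full storage (DPOSV's job).
import numpy as np

def unpack_lower(ap, n):
    """Expand column-major lower-packed storage to a full symmetric (n, n) matrix."""
    a = np.zeros((n, n))
    k = 0
    for j in range(n):              # column j holds the elements a[j:, j]
        m = n - j
        a[j:, j] = ap[k:k + m]      # fill lower triangle
        a[j, j:] = ap[k:k + m]      # mirror into the upper triangle
        k += m
    return a

n = 4
full = np.array([[4., 1, 0, 0],
                 [1, 4, 1, 0],
                 [0, 1, 4, 1],
                 [0, 0, 1, 4]])     # symmetric positive definite test matrix
packed = np.concatenate([full[j:, j] for j in range(n)])  # lower-packed form

a = unpack_lower(packed, n)
c = np.linalg.cholesky(a)           # factor step of DPOSV
b = np.ones(n)
x = np.linalg.solve(a, b)           # solve step

assert np.allclose(a, full)
assert np.allclose(c @ c.T, full)
assert np.allclose(full @ x, b)
```

The speed difference reported above then comes purely from the full-storage factorization being parallelized in openblas while the packed-storage path is not.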

There are two points that, depending on your response, I may want to formulate as suggestions for future development:

  1. After plenty of initial tests with the AMD TR 1950X processor, it seems that openblas (tested 0.2.19, 0.2.20 and the development version on Ubuntu 17.10 with kernel 4.13, gcc and gfortran v. 7.2) operates roughly 20% slower on the TR1950X than on an i7-4770 (4 cores) when using 1, 2 and 4 threads. This is somewhat surprising, given that both CPUs use at most AVX2 and thus should be comparable in terms of vectorisation potential. I have already adjusted the OpenMP thread affinity to rule out that the (hidden) NUMA architecture of the TR1950X causes its lower performance. Other measures I took were 1) a BIOS upgrade, 2) a Linux kernel upgrade to 4.15, and 3) an increased DIMM frequency from 2133 to 2666 MHz. Except for the latter, which gave a speedup of roughly 3%, these measures did not have any effect on execution speed. Do you have any idea where the degraded performance on the TR1950X comes from? Is this related to a current lack of optimisation in openblas, or do we just have to wait for the next major release of gcc/gfortran to fix the problem? Of course, I would be willing to run tests if this would help in developing openblas.
  2. In the first response to my original post, brada4 commented that I was oversubscribing cores by calling parallel openblas routines inside an OpenMP-parallelised loop. This is of course correct. Until very recently I had mostly used MKL, which - to my knowledge - automatically switches to sequential BLAS and LAPACK routines when these are called from within an OpenMP-parallelised loop. Would such a mechanism of automatically switching between sequential and threaded subroutine versions be possible to implement in openblas?
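To sketch what such a switching mechanism could look like (a toy in Python, nothing that OpenBLAS currently provides; all names here are hypothetical), a solver entry point could check a per-thread flag and fall back to a sequential code path when it is invoked from inside a parallel region:

```python
# Toy sketch of "automatic sequential fallback inside a parallel region".
# _state.in_parallel is a per-thread flag set by the parallel-region driver.
import threading
from concurrent.futures import ThreadPoolExecutor

_state = threading.local()

def solve(data, threaded_solver, sequential_solver):
    # If this thread is already inside a parallel region, avoid nested threading.
    if getattr(_state, "in_parallel", False):
        return sequential_solver(data)
    return threaded_solver(data)

def parallel_region(fn, items, workers=4):
    """Run fn over items in parallel, marking each worker thread as nested."""
    def wrapped(item):
        _state.in_parallel = True
        try:
            return fn(item)
        finally:
            _state.in_parallel = False
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(wrapped, items))

calls = []
seq = lambda x: calls.append("seq") or x * 2   # sequential stand-in
par = lambda x: calls.append("par") or x * 2   # threaded stand-in

solve(1, par, seq)                                     # top level: threaded path
parallel_region(lambda x: solve(x, par, seq), [1, 2])  # nested: sequential path
assert calls[0] == "par" and all(c == "seq" for c in calls[1:])
```

A real implementation would of course have to hook into the OpenMP runtime (e.g. omp_in_parallel) rather than a Python flag; this only illustrates the dispatch logic being asked about.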

@brada4
Contributor

brada4 commented Feb 13, 2018

  1. AMD has slightly slower AVX and AVX2 units per core; it is by no means slow in general, and it still has a heap of cores spare. Sometimes optimal AVX2 saturation means the whole CPU package dropping to base, i.e. non-turbo, frequency.
  2. Desirable, but this is not yet in OpenBLAS. Ideally you should be able to set OMP_NUM_THREADS=4,1 or 2,2 etc. - sizing the threading at all levels for an optimal result.

@MigMuc

MigMuc commented Feb 13, 2018

@tkswe88 you could try the blis framework https://developer.amd.com/amd-cpu-libraries/blas-library/ which has optimized kernels for the Zen architecture. But indeed, optimization for the new processors could be a goal for future OpenBLAS releases.

@martin-frbg
Collaborator

Could also be that getarch is mis-detecting the cache sizes on TR, or that the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133 and #1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).
On the topic of DPPSV vs. DPOSV: the latter is basically POTRF, for which the netlib LAPACK implementation is overloaded by an optimized and parallelized implementation in lapack/potrf.

@tkswe88
Author

tkswe88 commented Feb 13, 2018

@brada4: Many thanks for the clarification regarding the frequencies employed for AVX and AVX2 in the TR1950X. I will follow your advice on setting the OMP_NUM_THREADS variable.
@MigMuc: I tried the blis and flame libraries provided by AMD in binary form, but AMD seem to have forgotten to include the LAPACK interface lapack2flame. So, in the linking phase of compiling my code, there are errors saying that not a single LAPACK routine is found. I tried recompilation from the blis and flame sources, which ended up in segmentation faults when running my code. Also, while blis seems to be actively developed (it has a zen target), there seems to be very little development activity for flame on github (the manual is 4 years old!).
@martin-frbg: Are there any tests I could do to find out about cache size detection errors or more appropriate settings for loop unrolling?

@brada4
Contributor

brada4 commented Feb 13, 2018

It will not work; OpenBLAS will always spin up the full number of top-level threads, no matter that it is called at a nested level. N*N*... oversubscription.

Regarding what you asked martin: the copied parameters may need to be doubled or halved, at least here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0

@martin-frbg
Collaborator

martin-frbg commented Feb 13, 2018

You could simply check the autodetected information in config.h against the specification. (As far as I can determine, the L3 size seems to be ignored as a parameter.)
As far as appropriate settings go, the benchmark directory contains a few tests that can be used for timing individual functions. Adjusting the values in param.h (see https://github.com/xianyi/OpenBLAS/pull/1157/files) is a bit of a black art, though.

@martin-frbg
Collaborator

At least one of our Travis checks appears to still use gcc 4.6, so it falls over the OMP TASK DEPEND.
Is anybody experienced enough with OpenMP to tell (what may be a silly question) whether replacing the "OMP TASK DEPEND" with a simple OpenMP 3.0 "OMP TASK" would still be safe (if less efficient) in code fragments like

  !$OMP TASK DEPEND(in:WORK(MYID+SHIFT-1))
  !$OMP$     DEPEND(in:WORK(MYID-1))
  !$OMP$     DEPEND(out:WORK(MYID))

(from https://github.com/xianyi/OpenBLAS/blob/develop/lapack-netlib/SRC/chetrd_hb2st.F#L517)
or what else would be available in OpenMP 3.0 to achieve the desired effect? Should this be "OMP SERIALIZE" for _OPENMP < 201307?

@MigMuc

MigMuc commented Feb 16, 2018

Regarding the thread-safety of Fortran programs, there is a variable in Makefile.rule which has to be set according to its explanation:

# gfortran option for LAPACK
# enable this flag only on 64bit Linux and if you need a thread safe lapack library
# Flags for POWER8 are defined in Makefile.power. Don't modify FCOMMON_OPT
# FCOMMON_OPT = -frecursive

I found some further explanations in:
http://wwwf.imperial.ac.uk/~mab201/20120814.html
