thread safety in openblas 0.2.20 #1425
There is no OMP hierarchy support, so parallelizing on top of an already parallel OpenBLAS will oversubscribe cores; that means slow but still accurate. OPENBLAS_NUM_THREADS has no effect on an OMP build. Can you try the development version (if you are not friends with git or svn, go to the "Code" tab of the project and download a source zip of the repository)? Good to know it is fixed... Can you make synthetic sample code with small random/checkerboard/ones/zeros input matrices? The latter three, turned into a graphics bitmap, reveal things like off-by-one loops. An accuracy problem is a fairly certain pointer that something is not thread safe (or that you did not run "make clean").
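As an illustration of the kind of synthetic input suggested above, here is a minimal pure-Python sketch (no BLAS dependency; the function names and the 4x4 size are my own choices, not from the thread). The idea is that a*a' on a checkerboard or all-ones matrix has an easily predictable result, so any threading or indexing bug shows up immediately:

```python
# Hypothetical sketch of the suggested synthetic test matrices
# (checkerboard / ones); names and sizes are illustrative only.

def checkerboard(n):
    """n x n matrix of alternating 0.0 / 1.0 entries."""
    return [[float((i + j) % 2) for j in range(n)] for i in range(n)]

def ones(n):
    return [[1.0] * n for _ in range(n)]

def mat_mul_transpose(a):
    """Naive a * a' to serve as a reference against the BLAS result."""
    n = len(a)
    return [[sum(a[i][k] * a[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Reference results one can diff against a threaded BLAS run:
ref_checker = mat_mul_transpose(checkerboard(4))
ref_ones = mat_mul_transpose(ones(4))
```

For ones(4), every entry of a*a' is 4.0; for the checkerboard, entries are 2.0 where row and column indices share parity and 0.0 elsewhere, so a wrong result is visible at a glance.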
Many thanks for the swift response. I issued the command make clean before compiling openblas v. 0.2.20 and also before compiling my own code. I have redone it to be sure, and the results are still wrong when OMP_NUM_THREADS is larger than one. With the development version 0.3.0 (just installed), it is the same. I should have been more explicit: the results are not just wrong, they also differ between program runs, with both openblas 0.2.20 and 0.3.0.
Sorry, I closed this by mistake.
All those functions come from reference LAPACK. Just saying they don't work does not help. At the least, a*a' should serve as an illustration of how wrong the result is, and please include your CPU type.
Sorry for not being specific enough.
A simple profiler like perf record + perf report may give some idea of the call chain (you can redact your own function names).
You could try compiling OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 set, in case the problem comes from threaded GEMM. Changes between 0.2.19 and 0.2.20 included a complete LAPACK update and several thread safety fixes that may have introduced new problems. If providing a code sample is not an easy option but you still have a strong interest in getting this solved, perhaps you could try using
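As a sketch, a rebuild with that option might look like the following (TARGET, BINARY and FC are taken from the reporter's build command elsewhere in this thread; this is a build-configuration fragment, not something runnable in isolation):

```shell
# Rebuild OpenBLAS with the simple threaded level-3 code path
# (flags other than USE_SIMPLE_THREADED_LEVEL3 match the reporter's setup).
make clean
make TARGET=HASWELL USE_OPENMP=1 BINARY=64 FC=gfortran USE_SIMPLE_THREADED_LEVEL3=1
```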
Hi,
Trivial tests with odd inputs did not yield results:
Hi, git bisect identifies 1153e3a as the first bad commit:
:040000 040000 f95854d7e6259ea7f4f0a379816254750fb34bef 9ad1e3dec1776d590f5bb07639ecfc27ea333405 M lapack-netlib
I hope this will help you in finding the problem. Let me know if I can do further tests.
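For anyone wanting to reproduce this kind of result, a bisect session producing such output can be sketched as follows (a hedged reconstruction; the exact good/bad revisions used here are assumptions, not taken from the thread):

```shell
# Hypothetical reconstruction of the bisect run; revision names are assumptions.
git bisect start
git bisect bad develop          # wrong results with OMP_NUM_THREADS > 1
git bisect good v0.2.19         # last release known to work
# At each step: rebuild (make clean; make ...), run the failing test case, then
#   git bisect good    # or: git bisect bad
# until git prints "<sha> is the first bad commit".
git bisect reset
```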
Thank you. The only functional change from that commit appears to be the unconditional removal of the -fopenmp compiler flag when building LAPACK. According to comments in Makefile.system (and the corresponding cmake file derived from it), this is/was somehow necessary for building with gcc on Windows, but I do not see why it should suddenly become universally required. Possibly this was a peculiarity of wernsaar's mingw installation at that time (not getting detected as OS_WINDOWS and
Hmm. It seems I had stumbled on this oddity in #1269 already, but unfortunately without realizing its significance.
Even more deja vu: I was bitten by (an earlier version of) this myself in #329, and the only rationale for disabling OpenMP on Windows seems to have been a LAPACK test failure when compiling 0.2.8 with the then-current version of TDM-GCC (#286). And haven't we seen oddities with this particular port of gcc since then?
I think it is worth mentioning somewhere in the FAQ that one can cross-build for a Win32 host, either on mingw or from Linux to Win32. I find the latter option easier. What about just changing the /quickbuild.win scripts to instruct cross-builds?
The drawback is that it would require one to have a Linux system available, or to be capable of and willing to set one up for this purpose. It could be mentioned on the wiki page, I guess, but my feeling is that it has become confusing enough with the recent addition of experimental cmake/flang builds.
In this case I share the doubt that a mingw cross-compiler (essentially a renamed gcc.exe) would help.
I followed the instructions to modify the Makefile and, indeed, this has resolved the problem with openblas thread safety. I have tested this on the platforms for which I originally reported the problem (i7-4770 and i5-4210M, gcc and gfortran v. 7.2) and additionally on an AMD TR 1950X that I got only last week. For both v. 0.2.20 and the current development version, the modification of the Makefile solves the problem.

In my original post, there was another problem (very unclearly described, though) regarding the performance of the Cholesky decomposition. I had entirely forgotten to mention that I used DPPSV for packed format, which did not perform well at all using openblas (seemingly no parallelisation) but is a factor of 9 or so faster using sequential MKL and also scales well with the number of cores using MKL. Replacing the call to DPPSV by a call to DTPTTR (conversion to regular format), followed by a call to DPOSV (Cholesky decomposition), solved this problem. Using openblas and just a single thread, the sequence of DTPTTR and DPOSV is 8-10 times faster than DPPSV, and with an increasing number of threads the sequence of DTPTTR and DPOSV seems to scale very well.

Using this modification, my code (which calls various BLAS and LAPACK routines) linked to openblas is almost on par with when it is linked to MKL (roughly 10% slower, depending on problem size). I think this is very good. Many thanks for the work you put into openblas! There are two points that, depending on your response, I may want to formulate as suggestions for future development:
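To make the DPPSV workaround above concrete: the DTPTTR step merely expands LAPACK's packed triangular storage into a conventional 2-D array that DPOSV can then factorize. A pure-Python sketch of that unpacking for the upper-triangular case (the function name and example values are mine, not from the thread):

```python
def unpack_upper_packed(ap, n):
    """Pure-Python analogue of what LAPACK's DTPTTR('U', ...) does:
    expand an upper triangular matrix stored in column-major packed
    format into a full n x n array, with zeros below the diagonal."""
    a = [[0.0] * n for _ in range(n)]
    k = 0
    for j in range(n):              # packed column by column
        for i in range(j + 1):      # rows 0..j of column j are stored
            a[i][j] = ap[k]
            k += 1
    return a

# Example: the 3x3 upper triangle
#   [[1, 2, 4],
#    [0, 3, 5],
#    [0, 0, 6]]
# is packed as [1, 2, 3, 4, 5, 6].
full = unpack_upper_packed([1, 2, 3, 4, 5, 6], 3)
```

The performance difference the reporter observed is plausible in general terms: packed routines trade memory for speed, since the irregular indexing defeats the blocked, cache-friendly kernels that full-storage routines like DPOSV can use.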
@tkswe88 you could try the BLIS framework https://developer.amd.com/amd-cpu-libraries/blas-library/ which has optimized kernels for the Zen architecture. But indeed, optimization for the new processors could be a goal for future OpenBLAS releases.
Could also be that getarch is mis-detecting the cache sizes on TR, or that the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133, #1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).
@brada4: Many thanks for the clarification regarding the frequencies employed for AVX and AVX2 on the TR 1950X. I will follow your advice on setting the OMP_NUM_THREADS variable.
It will not work: OpenBLAS will always spin up the full number of top-hierarchy threads, no matter that it is called at a lower level. As for what you asked martin: the copied parameters may need to be at least doubled or halved here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0
You could simply check the autodetected information in config.h against the specification. (As far as I can determine, the L3 size seems to be ignored as a parameter.)
At least one of our Travis checks appears to still use gcc 4.6, so it falls over the OMP TASK DEPEND.
(from https://github.com/xianyi/OpenBLAS/blob/develop/lapack-netlib/SRC/chetrd_hb2st.F#L517)
Regarding thread safety of Fortran programs, there is a variable in Makefile.rule which has to be set according to the explanation:
I found some further explanations in:
In one of my simulation codes written in Fortran 2008, there is a loop over a number of signal frequencies, and for each of these frequencies a system of linear equations is solved using calls to ZGBTRF and ZGBTRS in the loop body. Since there is some effort required to generate the system matrix and RHS vector for each frequency, I decided to parallelize the loop using OpenMP, meaning that only a single thread is used to solve the system of linear equations at a given signal frequency. This used to work fine with openblas 0.2.19 and the Intel Math Kernel Library.
After compiling openblas 0.2.20 using gcc and gfortran versions 6.3 or 7.2 on Ubuntu 17.10 with make TARGET=HASWELL USE_OPENMP=1 BINARY=64 FC=gfortran, I have trouble obtaining correct matrix inversion results from the calls to ZGBTRF and ZGBTRS in the parallelized loop whenever I set the number of threads working on the loop to more than one via the OMP_NUM_THREADS variable. It seems that this is a problem with thread safety in openblas 0.2.20. A workaround I tested is to set the OPENBLAS_NUM_THREADS variable to 1 just before the start of the loop using "call openblas_set_num_threads(1)" in the Fortran source code. However, this sets OMP_NUM_THREADS to 1 too, so even though the results of the workaround are correct, it is not satisfactory, because the loop is then executed sequentially. From the description given in the current README.md file, this behaviour is somewhat unexpected.
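The loop structure described above (parallel over frequencies, one sequential factorization and solve per frequency) can be sketched language-agnostically. Here is an illustrative Python stand-in, with a toy scalar "solve" in place of the ZGBTRF/ZGBTRS calls; all names and values are assumptions made for the sketch:

```python
# Illustrative sketch of the pattern from the report: parallelism across
# frequencies, each worker performing one single-threaded solve. The toy
# solve_for_frequency() stands in for building the banded system and
# calling ZGBTRF/ZGBTRS; nothing here is OpenBLAS-specific.
from concurrent.futures import ThreadPoolExecutor

def solve_for_frequency(freq):
    # Build the (here 1x1) "system matrix" and "RHS" for this frequency,
    # then solve it; a real code would call the banded LAPACK routines.
    a = 2.0 + freq      # toy 1x1 system matrix
    b = 1.0             # toy 1x1 right-hand side
    return b / a

frequencies = [0.0, 1.0, 2.0, 3.0]
with ThreadPoolExecutor(max_workers=2) as ex:
    results = list(ex.map(solve_for_frequency, frequencies))
```

The key property of this pattern is that each per-frequency solve is independent, so correctness must not depend on which thread runs it; that is exactly the property the reporter found violated by the threaded library.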
I hope that this description helps you to localise the problem. I would be more than happy to assist checking updates using my code.
By the way, do you have any plans to write parallelised versions of the Cholesky decomposition (DPPTRF and DPPTRS)?
Keep up the good work!