Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R result error when with skylake-avx512 build #2168

Closed
ghost opened this issue Jun 20, 2019 · 11 comments
Closed

R result error when with skylake-avx512 build #2168

ghost opened this issue Jun 20, 2019 · 11 comments

Comments

@ghost
Copy link

ghost commented Jun 20, 2019

With 0.3.6 build targeting skylake-avx512, one of Rbenchmark will fail as result wrong (detail attached below). This remains the same under both USE_OPENMP=1 or 0. The openblas is built with GCC v9.1.1. R version is 3.6.0.

The R test code is attached:
R-benchmark-25_issue_test.R.txt

Output of the test is as below:
Loading required package: Matrix
Error in chol.default(a) :
the leading minor of order 2 is not positive definite
Calls: run -> system.time -> chol -> chol -> chol.default
Timing stopped at: 0.014 0.017 0.031
Execution halted

@brada4
Copy link
Contributor

brada4 commented Jun 20, 2019

R is single-threaded multi-processing thing, you can just type "make" in openblas build to get native pthread version, optimal as Rblas.so replacement.

Though I am almost confident it is compiler problem -c heck #2154 for compiler flags to work with GCC9 - can you checl if older release of your system with not that radical compiler version does work.
NOTE: distributors R suffers from same ABI problem in this case XOR check C flags in R Makeconf file.

Can you do some more tests:
A/ Is there any difference between base::chol() and Matrix::chol()
B/ If function still fails - build openBlas with TARGET=Haswell (or if built with DYNAMIC_ARCH, run with OPENBLAS_CORETYPE=Haswell)

@martin-frbg
Copy link
Collaborator

Are you using 0.3.6, or a snapshot of the current develop branch ? If 0.3.6, you are probably missing #2111 - assuming that the R benchmark is using double precision. (0.3.6 unfortunately had only a partial workaround for the broken AVX512 DGEMM kernel, issue #1955).

@brada4
Copy link
Contributor

brada4 commented Jun 20, 2019

@martin-frbg R is using only double precision, the default Rblas(.so|.dll) does not include single precision codes.

@ghost
Copy link
Author

ghost commented Jun 28, 2019

@brada4 Build with 'Target=Haswell' does not report this issue. And, regarding your comment to martin's suggestion, a bit confused that whether martin's suspect is still valid?

@brada4
Copy link
Contributor

brada4 commented Jun 28, 2019

Depends on how often you reinstall your R
You can build develop today and place libopenblas.so.0 where libRblas.so is in R installation, if you manage a symlink then you can rebuild every now ant then.

If you want to keep R for year(s) you might try to build reduced version without risky codes:
NO_AVX512=1 // disable skylake codes that are still not that stable
NO_LAPACK=1 // there is some breakage in fortran ABI regarding single-character parameters, so you will use libRlapack.so that has aligned ABI with R itself.

It might help if you share benchmarks between this and Microsoft R (that includes MKL), so we understand better how much worse is haswell code on skyx

@ghost
Copy link
Author

ghost commented Jun 28, 2019

So you mean AVX512 version of openblas is not quite stable as of today, and recommend to use Target=Haswell ?

@martin-frbg
Copy link
Collaborator

The AVX512 support in OpenBLAS is practically just SGEMM and DGEMM. There are no known problems with the SGEMM, but the DGEMM code is now known to produce wrong results. Unfortunately I had disabled only parts of it in 0.3.6 which still allowed errors for some inputs. (Sadly I have failed to find the errors in the original code contribution, and its author appears to have been busy elsewhere during the last three or four months.)

@brada4
Copy link
Contributor

brada4 commented Jun 28, 2019

In R context only DGEMM matters, so yup, it is not stable, though there is a good hope it is disabled well enough

@SarahGoslee-USDA
Copy link

I can confirm that this bug in OpenBLAS on AVX-512 is "in the wild" with practical implications - I just spent a great deal of time trying to figure out why a particular colorspace conversion was failing, and it turns out to have been due to this.

The following R code gives ridiculous results (different values for the same color) on an Intel i9-7900X CPU running Fedora FC30 workstation with R 3.6.0 and openblas 0.3.6-2.fc30

converts red from RGB to Lab colorspace

white <- c(x = 0.953205712549377, 1, y = 1.08538438164692)
red10.rgb <- structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), .Dim = c(10L, 3L), .Dimnames = list( NULL, c("r", "g", "b")))
convertColor(red10.rgb, from = "sRGB", to = "Lab")

but correct answers if R is started with AVX-512 disabled:

env OPENBLAS_CORETYPE=Haswell R --vanilla

@brada4
Copy link
Contributor

brada4 commented Jul 19, 2019

You can compile with DYNAMIC_ARCH=1 NO_AVX512=1 to avoid any suspect strains.
I was thinking of adding toplevel description of this knob aka Makefile.rule, but parameter works anyway.
Thank you for fine reproducer

@martin-frbg
Copy link
Collaborator

Fixed now by wjc404's new AVX512 DGEMM kernel from #2286.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants