Optimizing for POWER8 on big-endian #2299
Comments
From my latest tests, POWER9 passes BLAS-Tester on little endian. I just need to look at the LAPACK failures. |
POWER9 probably passes on LE, since this is where most attention is at, but FreeBSD is currently BE-only. If POWER9 works as well, even better, but it currently fails. |
@pkubaj, where do you get failures with big-endian POWER9? There were some recent (post-0.3.7) fixes (#2233, #2263, #2269) for getting the POWER9 code compiled correctly with both gcc 9.2 and various older versions of gcc and binutils, but I am not sure if anyone has looked for potential issues with endianness yet. |
This test fails when optimizing for POWER9 on BE:
Other than this one, there may be others failing later. |
I suspect this may actually be SGEMM failing (the test calls SPOTRF followed by SGEMM, then compares against the initial input); there will probably be corresponding failures in the lapack-derived test/ctest as well. Unfortunately my current OpenPower host does not appear to provide big-endian POWER9. @quickwritereader, does yours? |
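The round trip described above (factorize, multiply back, compare with the input) can be sketched in plain C without any BLAS or LAPACK dependency. This is only an illustration of the idea, not the actual OpenBLAS test: every name here (`SKETCH_N`, `sketch_sqrtf`, `sketch_cholesky`, `sketch_roundtrip_error`) is hypothetical, and the small Newton square root exists only so the sketch does not need libm.

```c
#define SKETCH_N 3

/* Newton iteration for a square root; an assumption made only to keep
 * this sketch free of libm, not something OpenBLAS does. */
static float sketch_sqrtf(float v)
{
    if (v <= 0.0f) return 0.0f;
    float r = v > 1.0f ? v : 1.0f;
    for (int i = 0; i < 60; i++)
        r = 0.5f * (r + v / r);
    return r;
}

/* Lower Cholesky factor l of a symmetric positive definite matrix a.
 * Returns 0 on success, or j + 1 if pivot j is not positive. */
static int sketch_cholesky(float a[SKETCH_N][SKETCH_N],
                           float l[SKETCH_N][SKETCH_N])
{
    for (int i = 0; i < SKETCH_N; i++)
        for (int j = 0; j < SKETCH_N; j++)
            l[i][j] = 0.0f;

    for (int j = 0; j < SKETCH_N; j++) {
        float d = a[j][j];
        for (int k = 0; k < j; k++)
            d -= l[j][k] * l[j][k];
        if (d <= 0.0f) return j + 1;
        l[j][j] = sketch_sqrtf(d);

        for (int i = j + 1; i < SKETCH_N; i++) {
            float s = a[i][j];
            for (int k = 0; k < j; k++)
                s -= l[i][k] * l[j][k];
            l[i][j] = s / l[j][j];
        }
    }
    return 0;
}

/* Largest absolute difference between l * l^T and the original a. */
static float sketch_roundtrip_error(float a[SKETCH_N][SKETCH_N],
                                    float l[SKETCH_N][SKETCH_N])
{
    float worst = 0.0f;
    for (int i = 0; i < SKETCH_N; i++) {
        for (int j = 0; j < SKETCH_N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < SKETCH_N; k++)
                sum += l[i][k] * l[j][k];
            float diff = sum - a[i][j];
            if (diff < 0.0f) diff = -diff;
            if (diff > worst) worst = diff;
        }
    }
    return worst;
}
```

If the multiplication step is wrong, the factorization can still succeed while the reconstructed matrix differs from the input, which is why a failure of such a test does not by itself say which of the two routines is at fault.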
If you have a bare-metal host, you can create VMs in any endianness on POWER. So you can create a BE VM on an LE host. If that's not possible, I can give you SSH access to my host (just note that it runs FreeBSD). |
I did not consider endianness, as my goal was little endian. For big endian, one would need to change the swap masks and also check the save macros, especially for single precision. |
@martin-frbg I think we can request a big-endian one. But the current one is little endian. |
Right now I see my recent fixes with the gcc7-generated assembly backfiring on power8be - at the very least the assembler is not going to accept the .localentry and .size statements generated by the little-endian compiler... |
I think you don't actually mean endianness here, but ABI version (ELFv2). These concepts are often mixed up, even by IBM. Creating a little-endian OS with ELFv1 is pointless nowadays, but OpenSUSE had an LE ELFv1 variant a few years ago, when ELFv2 was only starting. However, there are actively developed systems that are BE for POWER and use the ELFv2 ABI, e.g. Adelie Linux (BE exclusively on PPC) or Void Linux (it has variants for both LE and BE, but all are ELFv2). After your message, I also tested develop on an ELFv1 system and it fails to build, but 0.3.7 builds there just fine (for POWER8, I use GCC 9.2), but fails with tests:
If there are no plans to fix it I think we'll just offer an option to optimize for POWER8 only on ELFv2. I also think that you shouldn't "fix" compiler problems with such workarounds. OpenBLAS users probably use the latest compilers anyway to squeeze more performance and GCC 9 works. |
What I wrote may be a little unclear, so a short summary (all variants are BE): So POWER6 works everywhere, POWER9 doesn't work everywhere, POWER8 works on ELFv2. |
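Since byte order and ELF ABI version are independent properties, they can be probed separately. The following is a minimal sketch, assuming a GCC- or Clang-compatible compiler that predefines `__BYTE_ORDER__` and, on 64-bit PowerPC ELF targets, `_CALL_ELF`. The function names are hypothetical, and this is not necessarily the check used in OpenBLAS.

```c
#include <stdint.h>
#include <string.h>

/* Compile-time byte order from the compiler's predefined macros:
 * 1 for big endian, 0 for little endian, -1 if the macros are missing. */
static int sketch_compiled_big_endian(void)
{
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__)
    return __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__;
#else
    return -1;
#endif
}

/* Run-time byte order: look at the first byte of a known 32-bit value. */
static int sketch_runtime_big_endian(void)
{
    uint32_t probe = 0x01020304u;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0x01;
}

/* ELF ABI version as reported by the compiler (assumed to be 1 or 2 on
 * 64-bit PowerPC ELF targets); 0 when _CALL_ELF is not defined at all. */
static int sketch_powerpc_elf_abi(void)
{
#if defined(_CALL_ELF)
    return _CALL_ELF;
#else
    return 0;
#endif
}
```

On a big-endian ELFv2 system such as the ones mentioned above, one would expect the byte-order probes to report big endian while the ABI probe reports 2, which is the combination that a check on endianness alone would misclassify.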
Thanks for the explanation - ELFv1 it is then (and I am checking if the problem can be papered over with a few |
AFAIK the proper check is |
Right, but I wanted to try first if my trivial fix was sufficient - unfortunately it is not (segfaults in all the min/max/axpy functions that I tampered with to work around miscompilation by older gcc). Guess I will either have to try and merge the differences seen in the respective gcc-generated assembly file for big-endian, or revert my kludgy fix for #2254 before a 0.3.8 release. |
I have checked it now:
I also did those tests for POWER6:
So yeah, I think we'll just stay with PPC970 and option for POWER6. Thanks for help! |
BLAS-Tester suggests the majority of the POWER8BE problems may actually be with DAMIN/DAMAX rather than the less readable (and old) DGEMM/ZGEMM kernels. |
I could help with handling the big-endian case. |
They certainly don't deliver a consistent message. On the one hand, they are truly saying that BE is obsolete and everyone should migrate to LE. On the other hand, I read some announcements from them that they are committed to supporting both BE and LE both in Linux (there are no plans to remove ppc64 from Linux) and in i/AIX. |
#1997 has/had someone from the big-endian branch of IBM trying |
IDAMIN/MAX and IZAMIN/MAX to be precise - replacing them with their generic equivalents from ../arm makes the lapack tests pass again with the "usual" error count of 1/0/1/0. |
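For reference, the behaviour expected from a generic real-valued IDAMAX/IDAMIN can be sketched in a few lines of portable C. This is not the code from ../arm; the names `sketch_idamax` and `sketch_idamin` are hypothetical, and the sketch assumes the usual BLAS convention of a 1-based result that is 0 for `n < 1` or `incx < 1`. The complex variants, which compare |re| + |im|, are not covered.

```c
/* 1-based index of the first element with the largest absolute value. */
static int sketch_idamax(int n, const double *x, int incx)
{
    if (n < 1 || incx < 1) return 0;

    int best = 1;
    double best_abs = x[0] < 0.0 ? -x[0] : x[0];
    for (int i = 1; i < n; i++) {
        double v = x[(long)i * incx];
        if (v < 0.0) v = -v;
        if (v > best_abs) {   /* strict comparison keeps the first hit */
            best_abs = v;
            best = i + 1;
        }
    }
    return best;
}

/* 1-based index of the first element with the smallest absolute value. */
static int sketch_idamin(int n, const double *x, int incx)
{
    if (n < 1 || incx < 1) return 0;

    int best = 1;
    double best_abs = x[0] < 0.0 ? -x[0] : x[0];
    for (int i = 1; i < n; i++) {
        double v = x[(long)i * incx];
        if (v < 0.0) v = -v;
        if (v < best_abs) {   /* strict comparison keeps the first hit */
            best_abs = v;
            best = i + 1;
        }
    }
    return best;
}
```

A scalar loop like this has no dependence on vector lane order, which is presumably why swapping it in sidesteps the big-endian problem in the optimized kernels.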
Since 0.3.8 supposedly has those errors fixed, I tried it again:
This is OpenBLAS 0.3.8 on FreeBSD on ELFv2, optimizing for POWER8. Should I assume that the test cases that fail are actually wrong and it's ok to offer OpenBLAS builds for POWER8 BE now? |
No, that does not look good - what compiler version is this ? |
GCC 9.2. |
For the record, this is a POWER issue, not a general FreeBSD issue, since amd64 seems ok:
|
Also, the errors seem to depend on the ELF ABI version. 0.3.8/POWER8/ELFv1:
8100 of those COMPLEX fails are in Mixed-Precision-linear-equation-routines-LIN/xlintstzc. |
Just did a test on 0.3.10 for POWER8 on ELFv2:
So there are 24 other errors. Did you get the same results? |
Don't remember seeing this, but will need to retest. ("Other error" usually is "on entry to XX parameter YY had an invalid value" so not a good thing to see) |
BTW, same happens when trying POWER9, so it's possible that this is a bug in some shared code and once it's fixed, POWER9 kernels will also work on BE. |
POWER9BE gets mapped to POWER8BE currently, as the current POWER9 kernels were written for LE only |
Curiously I am getting the same numbers in the "nb tests run" (and "numerical error") columns as you, but only zeroes for "other error" (and no hint of anything else wrong in testing-results.txt either). |
This issue is now part of a closed milestone; should probably be shifted to 0.3.11? |
Think actual breakage for which it was created is fixed, and I failed to reproduce the "other errors" pkubaj saw (possibly related to compiler version). So more inclined to close it again unless there is new input. |
testing_results.txt |
|
Seems to be tests for various error exits in some of the newer "2stage" algorithms that are failing (by setting a wrong error code) |
The code returned by xerbla in these cases (well, I only looked up zhbevd_2stage right now) is the number of off-diagonal matrix elements that did not converge to zero, so apparently we are just doing a little better than expected with some iterative algorithm. (Possibly an accuracy issue due to FMA operations?) |
I'm on GCC 9.3.0. Should I test 10.1.0? |
Trying to install that right now - I just double-checked that my freebsd vm in the unicamp minicloud has 9.2.0 so that could explain differences in our results. (I am still not convinced that these are actually harmful) |
10.1.0 gets me 8100 "other errors" in COMPLEX and around 20000 numerical errors in COMPLEX16 at our default optimization levels. Let's see if dropping gfortran optimization to O1 or O0 helps... I hate moving targets :( |
Instead of reducing gcc optimization to O1 globally, it is sufficient to add a pragma to caxpy.c and disable the assembly microkernel in zdot.c. |
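The general shape of such a per-file workaround can be sketched as follows. This is a hedged illustration only, not the actual change to caxpy.c or zdot.c: the pragma is GCC's real `#pragma GCC optimize`, but the `SKETCH_USE_MICROKERNEL` switch and all function names are hypothetical, and complex values are stored as interleaved (re, im) pairs purely for simplicity.

```c
/* Lower the optimization level for the functions defined after this
 * point. The pragma is GCC-specific, so it is guarded accordingly. */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC optimize ("O1")
#endif

/* Hypothetical switch: left undefined so the portable loop is used
 * instead of a hand-written assembly microkernel. */
/* #define SKETCH_USE_MICROKERNEL */

/* y += alpha * x for n single-precision complex numbers. */
static void sketch_caxpy(int n, float ar, float ai,
                         const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        float xr = x[2 * i], xi = x[2 * i + 1];
        y[2 * i]     += ar * xr - ai * xi;
        y[2 * i + 1] += ar * xi + ai * xr;
    }
}

#ifdef SKETCH_USE_MICROKERNEL
#error "no microkernel is provided in this sketch"
#else
/* Unconjugated dot product of n double-precision complex numbers,
 * written as a plain C loop in place of a microkernel. */
static void sketch_zdotu(int n, const double *x, const double *y,
                         double *re, double *im)
{
    double sr = 0.0, si = 0.0;
    for (int i = 0; i < n; i++) {
        double xr = x[2 * i], xi = x[2 * i + 1];
        double yr = y[2 * i], yi = y[2 * i + 1];
        sr += xr * yr - xi * yi;
        si += xr * yi + xi * yr;
    }
    *re = sr;
    *im = si;
}
#endif
```

Confining the change to individual files keeps the rest of the library at its default optimization level, which is the advantage over lowering the level globally.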
Sorry for the long lag, I was busy with other things. What do you mean by disabling assembly microkernel in zdot.c? https://github.com/xianyi/OpenBLAS/blob/develop/kernel/power/zdot.c has no assembly and https://github.com/xianyi/OpenBLAS/blob/develop/kernel/power/KERNEL.POWER8 doesn't list zdot.S. |
zdot.c starts with an |
BTW, regarding those XERBLA test errors. By "apparently we are just doing a little better than expected with some iterative algorithm", do you mean those errors are safe to ignore? It would be nice to get this issue closed :) |
I am guessing that they are safe to ignore, unfortunately there are no clear guidelines for how to assess testsuite failures (and the testsuite is not free from bugs itself). And if what I wrote above is still correct, reducing the gcc optimization level for caxpy and disabling the zdot microkernel took care of these anyway. Unfortunately my minicloud account seems to have expired in the meantime, and the gcc compile farm does not offer this combination of HW and OS, so it may take me a few days to get back to this. If you can test, adding
at the start of caxpy.c and adding something like |
I patched those files:
But still experience the same issue (that's on ELFv2):
Since those failures are probably not an issue, I guess it's ok to close this. |
Which gcc ? ISTR that patch was needed to work around a ton of "other error"s seen with 10.1 |
Ah, ok. I was still on 9.3.0. I'll test with 10.1.0. |
Hm, same issue after applying that patch:
|
Probably. What I meant was that it looked a lot worse with 10.1 before that patch. |
Regained minicloud access, the result I get for the current
(so interestingly no "other errors" counted but still about 1700 fewer COMPLEX tests completed compared to your latest results - incidentally exactly the same number as you saw on Jun 15 with gcc9.3 - and later with 10.1) |
Thanks, I guess it's something about my environment then. I'll soon post an update to 0.3.10 for FreeBSD and add an option to use POWER8 kernels. Thanks! |
We're preparing an update of the OpenBLAS port in FreeBSD (to 0.3.7) and I'm doing tests on POWER. Previously, we optimized for PPC970 with an option to optimize for POWER6. Now I also tested newer CPUs (they didn't work before). POWER9 still fails, but it looks like OpenBLAS built for POWER8 passes all tests. This is all done on big-endian.
Can I conclude that it's safe to optimize for POWER8 on the big-endian variant?