Knights Landing support #991
Comments
Xeon Phi is not a general-purpose CPU; it does not support some normal instructions available since the i386, and from your measurement it looks like many others are emulated in firmware. BTW, can you point to any reference that MIC is based on or related to Haswell in any way? |
As far as I know, the Knights Landing generation of Phi is binary compatible with Haswell - even if some or even most of that is likely to be microcode emulation trickery, Haswell is probably still a better OpenBLAS target for it than the Atom architecture it is somewhat distantly related to. |
CPUID says no MMX or SSE (on the other hand, Intel writes that you have to check CPUID bits before emitting special instructions) |
Martin Kroeker writes:
Indeed. [As far as I know, all Xeons are microcoded anyhow.]
For what it's worth, ATLAS has Knights Corner (I think) support, but I
Unfortunately not, but I could suggest another site to try. Also, there There is avx512 support for small matrix multiplication in |
CPUID says no MMX or SSE
Please stop talking rubbish about it, not that those are relevant.
# cpuid -1 | egrep 'MMX|SSE |simple synth'
(simple synth) = Intel Xeon Phi x200 (Knights Landing), 14nm
MMX Technology = true
SSE extensions = true
XCR0 supported: SSE state = true
I've no idea in what way it's not "general purpose" either; it's running
vanilla CentOS 7 like any other server.
(on the other hand, Intel writes that you have to
check CPUID bits before emitting special instructions)
Yes, I modified dynamic.c.
https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf#section.B.8
Would be interesting to at least measure GEMM to find if SSE2 or AVX
or AVX2 works best)
It wouldn't; we know how they compare.
|
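For readers following the CPUID argument above, here is a minimal sketch of how the relevant feature bits decode. It works on raw register values passed in as integers rather than executing CPUID itself, so the function name and the sample value are illustrative assumptions; the bit positions are the ones documented for CPUID leaf 1 (EDX) and leaf 7, subleaf 0 (EBX).

```python
# Bit positions in CPUID leaf 1, register EDX.
LEAF1_EDX = {"mmx": 23, "sse": 25, "sse2": 26}

# Bit positions in CPUID leaf 7 (subleaf 0), register EBX.
LEAF7_EBX = {"avx2": 5, "avx512f": 16, "avx512pf": 26,
             "avx512er": 27, "avx512cd": 28}


def decode_bits(value, table):
    """Return {feature_name: bool} for each bit listed in `table`.

    `value` is a raw 32-bit register value (hypothetical input; this
    sketch does not run the CPUID instruction).
    """
    return {name: bool((value >> bit) & 1) for name, bit in table.items()}


if __name__ == "__main__":
    # Made-up EDX value with only the MMX and SSE bits set.
    sample_edx = (1 << 23) | (1 << 25)
    print(decode_bits(sample_edx, LEAF1_EDX))
```

On a real system the register values would come from a tool such as cpuid(1) or from the compiler's CPUID intrinsics; checking the bits, rather than assuming them from the CPU name, is the point made in the quoted Intel guidance.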
How do we know? That spec explicitly says there is not even MMX in CPUID... though I don't really believe that. Can you at least check Sandybridge vs Haswell (with timings in seconds, not "2-5x faster, and we know it")? An emulator will not help with cache timings... |
My knowledge may well be outdated, but isn't ATLAS "just" plain C with some clever functions to tune loop unrolling etc. for the build target (making it easier to "port", but harder to match well-written machine code for a given target) ? And thanks for the pointer to libxsmm - the goals and license seem similar enough, but at first glance the code organization looks too complicated to even suggest trying a quick-and-dirty replacement of one of OpenBLAS' Haswell-optimized functions with its AVX512-using equivalent. |
You wrote:
It has assembler kernels, if that's what you mean. Knights Corner is For what it's worth, I just got a pointer to some level of support for |
You wrote:
Please stop responding like this; it's only likely to drive people away |
OK, on the MIC emulator the fastest kernel is the one that matches the host CPU. |
I work for Intel. |
Linux has no trouble figuring out the
|
Thanks for your insight. (Actually, I am not sure we still need to argue about Haswell compatibility or the chance of deducing it from CPUID features.) Dave Love's patch was committed as part of PR #1010 on November 7, and from similar open issues I consider it likely that not all of the observed performance difference from MKL comes from not using AVX512. |
@martin-frbg Certainly, not using AVX-512 is the leading-order reason that performance is behind MKL, but one also needs to consider the KNL cache hierarchy (no L3, one L2 for each tile of 2 cores) carefully. Finally, using instruction encodings longer than 8 bytes reduces performance. The Intel manuals and Agner Fog's website have details. |
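To make the cache-hierarchy point concrete, here is a rough sizing sketch. It assumes the publicly documented 1 MiB of L2 per KNL tile (a figure not stated in this thread), assumes each of the two cores gets half of it, and asks how large a square double-precision tile can be if three such tiles (blocks of A, B and C) must fit. The function name and the three-tile model are simplifying assumptions, not how OpenBLAS or MKL actually choose block sizes.

```python
import math


def max_square_tile(cache_bytes, share=0.5, tiles=3, elem_bytes=8):
    """Largest n such that `tiles` n-by-n blocks fit in `share` of the cache.

    share=0.5 models two cores splitting one L2 evenly (an assumption).
    elem_bytes=8 corresponds to double precision.
    """
    budget = int(cache_bytes * share)
    return math.isqrt(budget // (tiles * elem_bytes))


if __name__ == "__main__":
    KNL_L2_BYTES = 1 << 20  # 1 MiB per tile (assumed, see lead-in)
    print(max_square_tile(KNL_L2_BYTES))
```

Real kernels use rectangular panels and leave headroom for other data, so this only gives an order of magnitude for how quickly a missing L3 constrains the blocking.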
(It is already good to know that Knights Landing can run Haswell code.) |
Jeff Hammond writes:
* Mixing SSE and AVX instructions is a bad idea but there is no good
reason to do this.
[For what it's worth, it seems there is reason in the case of fftw, not
that I'm suggesting it here, obviously
<FFTW/fftw3@579cec9>.]
* [LIBXSMM](https://github.com/hfp/libxsmm) is written by some of the
smartest people at Intel and is another great resource for microkernel
insight.
As mentioned previously, but the paper is now published to marvel at.
Its non-SMM DGEMM I referenced earlier didn't do significantly better
than openblas on KNL qua Haswell when I measured it some minor versions
ago.
* BLIS already supports KNL
([code](https://github.com/flame/blis/tree/master/kernels/x86_64/knl))
well and should be consulted for BLAS implementation insight.
Thanks! I'm sure I looked, but may have been fooled by it not being
released. Do you happen to know if there are performance figures
available?
|
@loveshack Thanks for the pointer to that FFTW issue. I commented already. I have seen BLIS performance on KNL but the data is not mine to share. You might try to measure yourself if you have a KNL system. I won't have time to do it for a while. |
A while ago I got access to KNL again, got some results, and intended to pursue them further but didn't before losing access again. Here's what I have for serial dgemm with BLIS (knl kernel as of 2017-02-20) on KNL 7290, compared with MKL, OpenBLAS 0.2.19 for Haswell, and what was meant to be libxsmm's own dgemm (which needs checking for suspicious similarity to OB). |
Thanks for that update, a pity indeed that OpenBLAS fares that poorly. I may have "more time to spend with friends and family" in the near future so maybe I will actually get around to learning assembly some day. |
LIBXSMM calls BLAS DGEMM when it lacks a native implementation. Because you are using large matrices, it will not surprise me at all if that is what is happening here. You should try to use LIBXSMM with your own cache blocking to ensure that it is called in a way that makes it execute its own code. |
Also, LIBXSMM will do a lot better than an AVX2 kernel on KNL, so I do not believe you are measuring LIBXSMM here. LIBXSMM usually beats MKL on KNL for small matrices. See publications listed on the LIBXSMM GitHub page for details. |
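The "own cache blocking" suggestion can be sketched as follows: split the large product into small tiles and hand each tile to a small-matrix kernel, so the kernel only ever sees small problems. This is a plain-Python illustration of the loop structure only; `small_gemm` is a hypothetical stand-in and not LIBXSMM's API, and the block size is arbitrary.

```python
def small_gemm(a, b, c, i0, j0, k0, bm, bn, bk):
    """Hypothetical small kernel: C[i0:i0+bm, j0:j0+bn] += A_block * B_block.

    A real setup would call a small-matrix library routine here instead.
    """
    for i in range(i0, i0 + bm):
        for k in range(k0, k0 + bk):
            aik = a[i][k]
            for j in range(j0, j0 + bn):
                c[i][j] += aik * b[k][j]


def blocked_gemm(a, b, block=2):
    """Compute C = A * B by looping over tiles of at most block x block."""
    m, kdim, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for k0 in range(0, kdim, block):
                # Edge tiles may be smaller than `block`.
                small_gemm(a, b, c, i0, j0, k0,
                           min(block, m - i0),
                           min(block, n - j0),
                           min(block, kdim - k0))
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(blocked_gemm(a, b, block=2))
```

With this structure, whether a call is served by the library's own code or by a BLAS fallback depends only on the tile size chosen by the caller, which makes the measurement easier to interpret.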
LIBXSMM calls BLAS DGEMM when it lacks a native implementation.
Because you are using large matrices, it will not surprise me at all
if that is what is happening here.
It was meant to be linked against the "noblas" library variant, which
defines dgemm_; something doubtless went wrong with that, despite trying
to check the linkage at the time. Unfortunately I lost the box before I
could re-check.
[Correction that I should have made before in case of confusion: it should be the libxsmm
"ext" variant, not "noblas", but that won't have been the reason for the result.]
|
I'll try to reproduce later.
If you need KNL access, I can add you to my NERSC project so you can use Cori. Write me privately if this is of interest.
|
I could have sworn I'd updated this a while ago... I guess it's It turns out that the libxsmmext library I tried before is only I've tried BLIS again, after getting the 0.3.0 release working on Also, I noticed these slides on KNL DGEMM, which might be of interest: |
AVX512 seems to be available in the low-end i3-8121U now, which should provide a much cheaper testbed. |
@martin-frbg Intel SDE is a free testbed for AVX-512. Sure, it's just an emulator, but it is great for getting the code working. The performance of AVX-512 will be quite different on Core i3, Knights Landing, and Skylake Xeon, so I don't see a lot of utility in buying a Core i3 for AVX-512 support unless that is your primary target for OpenBLAS. Also, the low-end Xeon Scalable and Xeon W parts with AVX-512 support are pretty cheap (e.g. Xeon Bronze 3104 and Xeon W 2123), although Xeon Bronze with one VPU will have different performance characteristics than the Gold and Platinum parts with two VPUs. |
In addition to what Jeff said, I'm sure there will be people to test on
real SKX hardware as well as KNL, if it gets that far, and I'd have
thought someone would be able to provide development access, though I
can't.
There isn't so much of a need for it now though, with BLIS' dynamic
dispatch on x86_64 recently released.
|
I have a NERSC allocation and will do my best to provide accounts to anyone who is going to port OpenBLAS to KNL. However, I'd like to see those parties demonstrate interest by doing a functional port with SDE before requesting a NERSC account for them. |
cpuid(1) on a Knights Landing system shows:
So this patch detects it as HASWELL:
Is there any chance of AVX512 support? (Sorry I couldn't fund it.) Unfortunately, with HASWELL on KNL, OpenBLAS dgemm is about three times slower than MKL.
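As a rough illustration of the kind of family/model check involved (not the actual patch), the sketch below decodes a CPUID leaf 1 EAX signature into family and model and maps Knights Landing (family 6, model 0x57) to the HASWELL target. The function names, the `UNKNOWN` fallback and the sample signature 0x50671 are assumptions made for the example.

```python
def decode_signature(eax):
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    family = base_family
    if base_family == 0xF:
        family += ext_family

    model = base_model
    if base_family in (0x6, 0xF):
        # The extended model supplies the high nibble for these families.
        model |= ext_model << 4
    return family, model, stepping


KNL_MODEL = 0x57  # Knights Landing, family 6


def pick_target(eax):
    """Hypothetical dispatch: treat KNL as HASWELL, as described above."""
    family, model, _ = decode_signature(eax)
    if family == 6 and model == KNL_MODEL:
        return "HASWELL"
    return "UNKNOWN"


if __name__ == "__main__":
    print(pick_target(0x50671))  # assumed example KNL signature
```

A dedicated KNL target with AVX512 kernels would replace the `"HASWELL"` return value in a scheme like this; until then the Haswell kernels are the closest match.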