Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial support for SkylakeX / AVX512 #1589

Merged
merged 16 commits into from
Jun 6, 2018

Conversation

fenrus75
Copy link
Contributor

@fenrus75 fenrus75 commented Jun 3, 2018

This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:

  1. 512 bit wide SIMD (2x width of AVX2)
  2. 32 SIMD registers (2x the number on AVX2)

This initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".

Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.

fenrus75 and others added 3 commits June 3, 2018 07:58
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)

This initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".

Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
@martin-frbg
Copy link
Collaborator

Build failures are probably due to lack of AVX512 support in the old compiler/assembler versions used on travis - same build runs fine locally. (I still wonder if we should add a test for this in c_check, as wanting to do a DYNAMIC_ARCH build on an older system may be "more legit" today than the same for Haswell/avx2 ?)
The appveyor DYNAMIC_ARCH build used to barely make it under the default 1h limit even before the addition of yet another CPU type...

@fenrus75
Copy link
Contributor Author

fenrus75 commented Jun 3, 2018

well I'm only half sympathetic to DYNAMIC_ARCH if you don't have a toolchain that supports AVX512... likely best to somehow not use it.

You can make an argument that on linux, with a sort of modern glibc you want something else;
you'd want a generic binary in /usr/lib64, a haswell style binary in /usr/lib64/haswell and a skylakex binary in /usr/lib64/haswell/avx512_1 .. glibc will pick up the right one. And with this you can compile the library with good compiler flags
(this is how the Clear Linux distro ships openblas)

@fenrus75
Copy link
Contributor Author

fenrus75 commented Jun 3, 2018

maybe SKYLAKEX shouldn't be in DYNAMIC_ARCH for now? might be the easiest path forward

it can be hooked up later once the CI infra issues are resolved

@martin-frbg
Copy link
Collaborator

I am just testing a patch to c_check locally (along the lines of the MIPS have_msa test) to set NO_AVX512=1 if the build system is not up to the task. This would indeed only be for users of DYNAMIC_ARCH who do not even expect to include support for such "modern" targets in their library.

@martin-frbg
Copy link
Collaborator

Well, probably best to remove SKYLAKEX from DYNAMIC_ARCH for the moment, and re-enable it once appropriate autodetection is in place.
I expect one would need to expand or clone suppport_avx() in cpuid_x86.c (build-time) and driver/others/dynamic.c (run-time detection) to modify the target type returned for the relevant cpuid numbers.
(As it is now, your SkylakeX code would never get called in DYNAMIC_ARCH mode anyway, and without DYNAMIC_ARCH it would only get built if one forces TARGET=SKYLAKEX. Which is probably fine at this stage - though perhaps it could be unconditionally enabled for Knights Landing to improve the situation with #991 ...)

@martin-frbg
Copy link
Collaborator

Never mind - the autodetection code is basically in place already, just that it has been treating Skylake X the same as Skylake so far. I'll try to fix this up with the NO_AVX512 define for backward compatibility later today.

@fenrus75
Copy link
Contributor Author

fenrus75 commented Jun 3, 2018

you're clearly much more familiar with this part of the code than I am, but if there's something I can do to help let me know.

In the mean time I'll see what it takes to get avx512 DGEMM going...so far it looks using dgemm_kernel_16x2_haswell.S and not dgemm_kernel_4x8_haswell.S as a base is the way to go for that.

@fenrus75
Copy link
Contributor Author

fenrus75 commented Jun 3, 2018

(also my patch won't run on KNL as it is; it's using AVX512VL not just base AVX512F. I can make it work on KNL I suppose, I'll poke around the office for a system to test if it's important)

martin-frbg and others added 7 commits June 3, 2018 23:13
this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM

Performance for the not-retuned version is in the 30% range
@martin-frbg
Copy link
Collaborator

Not sure - should be "extern..." with the semicolon and "define" without it I think... sorry for trashing your PR like that BTW...

@fenrus75
Copy link
Contributor Author

fenrus75 commented Jun 4, 2018

sorry never mind, needed more coffee.

I don't think this is thrashing, it's more "collaborating" ;)

@martin-frbg
Copy link
Collaborator

Looks better now, but I have no idea why the two travis checks with the old llvm 3.4 fail - avx512 support appears to be there (at least enough of it to not trigger the check I added), and the library builds locally with llvm 3.9 (the oldest I have available).
The failing appveyor build uses llvm 6, but still cannot handle avx512 instructions for some reason
(maybe the defaults are different on windows), and the cmake system is still lacking a test for that.

@fenrus75
Copy link
Contributor Author

fenrus75 commented Jun 6, 2018

I seem to be missing where your check is ;-)
(but it's before coffee)

@fenrus75
Copy link
Contributor Author

fenrus75 commented Jun 6, 2018

it appears that for the clang build, not a correct -march is passed somehow

@martin-frbg
Copy link
Collaborator

This is probably peculiar to the build host - at least on the next appveyor run it should fail to compile the vaddps %zmm1, %zmm0, %zmm0 test case and alias SkylakeX to Haswell instead of bombing out. (The two failing clang runs in travis must have a different problem, but I have not found out what it is.)
I do not see an easy way to add -march=skylake-avx512 to the CFLAGS if the build host does not support it (in which case tools like getarch would not run anymore)

@martin-frbg
Copy link
Collaborator

Closing and reopening to trigger a CI rebuild

@martin-frbg martin-frbg closed this Jun 6, 2018
@martin-frbg martin-frbg reopened this Jun 6, 2018
@martin-frbg martin-frbg merged commit cf234a0 into OpenMathLib:develop Jun 6, 2018
@martin-frbg martin-frbg added this to the 0.3.1 milestone Jun 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants