AVX512 DGEMM kernel #2286

wjc404 · 2019-10-15T18:05:33Z

Originally designed for icopy_8 and ocopy_8; a transformation function was added to support icopy_4 (at ~5% performance loss). Currently at 80-85% of MKL performance.

martin-frbg · 2019-10-15T18:50:05Z

Great, thanks. I have been trying off and on to make fenrus75's dgemm code work but was not really getting anywhere with it.

martin-frbg · 2019-10-15T21:54:43Z

Seems the OSX libc lacks aligned_alloc, and a gcc version later than 7 is required for the "_mm256_cvtsd_f64" intrinsic.

martin-frbg · 2019-10-15T23:33:49Z

And I am happy to confirm that this passes all tests in BLAS-Tester, the DGELSD test from #2187 and also isuruf's DTRMM reproducer from #1955 (showing that trmm was indeed "only" affected indirectly by the change of incopy, no other side effect). The lapack tests show two errors in DHS although the detailed testing_results.txt claims all tests passed.

isuruf · 2019-10-16T04:21:50Z

I'm surprised that appveyor passes even though there's no posix_memalign on windows.

martin-frbg · 2019-10-16T06:35:11Z

I suspect the one Appveyor job that uses DYNAMIC_ARCH=1 does not use an AVX512-aware clang.
(And I have restarted the xcode job that appeared to have timed out)

martin-frbg · 2019-10-16T07:11:35Z

If https://stackoverflow.com/questions/196329/osx-lacks-memalign is to be trusted, posix_memalign is not available on older versions of OSX either (but _aligned_malloc is available on Windows, so probably make that one #ifndef OS_DARWIN - not sure what to use on that platform though).
Update: OSX appears to have posix_memalign since 10.6 "Snow Leopard" from 2009, so there is probably no need for a workaround there; just use _aligned_malloc on OS_WINDOWS and posix_memalign elsewhere.

martin-frbg · 2019-10-16T10:26:48Z

kernel/x86_64/dgemm_kernel_8x8_skylakex.c

-    double *b_scratch = (double *)aligned_alloc(64,192*k);
+    double *b_scratch;
+    posix_memalign(&b_scratch,64,192*k);


Rather than replacing this outright, turning it into

#if defined(OS_WINDOWS) ... _aligned_malloc ... (note the added underscore and "M") #else ... posix_memalign ... #endif

would probably take care of the platform issues?

wjc404 · 2019-10-16 (edited)

Thank you very much for the suggestions.
I later found that caching part of the blocked B matrix in b_scratch actually slows the program down a bit, so there is no need for aligned allocation.

90% MKL 1-thread performance.

wjc404 · 2019-10-19T13:20:12Z

A 4x8 kernel is added for direct interface to icopy_4, which gives improved 1-thread performance (85 GFLOPS) on Intel Xeon Platinum 8269CY (theoretical 105 GFLOPS, MKL 95 GFLOPS).
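
For reference, the quoted figures translate into the following ratios. The second line is an assumption-based estimate, not something stated in the thread: it supposes two AVX-512 FMA units per core, each handling 8 doubles with 2 flops per fused multiply-add, and derives the clock implied by the 105 GFLOPS peak.

```latex
\frac{85}{105} \approx 0.81 \ \text{(of theoretical peak)}, \qquad
\frac{85}{95} \approx 0.89 \ \text{(of MKL)}

2 \times 8 \times 2 = 32 \ \text{flops/cycle}, \qquad
\frac{105 \times 10^{9}\ \text{flops/s}}{32 \ \text{flops/cycle}} \approx 3.3 \ \text{GHz}
```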

martin-frbg · 2019-10-19T13:38:14Z

Great, thanks. Just let me know when you think it is ready for merging. I started to write a simple 8x8 trmm kernel earlier but could not get that to work, not sure if I made a silly mistake or if the existing generic kernels I used as guidance are broken. (This project has accumulated a few kernels of questionable status that are not in use by any target.)

wjc404 · 2019-10-19T14:25:12Z

I've finished updating the 2 kernel files. They will soon be ready for merging after I finish reliability tests on them.

wjc404 · 2019-10-20T09:40:25Z

For the 2 kernels, 1-thread tests with all combinations of 1<=m<=1173 && 1<=n<=85 && 1<=k<=267 passed. OK to merge.

Testing program is attached here:
dgemmtest.zip

wjc404 added 2 commits October 16, 2019 02:00

Add files via upload

844629a

Add files via upload

5da9484

Update dgemm_kernel_8x8_skylakex.c

6bd67dd

Update dgemm_kernel_8x8_skylakex.c

9b19e9e

martin-frbg reviewed Oct 16, 2019


wjc404 added 5 commits October 16, 2019 19:23

Add files via upload

b7315f8

make further changes to icopy_8 easier

6bcb06f

some correction

17cdd9f

Update dgemm_kernel_8x8_skylakex.c

0d669e0

native support for icopy_4

6ff013b

90% MKL 1-thread performance.

martin-frbg merged commit eaa0be1 into OpenMathLib:develop Oct 20, 2019

martin-frbg added this to the 0.3.8 milestone Oct 20, 2019

martin-frbg mentioned this pull request Aug 27, 2020

LU and eigen routines slower than MKL #2795

Open
