-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVX512 DGEMM kernel #2286
AVX512 DGEMM kernel #2286
Conversation
Great, thanks. I have been trying off and on to make fenrus75's dgemm code work but was not really getting anywhere with it. |
Seems the OSX libc lacks aligned_malloc, and a gcc version later than 7 is required for the |
And I am happy to confirm that this passes all tests in BLAS-Tester, the DGELSD test from #2187 and also isuruf's DTRMM reproducer from #1955 (showing that trmm was indeed "only" affected indirectly by the change of incopy, no other side effect). The lapack tests show two errors in DHS although the detailed testing_results.txt claims all tests passed. |
I'm surprised that appveyor passes even though there's no |
I suspect the one Appveyor job that uses DYNAMIC_ARCH=1 does not use an AVX512-aware clang. |
If https://stackoverflow.com/questions/196329/osx-lacks-memalign is to be trusted, posix_memalign is not available on older versions of OSX either (but aligned_malloc is available on Windows, so probably make that one #ifndef OS_DARWIN - not sure what to use on that platform though). |
double *b_scratch = (double *)aligned_alloc(64,192*k); | ||
double *b_scratch; | ||
posix_memalign(&b_scratch,64,192*k); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than replacing this, turning it into
#if defined(OS_WINDOWS)
... _aligned_malloc... (note the added undescore and "M")
#else
... posix_memalign ...
#endif
would probably take care of platform issues ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the suggestions.
Later I found caching part of blocked B matrix into b_scratch actually makes the program slow down a bit... so no need for aligned allocation.
A 4x8 kernel is added for direct interface to icopy_4, which gives improved 1-thread performance (85 GFLOPS) on Intel Xeon Platinum 8269CY (theoretical 105 GFLOPS, MKL 95 GFLOPS). |
Great, thanks. Just let me know when you think it is ready for merging. I started to write a simple 8x8 trmm kernel earlier but could not get that to work, not sure if I made a silly mistake or if the existing generic kernels I used as guidance are broken. (This project has accumulated a few kernels of questionable status that are not in use by any target.) |
I've finished updating the 2 kernel files. They will soon be ready for merging after I finish reliability tests on them. |
For the 2 kernels, 1-thread tests with all combinations of 1<=m<=1173 && 1<=n<=85 && 1<=k<=267 passed. OK to merge. |
Originally designed for icopy_8 and ocopy_8, add transformation function to support icopy_4 (with ~5% performance loss), currently 80-85% MKL performance.