Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX512 DGEMM kernel #2286

Merged
merged 9 commits into from
Oct 20, 2019
Merged

AVX512 DGEMM kernel #2286

merged 9 commits into from
Oct 20, 2019

Conversation

wjc404
Copy link
Contributor

@wjc404 wjc404 commented Oct 15, 2019

Originally designed for icopy_8 and ocopy_8, add transformation function to support icopy_4 (with ~5% performance loss), currently 80-85% MKL performance.

@martin-frbg
Copy link
Collaborator

Great, thanks. I have been trying off and on to make fenrus75's dgemm code work but was not really getting anywhere with it.

@martin-frbg
Copy link
Collaborator

Seems the OSX libc lacks aligned_malloc, and a gcc version later than 7 is required for the
"_mm256_cvtsd_f64" intrinsic.

@martin-frbg
Copy link
Collaborator

And I am happy to confirm that this passes all tests in BLAS-Tester, the DGELSD test from #2187 and also isuruf's DTRMM reproducer from #1955 (showing that trmm was indeed "only" affected indirectly by the change of incopy, no other side effect). The lapack tests show two errors in DHS although the detailed testing_results.txt claims all tests passed.

@isuruf
Copy link
Contributor

isuruf commented Oct 16, 2019

I'm surprised that appveyor passes even though there's no posix_memalign on windows.

@martin-frbg
Copy link
Collaborator

I suspect the one Appveyor job that uses DYNAMIC_ARCH=1 does not use an AVX512-aware clang.
(And I have restarted the xcode job that appeared to have timed out)

@martin-frbg
Copy link
Collaborator

martin-frbg commented Oct 16, 2019

If https://stackoverflow.com/questions/196329/osx-lacks-memalign is to be trusted, posix_memalign is not available on older versions of OSX either (but aligned_malloc is available on Windows, so probably make that one #ifndef OS_DARWIN - not sure what to use on that platform though).
Update: OSX appears to have posix_memalign since 10.6 "Snow Leopard" from 2009 so there is
probably no need for a workaround there, just use the aligned_alloc on OS_WINDOWS and posix_memalign elsewhere.

Comment on lines 432 to 433
double *b_scratch = (double *)aligned_alloc(64,192*k);
double *b_scratch;
posix_memalign(&b_scratch,64,192*k);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than replacing this, turning it into

#if defined(OS_WINDOWS)
 ... _aligned_malloc... (note the added undescore and "M") 
#else
... posix_memalign ...
#endif

would probably take care of platform issues ?

Copy link
Contributor Author

@wjc404 wjc404 Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the suggestions.
Later I found caching part of blocked B matrix into b_scratch actually makes the program slow down a bit... so no need for aligned allocation.

@wjc404
Copy link
Contributor Author

wjc404 commented Oct 19, 2019

A 4x8 kernel is added for direct interface to icopy_4, which gives improved 1-thread performance (85 GFLOPS) on Intel Xeon Platinum 8269CY (theoretical 105 GFLOPS, MKL 95 GFLOPS).

@martin-frbg
Copy link
Collaborator

Great, thanks. Just let me know when you think it is ready for merging. I started to write a simple 8x8 trmm kernel earlier but could not get that to work, not sure if I made a silly mistake or if the existing generic kernels I used as guidance are broken. (This project has accumulated a few kernels of questionable status that are not in use by any target.)

@wjc404
Copy link
Contributor Author

wjc404 commented Oct 19, 2019

I've finished updating the 2 kernel files. They will soon be ready for merging after I finish reliability tests on them.

@wjc404
Copy link
Contributor Author

wjc404 commented Oct 20, 2019

For the 2 kernels, 1-thread tests with all combinations of 1<=m<=1173 && 1<=n<=85 && 1<=k<=267 passed. OK to merge.
Screenshot from 2019-10-20 17-30-41
Screenshot from 2019-10-20 10-55-06
Testing program is attached here:
dgemmtest.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants