Add a "sgemm direct" mode for small matrixes #1914

Merged: 1 commit, Dec 13, 2018

@fenrus75 fenrus75 commented Dec 12, 2018

OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU-friendly memory layout.

This is great for large matrices; the cost of the copy is easily
amortized by the gains from the better memory layout.

But for small matrices (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.

This patch adds (for SKYLAKEX initially) a "sgemm direct" mode that bypasses
the whole copy machinery for ALPHA=1/BETA=0/... standard arguments,
for small matrices only.
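
To illustrate what "bypassing the copy" means, here is a minimal sketch of a direct single-precision multiply that reads A and B in place, with no packing step. It is not the kernel from this patch: the function name `sgemm_direct_sketch`, the row-major layout, and the plain scalar loops are all assumptions made for readability, and the real kernel is vectorized. It covers only the ALPHA=1/BETA=0 case, so C is overwritten with A*B.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not the OpenBLAS kernel.
 * Computes C = A * B (alpha = 1, beta = 0) for row-major matrices:
 *   A is m x k with row stride lda,
 *   B is k x n with row stride ldb,
 *   C is m x n with row stride ldc.
 * The inputs are read straight from caller memory; nothing is copied
 * or repacked first. */
void sgemm_direct_sketch(size_t m, size_t n, size_t k,
                         const float *a, size_t lda,
                         const float *b, size_t ldb,
                         float *c, size_t ldc)
{
    assert(lda >= k && ldb >= n && ldc >= n);

    for (size_t i = 0; i < m; i++) {
        float *crow = c + i * ldc;

        /* beta = 0: discard whatever was in C before. */
        for (size_t j = 0; j < n; j++)
            crow[j] = 0.0f;

        /* Accumulate one scaled row of B at a time so the inner loop
         * walks both B and C contiguously. */
        for (size_t p = 0; p < k; p++) {
            const float aip = a[i * lda + p];
            const float *brow = b + p * ldb;
            for (size_t j = 0; j < n; j++)
                crow[j] += aip * brow[j];
        }
    }
}
```

For a 2x2 input, calling `sgemm_direct_sketch(2, 2, 2, a, 2, b, 2, c, 2)` fills `c` with the ordinary matrix product of `a` and `b`.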

What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512.
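
The size cutoff can be sketched as a small predicate on the product M*N*K. The function and macro names below are hypothetical, and treating the measured values as strict upper bounds is an assumption; the text above only gives approximate ranges. The comparison is done by division so the product cannot overflow for very large dimensions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the cutoffs are the measured values quoted above. */
#define DIRECT_MNK_LIMIT_SINGLE   (28LL * 512 * 512)
#define DIRECT_MNK_LIMIT_THREADED (1LL * 512 * 512)

/* Returns true when m*n*k is strictly below the cutoff for the chosen
 * mode. Empty problems return false so the regular path handles them. */
bool sgemm_direct_is_small(long long m, long long n, long long k,
                           bool threaded)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    if (m == 0 || n == 0 || k == 0)
        return false;

    const long long limit = threaded ? DIRECT_MNK_LIMIT_THREADED
                                     : DIRECT_MNK_LIMIT_SINGLE;

    /* m*n*k < limit  <=>  m*n*k <= limit - 1, checked step by step. */
    if (m >= limit)
        return false;
    if (n > (limit - 1) / m)
        return false;
    const long long mn = m * n;   /* now known to be <= limit - 1 */
    return k <= (limit - 1) / mn;
}
```

With these cutoffs, a 256 x 256 x 64 problem qualifies in the non-threaded case but not in the threaded one.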

Single-threaded performance results:

   M     N     K      cycles  improvement
   1     1     1        84.3   2.9x
   2     2     2        98.4   2.7x
   3     3     3       190.1   1.6x
   4     4     4       122.0   2.9x
   6     6     6       285.1   2.0x
   8     8     8       214.0   2.2x
  16    16    16       495.6   2.0x
  32    32    32      1756.7   1.7x
  48    48    48      6436.2   1.4x
  64    64    64     10956.8   1.5x
  96    96    96     39499.2   1.3x
 128   128   128     87406.9   1.3x
 192   192   192    291711.4   1.2x
 256   256   256    834721.4   2.0%
 512   512   512   6732405.0   2.2%
1024  1024  1024  55072972.0  -1.0%
   4    16     9       163.9   2.7x
  64   128   192     64114.1   1.4x
  37    81   193     45321.8   0.2%
 512   412   800   9292010.9  -0.7%
 256     1   256     21914.6   2.5x
 256     2   256     34736.0   2.3x
 256     4   256     67692.1   1.8x
 256     8   256     74606.8   1.6x
 256    16   256     80248.1   1.6x
 256    32   256     98220.7   1.5x
 256    64   256    170672.0   1.5x
   1   256   256      6987.7   5.6x
   2   256   256      7426.5   5.8x
   4   256   256     10920.4   3.4x
   8   256   256     22304.4   2.3x
  16   256   256     43015.7   1.8x
  32   256   256     85714.6   1.5x
  64   256   256    172102.0   1.3x
 256   256     1     18217.9   1.9x
 256   256     2     19527.2   1.8x
 256   256     4     17520.1   2.2x
 256   256     8     25627.7   2.0x
 256   256    16     44892.1   1.6x
 256   256    32     95202.3   1.3x
 256   256    64    181141.2   1.3x
  35  8457  1760  84758181.7  -3.8%

@martin-frbg martin-frbg left a comment

MSVC probably expects __restrict here

@fenrus75

yup __restrict solves that, but it looks like AppVeyor timed out or something
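
For context on the `__restrict` point, one common way to keep a restrict-qualified signature portable is a small macro that selects the spelling each compiler accepts. This is only a sketch: the macro name `SKETCH_RESTRICT` and the helper `add_rows_sketch` are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: pick a spelling of "restrict" the compiler accepts.
 * MSVC accepts __restrict; C99 and later compilers accept restrict;
 * GCC-compatible compilers accept __restrict__ in older modes. */
#if defined(_MSC_VER)
#  define SKETCH_RESTRICT __restrict
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define SKETCH_RESTRICT restrict
#elif defined(__GNUC__)
#  define SKETCH_RESTRICT __restrict__
#else
#  define SKETCH_RESTRICT
#endif

/* Illustrative helper: y[i] += x[i]. The qualifier promises the compiler
 * that x and y do not overlap, which makes vectorizing the loop easier. */
void add_rows_sketch(size_t n,
                     const float *SKETCH_RESTRICT x,
                     float *SKETCH_RESTRICT y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += x[i];
}
```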

@fenrus75

(changed whitespace in the commit message to trigger CI to rerun)

@fenrus75

hm. now appveyor passes but travis did not, so reverse of before.
with zero changes in code...

@martin-frbg

I can restart travis jobs that had a spurious failure, but apparently I cannot do the same with appveyor.
Both tend to fail occasionally when network issues make the setup process fail or take longer than usual.

@martin-frbg martin-frbg merged commit 78d877b into OpenMathLib:develop Dec 13, 2018
@fenrus75 fenrus75 deleted the smallmatrix branch December 13, 2018 18:54