Add a "sgemm direct" mode for small matrixes #1914

fenrus75 · 2018-12-12T17:31:59Z

OpenBLAS has a fancy algorithm for copying the input data while laying
it out in a more CPU friendly memory layout.

This is great for large matrixes; the cost of the copy is easily
amortized by the gains from the better memory layout.

But for small matrixes (on CPUs that can do efficient unaligned loads) this
copy can be a net loss.

This patch adds (for SKYLAKEX initially) a "sgemm direct" mode that bypasses
the whole copy machinery for ALPHA=1/BETA=0/... standard arguments,
for small matrixes only.
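
To illustrate what "bypassing the copy" means, here is a minimal sketch of a GEMM that reads A and B straight from the caller's buffers for the ALPHA=1/BETA=0 case. It is not the kernel from this patch (which targets SKYLAKEX); the function name `sgemm_direct_sketch`, the row-major layout, and the plain scalar loops are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical illustration, not the OpenBLAS kernel: compute C = A * B
 * (ALPHA = 1, BETA = 0) for row-major float matrices, reading A and B
 * directly from the caller's buffers with no packing/copy step. */
void sgemm_direct_sketch(size_t m, size_t n, size_t k,
                         const float *a, size_t lda,
                         const float *b, size_t ldb,
                         float *c, size_t ldc)
{
    for (size_t i = 0; i < m; i++) {
        /* BETA = 0: the old contents of C are overwritten, never read. */
        for (size_t j = 0; j < n; j++)
            c[i * ldc + j] = 0.0f;

        /* ALPHA = 1: accumulate the products without any scaling. */
        for (size_t p = 0; p < k; p++) {
            const float aip = a[i * lda + p];
            for (size_t j = 0; j < n; j++)
                c[i * ldc + j] += aip * b[p * ldb + j];
        }
    }
}
```

The inner loop walks a row of B and a row of C contiguously, so a vectorizing compiler can use unaligned loads on it directly; that is the property the description relies on when it says the copy can be skipped.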

What is small? For the non-threaded case this has been measured to be
in the M*N*K = 28 * 512 * 512 range, while in the threaded case it's
less, around M*N*K = 1 * 512 * 512.
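
The size cutoff described above could be expressed as a predicate along these lines. This is a sketch only: the function name `use_sgemm_direct`, the macro names, and the `threaded` flag are hypothetical, and the real patch also requires the other "standard arguments" that the description elides with "...", which are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thresholds quoted in the PR description (products of M*N*K). */
#define DIRECT_LIMIT_SINGLE   (28LL * 512 * 512)  /* non-threaded case */
#define DIRECT_LIMIT_THREADED (1LL * 512 * 512)   /* threaded case */

/* Hypothetical dispatch check: take the direct path only for
 * ALPHA = 1, BETA = 0 and a problem small enough that the packing
 * copy would not pay for itself. */
bool use_sgemm_direct(int64_t m, int64_t n, int64_t k,
                      float alpha, float beta, bool threaded)
{
    if (alpha != 1.0f || beta != 0.0f)
        return false;
    if (m <= 0 || n <= 0 || k <= 0)
        return false;

    const int64_t limit = threaded ? DIRECT_LIMIT_THREADED
                                   : DIRECT_LIMIT_SINGLE;
    return m * n * k <= limit;
}
```

With these numbers a 64 x 64 x 64 product (262144) takes the direct path in both cases, while 256 x 256 x 256 (16777216) is above both limits and would fall back to the regular copying path.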

Single-threaded performance results:

M	N	K	cycles	improvement
1	1	1	84.3	2.9x
2	2	2	98.4	2.7x
3	3	3	190.1	1.6x
4	4	4	122.0	2.9x
6	6	6	285.1	2.0x
8	8	8	214.0	2.2x
16	16	16	495.6	2.0x
32	32	32	1756.7	1.7x
48	48	48	6436.2	1.4x
64	64	64	10956.8	1.5x
96	96	96	39499.2	1.3x
128	128	128	87406.9	1.3x
192	192	192	291711.4	1.2x
256	256	256	834721.4	2.0%
512	512	512	6732405.0	2.2%
1024	1024	1024	55072972.0	-1.0%
4	16	9	163.9	2.7x
64	128	192	64114.1	1.4x
37	81	193	45321.8	0.2%
512	412	800	9292010.9	-0.7%
256	1	256	21914.6	2.5x
256	2	256	34736.0	2.3x
256	4	256	67692.1	1.8x
256	8	256	74606.8	1.6x
256	16	256	80248.1	1.6x
256	32	256	98220.7	1.5x
256	64	256	170672.0	1.5x
1	256	256	6987.7	5.6x
2	256	256	7426.5	5.8x
4	256	256	10920.4	3.4x
8	256	256	22304.4	2.3x
16	256	256	43015.7	1.8x
32	256	256	85714.6	1.5x
64	256	256	172102.0	1.3x
256	256	1	18217.9	1.9x
256	256	2	19527.2	1.8x
256	256	4	17520.1	2.2x
256	256	8	25627.7	2.0x
256	256	16	44892.1	1.6x
256	256	32	95202.3	1.3x
256	256	64	181141.2	1.3x
35	8457	1760	84758181.7	-3.8%

martin-frbg

MSVC probably expects __restrict here

fenrus75 · 2018-12-12T23:39:21Z

yup __restrict solves that, but it looks like AppVeyor timed out or something
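
For readers hitting the same MSVC error, one common way to keep the qualifier portable is a small macro. This is a sketch, not the change made in this PR; the macro name `SKETCH_RESTRICT` and the helper `scale_into` are hypothetical.

```c
/* MSVC does not accept the C99 `restrict` keyword in this position but
 * does accept `__restrict`; GCC and Clang accept `__restrict` as well. */
#if defined(_MSC_VER) || defined(__GNUC__) || defined(__clang__)
#  define SKETCH_RESTRICT __restrict
#else
#  define SKETCH_RESTRICT restrict
#endif

/* Hypothetical helper: the qualifier promises the compiler that dst and
 * src do not overlap, which lets it vectorize the loop more freely. */
void scale_into(float *SKETCH_RESTRICT dst,
                const float *SKETCH_RESTRICT src,
                float factor, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}
```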

fenrus75 · 2018-12-13T13:47:32Z

(changed whitespace in the commit message to trigger CI to rerun)

fenrus75 · 2018-12-13T15:46:02Z

hm. now appveyor passes but travis did not, so reverse of before.
with zero changes in code...

martin-frbg · 2018-12-13T16:19:11Z

I can restart travis jobs that had a spurious failure, but apparently I cannot do the same with appveyor.
Both tend to fail occasionally when network issues make the setup process fail or take longer than usual.

fenrus75 mentioned this pull request Dec 12, 2018

OpenBLAS vs ATLAS performance for MXNET #1897

Closed

martin-frbg reviewed Dec 12, 2018

fenrus75 force-pushed the smallmatrix branch from ad9f02c to f3bd567 Compare December 12, 2018 21:25

fenrus75 force-pushed the smallmatrix branch from f3bd567 to cdc668d Compare December 13, 2018 13:45

fenrus75 mentioned this pull request Dec 13, 2018

Performance of dgemm #1840

Open

martin-frbg merged commit 78d877b into OpenMathLib:develop Dec 13, 2018

fenrus75 deleted the smallmatrix branch December 13, 2018 18:54
