
Some thoughts for improving Haswell sgemm performance #2210

Closed
wjc404 opened this issue Aug 11, 2019 · 12 comments

@wjc404
Contributor

wjc404 commented Aug 11, 2019

I tested the SGEMM performance of OpenBLAS (Haswell kernels) against Intel MKL 2018 with large matrices on an i9-9900K. The 1-thread speed of OpenBLAS was 90% of MKL's and the 8-thread speed was 80%.

Using Linux perf I found that the performance penalty was mainly due to private cache misses. However, adding prefetch instructions didn't help. It looks like the problem originates from cache bandwidth.

The current SGEMM kernel for Haswell (KERNEL16x6) reads 10.7 bytes from packed matrix A and 4 bytes from buffered matrix B per CPU cycle, if the 2 FMA units are kept busy all the time. For an i9-9900K at 4 GHz, this means 43 GB/s read from packed A and 16 GB/s read from buffered B. Under a 64-bit Linux OS, the packed matrix A is 1152 kB, so it should sit in the L3 cache; the buffered matrix B occupies 9 kB, so it should stay in the L1 cache.

I previously found (via AIDA64) that the L3 read bandwidth of the 9900K is 315 GB/s under default settings. With 8 threads, the bandwidth required for reading packed A is 341 GB/s at a 4 GHz CPU clock, so the L3 cache becomes a bottleneck.
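
Back-of-the-envelope, the numbers above come out as follows (assuming the current Haswell blocking is SGEMM_DEFAULT_P = 768 and SGEMM_DEFAULT_Q = 384, which is what the 1152 kB and 9 kB buffer sizes correspond to):

KERNEL16x6, one k-iteration: 16*6 = 96 FMAs / (2 FMA units * 8 floats) = 6 cycles
  reads from packed A:   16 floats * 4 B / 6 cycles ~ 10.7 B/cycle -> ~43 GB/s at 4 GHz
  reads from buffered B:  6 floats * 4 B / 6 cycles =  4   B/cycle ->  16 GB/s at 4 GHz
packed A buffer:    768 * 384 * 4 B = 1152 kB (L3-resident)
buffered B panel:   384 *   6 * 4 B =    9 kB (L1-resident)
with 8 threads:     8 * ~43 GB/s ~ 341 GB/s, vs. ~315 GB/s measured L3 read bandwidth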

For serial execution the L3 bandwidth may not be a problem, but a design that reads the larger array at the higher rate is somewhat inappropriate and more likely to run into cache and TLB misses.

I suggest changing the compute kernel to KERNEL8x12 or KERNEL4x24, with modifications to the other parts of the kernel code to match. (That's a huge amount of work, which I can't do all by myself... but it's the only way to reduce the read rate on A.) The icopy and ocopy routines would need corresponding changes.
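
Under the same assumptions (2 FMA units, 8 floats per FMA, 4-byte elements), an MxN micro-kernel reads 64/N bytes of A and 64/M bytes of B per cycle, so widening N shifts traffic from the large L3-resident A buffer onto the small L1-resident B buffer:

KERNEL16x6: A ~ 10.7 B/cycle, B =  4 B/cycle
KERNEL8x12: A ~  5.3 B/cycle, B =  8 B/cycle
KERNEL4x24: A ~  2.7 B/cycle, B = 16 B/cycle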

For example:
.macro KERNEL4x24_SUB
	# load one k-slice of packed B: 24 floats in three 8-wide vectors
	vmovups -24 * SIZE(BO), %ymm1
	vmovups -16 * SIZE(BO), %ymm2
	vmovups  -8 * SIZE(BO), %ymm3
	addq $ 24 * SIZE, BO
	# broadcast 4 consecutive elements of packed A, one at a time,
	# and accumulate into the 4x24 result tile held in ymm4-ymm15
	vbroadcastss -4 * SIZE(AO), %ymm0
	vfmadd231ps %ymm0, %ymm1, %ymm4
	vfmadd231ps %ymm0, %ymm2, %ymm5
	vfmadd231ps %ymm0, %ymm3, %ymm6
	vbroadcastss -3 * SIZE(AO), %ymm0
	vfmadd231ps %ymm0, %ymm1, %ymm7
	vfmadd231ps %ymm0, %ymm2, %ymm8
	vfmadd231ps %ymm0, %ymm3, %ymm9
	vbroadcastss -2 * SIZE(AO), %ymm0
	vfmadd231ps %ymm0, %ymm1, %ymm10
	vfmadd231ps %ymm0, %ymm2, %ymm11
	vfmadd231ps %ymm0, %ymm3, %ymm12
	vbroadcastss -1 * SIZE(AO), %ymm0
	vfmadd231ps %ymm0, %ymm1, %ymm13
	vfmadd231ps %ymm0, %ymm2, %ymm14
	vfmadd231ps %ymm0, %ymm3, %ymm15
	# advance A by 4 floats and count down the K loop
	addq $ 4 * SIZE, AO
	decq %rax
.endm
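
(A minimal sketch, for illustration only, of how such a macro might be driven in the inner K loop; the label is hypothetical, and it relies on the decq %rax at the end of the macro setting the zero flag, with %rax holding the remaining iteration count:)

.L4x24_inner:
	KERNEL4x24_SUB		# ends with decq %rax
	jne .L4x24_inner	# loop until all K iterations are done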

@martin-frbg
Collaborator

Quite convincing, though unfortunately with the current code structure at least the STRMM kernel would need to be rewritten to match as well. Current "Haswell" kernels were written for the original i7-4xxx generation and may even have compared well to a 2013 MKL - unfortunately there was no experienced x86 assembly programmer around to continue where wernsaar left off.
For the multithreaded case possibly the earlier changes to the GEMM workload splitting at the driver level (#1316, #1320) and the more recent increase of its SWITCH_RATIO (#1622) may need revisiting as well.

@brada4
Contributor

brada4 commented Aug 13, 2019

Since you might take this on: there is word alignment at best right now, and the work splitter could handle the head portion so that the rest of the jobs gets aligned to a cache line or a page, whichever is found optimal. One TLB miss is always better than two, especially if multiple NUMA nodes do not share the closer caches.

@wjc404
Contributor Author

wjc404 commented Aug 16, 2019

I tried adjusting SGEMM_DEFAULT_P to 192 and SGEMM_DEFAULT_Q to 256, in order to reduce the size of packed A to 192 kB (so it fits in the L2 cache). After this change, 1-thread performance dropped 1% but 8-thread performance rose 7-8% on the 9900K.
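
For reference, with 4-byte elements that works out to:

packed A:            192 * 256 * 4 B = 192 kB, which fits in the 256 kB per-core L2 of the 9900K
buffered B (16x6):   256 *   6 * 4 B =   6 kB, still comfortably inside L1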

@brada4
Contributor

brada4 commented Aug 16, 2019

Consider Spectre mitigations on task switches and possible virtualisation; try tuning it lower still, then step back at the point where performance starts to drop.

@martin-frbg
Collaborator

Not sure if we really need to worry about Spectre fixes here... a one percent drop in single-thread performance is a fair price to pay for the gain, I guess - and we could even add an ifdef to keep the old values for USE_THREAD=0 builds.
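
Purely as an illustration of that idea (hypothetical, not actual OpenBLAS code: whether SMP is the right guard for distinguishing USE_THREAD=0 builds inside param.h would need checking, and 768/384 are assumed to be the current Haswell defaults), such a conditional could look roughly like:

#ifdef SMP
#define SGEMM_DEFAULT_P 192   /* smaller packed A, fits in L2 for multithreaded runs */
#define SGEMM_DEFAULT_Q 256
#else
#define SGEMM_DEFAULT_P 768   /* keep the original blocking for single-threaded builds */
#define SGEMM_DEFAULT_Q 384
#endif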

@brada4
Contributor

brada4 commented Aug 18, 2019

Not exactly that, just the idea that effective userland cache size is less than on the CPU label because of mandatory predators...

@TiborGY
Contributor

TiborGY commented Aug 19, 2019

because of mandatory predators...

What is a "mandatory predator"?

Anyways, I think if at any point we try to work around CPU vulnerability mitigations (what a clusterfuck), we need to consider that many HPC clusters and similar environments are NOT using any mitigations, to avoid any performance impact. So mitigation workarounds should be optional.

I know for a fact that NONE of the HPC clusters I have access to have any mitigations enabled, not even Meltdown mitigations.

@martin-frbg
Collaborator

@TiborGY ssshh... don't you know you can't join social media without one ?

@TiborGY
Contributor

TiborGY commented Aug 19, 2019

@TiborGY ssshh... don't you know you can't join social media without one ?

Lol, seriously, I haven't got the slightest clue what a mandatory predator is.

@martin-frbg
Collaborator

Guessing brada4's native language has a similar-looking word that means something like "protection" or "workaround". More importantly, I am not aware that any of the microcode fixes ended up reducing the available L2 cache size at all?

@brada4
Contributor

brada4 commented Aug 19, 2019

It is an approximation: bigger code now handles task switching and syscall entry, so for all practical purposes we should assume some cache space is taken by that (we cannot work around the forced full cache flush, though).
That is, I tried to explain the 1% performance loss without giving sufficient introductory context.

@wjc404
Contributor Author

wjc404 commented Dec 11, 2019

I've finished with KERNEL_4x24 and KERNEL_8x12. Simple copy and driver functions were added to the kernel codes for debugging and benchmarking (1-thread). They both perform better than the current OpenBLAS kernel (KERNEL_16x6) on the i9-9900K and R7 3700X. They will be ready to merge when the corresponding strmm and strsm kernels are ready.

1-thread performance on i9-9900K at 4.4 GHz:
MKL 2019:          133 GFLOPS
OpenBLAS (16x6):   118 GFLOPS
KERNEL_4x24:       126 GFLOPS
KERNEL_8x12:       130 GFLOPS
Theoretical peak:  141 GFLOPS
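
For context, the theoretical figure is just clock rate times FMA throughput: 4.4 GHz * 2 FMA units * 8 SP lanes * 2 flops per FMA ~ 140.8 GFLOPS, so KERNEL_8x12 reaches roughly 92% of peak versus about 84% for the current kernel.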
