
Some thoughts for improving Haswell sgemm performance #2210

Closed
wjc404 opened this issue Aug 11, 2019 · 12 comments

@wjc404
Contributor

wjc404 commented Aug 11, 2019

I tested the SGEMM performance of OpenBLAS (Haswell kernels) against Intel MKL 2018 with large matrices on an i9-9900K. The 1-thread speed of OpenBLAS was 90% of MKL's and the 8-thread speed was 80%.

Using Linux perf I found that the performance penalty was mainly due to private cache misses. However, adding prefetch instructions didn't help. It looks like the problem originates from cache bandwidth.

The current SGEMM kernel for Haswell (KERNEL16x6) reads 10.7 bytes from packed matrix A and 4 bytes from buffered matrix B per CPU cycle, if the 2 FMA units are kept busy all the time. For an i9-9900K at 4 GHz, this means 43 GB/s read from packed A and 16 GB/s read from buffered B. Under a 64-bit Linux OS, the packed matrix A is 1152 kB, so it should sit in the L3 cache; the buffered matrix B occupies 9 kB, so it should stay in the L1 cache.

I previously found (via AIDA64) that the L3 read bandwidth of the 9900K is 315 GB/s under default settings. With 8 threads, the bandwidth required for reading packed A is 341 GB/s at a 4 GHz CPU clock, so the L3 cache becomes a bottleneck.
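
Back-of-the-envelope, the numbers above come out as follows (assuming the current Haswell blocking is SGEMM_DEFAULT_P = 768 and SGEMM_DEFAULT_Q = 384, which is what the 1152 kB and 9 kB buffer sizes correspond to):

KERNEL16x6, one k-iteration: 16*6 = 96 FMAs / (2 FMA units * 8 floats) = 6 cycles
  reads from packed A:   16 floats * 4 B / 6 cycles ~ 10.7 B/cycle -> ~43 GB/s at 4 GHz
  reads from buffered B:  6 floats * 4 B / 6 cycles =  4   B/cycle ->  16 GB/s at 4 GHz
packed A buffer:    768 * 384 * 4 B = 1152 kB (L3-resident)
buffered B panel:   384 *   6 * 4 B =    9 kB (L1-resident)
with 8 threads:     8 * ~43 GB/s ~ 341 GB/s, vs. ~315 GB/s measured L3 read bandwidth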

For serial execution the L3 bandwidth may not be a problem, but a design that reads the larger array at the higher rate is somewhat inappropriate and more likely to run into cache and TLB misses.

I suggest changing the compute kernel to KERNEL8x12 or KERNEL4x24, with modifications to the other parts of the kernel code to match. (That's a huge amount of work, which I can't do all by myself... but it's the only way to reduce the read rate on A.) The icopy and ocopy routines would need corresponding changes.
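
Under the same assumptions (2 FMA units, 8 floats per FMA, 4-byte elements), an MxN micro-kernel reads 64/N bytes of A and 64/M bytes of B per cycle, so widening N shifts traffic from the large L3-resident A buffer onto the small L1-resident B buffer:

KERNEL16x6: A ~ 10.7 B/cycle, B =  4 B/cycle
KERNEL8x12: A ~  5.3 B/cycle, B =  8 B/cycle
KERNEL4x24: A ~  2.7 B/cycle, B = 16 B/cycle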

For example:
.macro KERNEL4x24_SUB
	# load one k-slice of packed B: 24 floats in three 8-wide vectors
	vmovups -24 * SIZE(BO), %ymm1
	vmovups -16 * SIZE(BO), %ymm2
	vmovups  -8 * SIZE(BO), %ymm3
	addq $ 24 * SIZE, BO
	# broadcast 4 consecutive elements of packed A, one at a time,
	# and accumulate into the 4x24 result tile held in ymm4-ymm15
	vbroadcastss -4 * SIZE(AO), %ymm0
	vfmadd231ps %ymm0, %ymm1, %ymm4
	vfmadd231ps %ymm0, %ymm2, %ymm5
	vfmadd231ps %ymm0, %ymm3, %ymm6
	vbroadcastss -3 * SIZE(AO), %ymm0
	vfmadd231ps %ymm0, %ymm1, %ymm7
	vfmadd231ps %ymm0, %ymm2, %ymm8
	vfmadd231ps %ymm0, %ymm3, %ymm9
	vbroadcastss -2 * SIZE(AO), %ymm0
	vfmadd231ps %ymm0, %ymm1, %ymm10
	vfmadd231ps %ymm0, %ymm2, %ymm11
	vfmadd231ps %ymm0, %ymm3, %ymm12
	vbroadcastss -1 * SIZE(AO), %ymm0
	vfmadd231ps %ymm0, %ymm1, %ymm13
	vfmadd231ps %ymm0, %ymm2, %ymm14
	vfmadd231ps %ymm0, %ymm3, %ymm15
	# advance A by 4 floats and count down the K loop
	addq $ 4 * SIZE, AO
	decq %rax
.endm
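
(A minimal sketch, for illustration only, of how such a macro might be driven in the inner K loop; the label is hypothetical, and it relies on the decq %rax at the end of the macro setting the zero flag, with %rax holding the remaining iteration count:)

.L4x24_inner:
	KERNEL4x24_SUB		# ends with decq %rax
	jne .L4x24_inner	# loop until all K iterations are done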

@martin-frbg
Collaborator

Quite convincing, though unfortunately with the current code structure at least the STRMM kernel would need to be rewritten to match as well. Current "Haswell" kernels were written for the original i7-4xxx generation and may even have compared well to a 2013 MKL - unfortunately there was no experienced x86 assembly programmer around to continue where wernsaar left off.
For the multithreaded case possibly the earlier changes to the GEMM workload splitting at the driver level (#1316, #1320) and the more recent increase of its SWITCH_RATIO (#1622) may need revisiting as well.

@brada4
Contributor

brada4 commented Aug 13, 2019

Since you might take this on: there is word alignment at best right now, and the work splitter could handle the head portion so that the rest of the jobs gets aligned to a cache line or a page, whichever is found optimal. One TLB miss is always better than two, especially if multiple NUMA nodes do not share the closer caches.

@wjc404
Contributor Author

wjc404 commented Aug 16, 2019

I tried adjusting SGEMM_DEFAULT_P to 192 and SGEMM_DEFAULT_Q to 256, in order to reduce the size of packed A to 192 kB (so it fits in the L2 cache). After this change, 1-thread performance dropped 1% but 8-thread performance rose 7-8% on the 9900K.
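
For reference, with 4-byte elements that works out to:

packed A:            192 * 256 * 4 B = 192 kB, which fits in the 256 kB per-core L2 of the 9900K
buffered B (16x6):   256 *   6 * 4 B =   6 kB, still comfortably inside L1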

@brada4
Contributor

brada4 commented Aug 16, 2019

Consider Spectre mitigations on task switches and possible virtualisation; try tuning it lower still, then step back at the point where performance starts to drop.

@martin-frbg
Collaborator

Not sure if we really need to worry about Spectre fixes here... a one percent drop in single-thread performance is a fair price to pay for the gain, I guess - and we could even add an ifdef to keep the old values for USE_THREAD=0 builds.
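
Purely as an illustration of that idea (hypothetical, not actual OpenBLAS code: whether SMP is the right guard for distinguishing USE_THREAD=0 builds inside param.h would need checking, and 768/384 are assumed to be the current Haswell defaults), such a conditional could look roughly like:

#ifdef SMP
#define SGEMM_DEFAULT_P 192   /* smaller packed A, fits in L2 for multithreaded runs */
#define SGEMM_DEFAULT_Q 256
#else
#define SGEMM_DEFAULT_P 768   /* keep the original blocking for single-threaded builds */
#define SGEMM_DEFAULT_Q 384
#endif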

@brada4
Contributor

brada4 commented Aug 18, 2019

Not exactly that, just the idea that effective userland cache size is less than on the CPU label because of mandatory predators...

@TiborGY
Contributor

TiborGY commented Aug 19, 2019

because of mandatory predators...

What is a "mandatory predator"?

Anyways, I think if at any point we try to work around CPU vulnerability mitigations (what a clusterfuck), we need to consider that many HPC clusters and similar environments are NOT using any mitigations, to avoid any performance impact. So mitigation workarounds should be optional.

I know for a fact that NONE of the HPC clusters I have access to have any mitigations enabled, not even Meltdown mitigations.

@martin-frbg
Collaborator

@TiborGY ssshh... don't you know you can't join social media without one ?

@TiborGY
Contributor

TiborGY commented Aug 19, 2019

@TiborGY ssshh... don't you know you can't join social media without one ?

Lol, seriously, I haven't got the slightest clue what a mandatory predator is.

@martin-frbg
Collaborator

Guessing brada4's native language has a similar-looking word that means something like "protection" or "workaround". More importantly, I am not aware that any of the microcode fixes ended up reducing the available L2 cache size at all?

@brada4
Contributor

brada4 commented Aug 19, 2019

It is an approximation: bigger code now handles task switching and syscall entry, so for all practical purposes we should assume some cache space is taken by that (we cannot work around the forced full cache flush, though).
That is, I tried to explain the 1% performance loss without giving sufficient introductory context.

@wjc404
Contributor Author

wjc404 commented Dec 11, 2019

I've finished with KERNEL_4x24 and KERNEL_8x12. Simple copy and driver functions were added to the kernel codes for debugging and benchmarking (1-thread). They both perform better than the current OpenBLAS kernel (KERNEL_16x6) on the i9-9900K and R7 3700X. They will be ready to merge when the corresponding strmm and strsm kernels are ready.

1-thread performance on i9-9900K at 4.4 GHz:
MKL 2019:          133 GFLOPS
OpenBLAS (16x6):   118 GFLOPS
KERNEL_4x24:       126 GFLOPS
KERNEL_8x12:       130 GFLOPS
Theoretical peak:  141 GFLOPS
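
For context, the theoretical figure is just clock rate times FMA throughput: 4.4 GHz * 2 FMA units * 8 SP lanes * 2 flops per FMA ~ 140.8 GFLOPS, so KERNEL_8x12 reaches roughly 92% of peak versus about 84% for the current kernel.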
