Some thoughts for improving Haswell sgemm performance #2210
Comments
Quite convincing, though unfortunately, with the current code structure, at least the STRMM kernel would need to be rewritten to match as well. The current "Haswell" kernels were written for the original i7-4xxx generation and may even have compared well to a 2013 MKL - unfortunately there was no experienced x86 assembly programmer around to continue where wernsaar left off.
Since you might take this on: currently there is word alignment at best, and the work splitter could handle the head portion so that the rest of the jobs gets aligned to a cache line or page, whichever is found optimal. One TLB miss is always better than two, especially when multiple NUMA nodes do not share the closer caches.
I tried adjusting SGEMM_DEFAULT_P to 192 and SGEMM_DEFAULT_Q to 256, in order to reduce the size of packed A to 192 kB (so that it fits in the L2 cache). After this change, 1-thread performance dropped 1% but 8-thread performance rose 7-8% on a 9900K.
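For reference, the arithmetic behind the 192 kB figure (my own estimate, assuming the packed A panel occupies about P * Q * sizeof(float) bytes): 192 * 256 * 4 B = 192 kB, which fits within a 9900K core's 256 kB private L2, whereas the defaults implied by the 1152 kB figure in the issue text below (P = 768, Q = 384) give 768 * 384 * 4 B = 1152 kB, which can only sit in L3.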
Consider Spectre mitigations on task switches, and eventual virtualisation: try tuning the value even lower, then step back at the point where performance starts to drop.
Not sure if we really need to worry about Spectre fixes here... a one percent drop in single-thread performance is a fair price to pay for the gain, I guess - and we could even add an ifdef to keep the old values for USE_THREAD=0 builds.
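A minimal sketch of that ifdef idea, assuming the Haswell block of param.h and the SMP guard that threaded builds define (the guard name and placement are illustrative, not the actual source; the "old" values are the ones implied by the buffer sizes quoted in this issue):

#if defined(SMP)
/* threaded build: smaller blocking so the packed A panel
   (192 * 256 * 4 B = 192 kB) fits one core's 256 kB L2 */
#define SGEMM_DEFAULT_P 192
#define SGEMM_DEFAULT_Q 256
#else
/* USE_THREAD=0 build: keep the old values tuned for 1-thread speed */
#define SGEMM_DEFAULT_P 768
#define SGEMM_DEFAULT_Q 384
#endif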
Not exactly that, just the idea that the effective userland cache size is less than what is on the CPU label, because of mandatory predators...
What is a "mandatory predator"? Anyways, I think if at any point we try to work around CPU vulnerability mitigations (what a clusterfuck), we need to consider that many HPC clusters and similar environments are NOT using any mitigations, to avoid any performance impact. So mitigation workarounds should be optional. I know for a fact that NONE of the HPC clusters I have access to have any mitigations enabled, not even Meltdown mitigations.
@TiborGY ssshh... don't you know you can't join social media without one?
Lol, seriously, I haven't got the slightest clue what a mandatory predator is.
Guessing brada4's native language has a similar-looking word that means something like "protection" or "workaround". More importantly, I am not aware that any of the microcode fixes ended up reducing the available L2 cache size at all?
It is an approximation: bigger code now handles task switching and syscall entry. For all practical purposes we should assume some cache space is lost to that (we cannot work around a forced full cache flush, though).
I've finished with KERNEL_4x24 and KERNEL_8x12. Simple copy and driver functions were added to the kernel codes for debugging and benchmarking (1-thread). They both perform better than the current OpenBLAS kernel (KERNEL_16x6) on the i9 9900K and R7 3700X. They will be ready to merge when the corresponding strmm and strsm kernels are ready. 1-thread performance on i9-9900K at 4.4 GHz:
I tested the SGEMM performance of OpenBLAS (Haswell kernels) against Intel MKL (2018) with large matrices on an i9 9900K. The 1-thread speed of OpenBLAS was 90% of MKL's and the 8-thread speed was 80%.
Using Linux perf, I discovered that the performance penalty was mainly due to private cache misses; however, adding prefetch instructions didn't help. It looks like the problem originates from cache bandwidth limits.
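For reference, this is the kind of measurement I mean - a typical perf invocation (the benchmark binary name here is just a placeholder, and the exact event names available vary by kernel and CPU):

perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses ./sgemm_bench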
The current SGEMM kernel for Haswell (KERNEL_16x6) reads 10.7 bytes from packed matrix A and 4 bytes from buffered matrix B per CPU cycle, if the two FMA units are kept busy all the time. For an i9-9900K at 4 GHz, this means 43 GB/s read from packed A and 16 GB/s read from buffered B. Under a 64-bit Linux OS, the packed matrix A is 1152 kB, so it can at best sit in the L3 cache; buffered matrix B occupies 9 kB, so it should stay in the L1 cache.
I previously found (via AIDA64) that the L3 read bandwidth of the 9900K is 315 GB/s under default settings. With 8 threads, the required bandwidth for reading packed A is 341 GB/s at a 4 GHz CPU clock, so the L3 cache becomes a bottleneck.
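For anyone who wants to reproduce these figures, here is the back-of-the-envelope arithmetic as a small standalone C program (my own sketch; the constants simply mirror the kernel shape and clock quoted above, nothing is taken from the OpenBLAS source):

#include <stdio.h>

int main(void) {
    /* KERNEL_16x6 does (16*6)/8 = 12 ymm FMAs per k-iteration;
       with 2 FMA units that is a minimum of 6 cycles per iteration */
    const double cycles  = (16.0 * 6.0 / 8.0) / 2.0;
    const double a_bytes = 16 * 4;  /* 16 floats of packed A per iteration */
    const double b_bytes =  6 * 4;  /*  6 floats of buffered B per iteration */
    const double ghz = 4.0, threads = 8.0;

    printf("A: %.1f B/cycle, %.0f GB/s per core, %.0f GB/s on 8 threads\n",
           a_bytes / cycles, a_bytes / cycles * ghz,
           a_bytes / cycles * ghz * threads);    /* prints 10.7, 43, 341 */
    printf("B: %.1f B/cycle, %.0f GB/s per core\n",
           b_bytes / cycles, b_bytes / cycles * ghz);    /* prints 4.0, 16 */
    return 0;
}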
For serial execution, the bandwidth of L3 may not be a problem, but a design that reads the larger array at the faster rate is somewhat backwards, and more likely to run into cache and TLB misses.
I suggest changing the calculation kernel to KERNEL8x12 or KERNEL4x24, with modifications to the other parts of the kernel code to fit it. (That is a huge amount of work, which I can't do all by myself... but it's the only way to slow down the reads from A; see the byte-rate comparison after the example below.) The icopy and ocopy routines need corresponding changes.
For example:
.macro KERNEL4x24_SUB
	/* load one 24-wide row of packed B (24 floats across ymm1-ymm3) */
	vmovups	-24 * SIZE(BO), %ymm1
	vmovups	-16 * SIZE(BO), %ymm2
	vmovups	 -8 * SIZE(BO), %ymm3
	addq	$ 24 * SIZE, BO
	/* broadcast each of the 4 A elements in turn and accumulate a
	   4x24 block of C in ymm4-ymm15 (12 FMAs per k-iteration) */
	vbroadcastss	-4 * SIZE(AO), %ymm0
	vfmadd231ps	%ymm0, %ymm1, %ymm4
	vfmadd231ps	%ymm0, %ymm2, %ymm5
	vfmadd231ps	%ymm0, %ymm3, %ymm6
	vbroadcastss	-3 * SIZE(AO), %ymm0
	vfmadd231ps	%ymm0, %ymm1, %ymm7
	vfmadd231ps	%ymm0, %ymm2, %ymm8
	vfmadd231ps	%ymm0, %ymm3, %ymm9
	vbroadcastss	-2 * SIZE(AO), %ymm0
	vfmadd231ps	%ymm0, %ymm1, %ymm10
	vfmadd231ps	%ymm0, %ymm2, %ymm11
	vfmadd231ps	%ymm0, %ymm3, %ymm12
	vbroadcastss	-1 * SIZE(AO), %ymm0
	vfmadd231ps	%ymm0, %ymm1, %ymm13
	vfmadd231ps	%ymm0, %ymm2, %ymm14
	vfmadd231ps	%ymm0, %ymm3, %ymm15
	addq	$ 4 * SIZE, AO
	decq	%rax	/* one k-iteration done */
.endm
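Applying the same per-iteration accounting to the proposed shapes (my own arithmetic, same method as the 10.7 B/cycle figure above): KERNEL4x24 still issues (4 * 24)/8 = 12 FMAs per k-step, i.e. 6 cycles with two FMA units, but it reads only 16 bytes of A per step - about 2.7 B/cycle, roughly 11 GB/s per core at 4 GHz and about 85 GB/s across 8 threads, comfortably below the measured 315 GB/s L3 read bandwidth. The heavier stream, 96 bytes of B per step (16 B/cycle), moves to the vector-loaded operand. KERNEL8x12 lands in between, with 32 bytes of A per step, about 5.3 B/cycle.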