Update "dgemm_kernel_4x8_haswell.S" for improving performance on zen2 chips #2186
replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128
Thank you - unfortunately I did not get around to running any benchmarks on my hardware yet, but this looks great. (And if necessary at any time, it would be no problem to create Zen-specific kernels - the TARGET is already there with few differences in the parameters, it just happens to reuse Haswell kernels for now.)
In addition, I modified some prefetcht0 instructions in the dgemm kernel code to reduce LLC misses, and observed an additional 0.4 GFLOPS improvement on the 9900K (1 thread) after this change.
Adjusting the value of the macro B_PRI can further reduce L1D cache misses (CYCLES_STALL_L1D_MISS decreased by 20-30 Mcyc/s on the 9900K, as reported by Linux perf).
Pity there is no documentation of why the original values were chosen (though it could be as simple as something copied from an earlier-generation CPU kernel).
Some prefetcht0 instructions were added to the macro SAVE4x12 to prefetch the head of the B block matrix buffer, because access to the buffer jumps back to its head after SAVE4x12 completes (the hardware prefetcher may not be clever enough to predict that). The single-thread speed improved by 0.2-0.3 GFLOPS after the change.
The prefetch code dealing with elements of matrix C was further optimized, which gained another 0.2 GFLOPS.
Have you tested the impact of this change on Zen2?
After those prefetch optimizations, the performance on Zen2 increased by 0.5 GFLOPS (1 thread).
Sounds great, thank you for your contribution!
Sorry don't mean to hijack this thread. How did you get it to compile? I'm following instructions in (https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio), on Ryzen 3700x and I get the following error:
After a bit of googling, this error seems to be related to new CPUs (#1018). [Edit] I'm building off a git checkout of v0.3.6. [Edit 2] Ok found [Edit 3] Never mind, looks like it's in the works: commit 0ba29fd
@stevenwong I think 0.3.6 should be able to autodetect your CPU (family 15, extended family 8, model 1 or 8 for any generation of Zen, if https://en.wikichip.org/wiki/amd/cpuid is to be trusted). Possibly something went wrong with the cpuid call in getarch. Which variant of the build methods described on the wiki page did you use?
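For readers unfamiliar with how the family and model numbers quoted above are packed, here is a minimal sketch of the CPUID leaf 1 EAX layout. It is not OpenBLAS's getarch code: the helper names `encode_signature` and `decode_signature` are hypothetical, and the sketch follows the AMD convention that the extended fields only apply when the base family is 15.

```python
def encode_signature(base_family, ext_family, base_model, ext_model=0, stepping=0):
    """Pack the fields into the CPUID leaf 1 EAX layout (hypothetical helper)."""
    return ((ext_family << 20) | (ext_model << 16) |
            (base_family << 8) | (base_model << 4) | stepping)


def decode_signature(eax):
    """Unpack EAX into the displayed family, model and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    if base_family == 0xF:
        # Extended fields are only combined in when the base family is 15.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family, model = base_family, base_model
    return {"family": family, "model": model, "stepping": stepping}


# Values quoted in the comment: family 15, extended family 8, model 1.
signature = encode_signature(base_family=15, ext_family=8, base_model=1)
info = decode_signature(signature)
print(hex(signature), hex(info["family"]), hex(info["model"]))
```

Base family 15 plus extended family 8 gives a displayed family of 23 (0x17), which is the number a detection routine would compare against.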
@martin-frbg Method 1, Native (MSVC) ABI, on Windows 10 with the VC++ 2015 build tools. [Edit] Rather than building it in miniconda/base, I've created an empty environment in my regular Anaconda install and done everything in there. I've added the environment to Actually, the output suggests there may have been an issue with the compiler. Output and error logs here: https://gist.github.com/stevenwong/56fbd1fe33ebc6f58e301b4602c07bea
The logs look normal; cmake is just trying different option dialects to identify the compiler and assembler before it even starts to process the OpenBLAS files. I wonder where the config.h in your root directory came from?
I previously tested the dgemm performance of OpenBLAS on a Ryzen 7 3700X CPU (see my comment on issue #2180) and got only 70% of theoretical peak performance, in contrast to 90% on the i9-9900K.
Through controlled modification of the kernel code, I found that the performance loss was mainly due to the high latency and low throughput of the vpermpd instruction on Zen2 (see the comments on issue #2180 for details).
As a result, I replaced most vpermpd instructions in the file "dgemm_kernel_4x8_haswell.S" with much cheaper instructions such as vpermilpd and vperm2f128.
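To illustrate why such a substitution can leave results unchanged, here is a minimal sketch in Python that models the lane semantics of the three instructions on a 4-element list (a stand-in for a 256-bit register holding four doubles). The immediates 0xB1, 0x4E and 0x1B are chosen for illustration only and are not claimed to be the exact ones touched in the kernel; the Python function names simply mirror the mnemonics.

```python
def vpermpd(src, imm):
    """Cross-lane permute: element i is selected by bits [2i+1:2i] of imm."""
    return [src[(imm >> (2 * i)) & 3] for i in range(4)]


def vpermilpd(src, imm):
    """In-lane permute: bit i of imm picks the low or high double of the
    128-bit half that element i lives in."""
    return [src[(i & 2) + ((imm >> i) & 1)] for i in range(4)]


def vperm2f128(src1, src2, imm):
    """Build each 128-bit half of the result from a half of src1 or src2."""
    halves = [src1[0:2], src1[2:4], src2[0:2], src2[2:4]]

    def pick(ctrl):
        # Bit 3 of each 4-bit control field zeroes that half.
        return [0.0, 0.0] if ctrl & 0x8 else halves[ctrl & 3]

    return pick(imm & 0xF) + pick((imm >> 4) & 0xF)


a = [0.0, 1.0, 2.0, 3.0]

# Swap neighbours inside each half: one vpermilpd replaces vpermpd 0xB1.
swap_pairs = vpermilpd(a, 0x5)

# Swap the two 128-bit halves: one vperm2f128 replaces vpermpd 0x4E.
swap_halves = vperm2f128(a, a, 0x01)

# Full reversal (vpermpd 0x1B) takes both cheaper instructions in sequence.
reverse = vperm2f128(swap_pairs, swap_pairs, 0x01)

print(vpermpd(a, 0xB1) == swap_pairs,
      vpermpd(a, 0x4E) == swap_halves,
      vpermpd(a, 0x1B) == reverse)
```

The model only checks that the data movement is identical; whether the replacement is actually faster depends on the instruction latencies and throughput of the target CPU, as discussed in issue #2180.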
After modifying "dgemm_kernel_4x8_haswell.S" and recompiling, the single-thread performance of dgemm on the R7 3700X (turbo boost disabled, 3.6 GHz) improved substantially, from 40-41 GFLOPS to 51-52 GFLOPS (90% of theoretical peak), with no loss in the quality of results (the test program and terminal outputs are in the attachment).
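The percentages quoted here can be checked with a short calculation. The sketch below assumes a per-core peak of 16 double-precision FLOPs per cycle (two 256-bit FMA units, four doubles per vector, two FLOPs per fused multiply-add); that figure is an assumption not stated in the thread, although it is consistent with the quoted 70% and 90%.

```python
CLOCK_GHZ = 3.6          # from the description: turbo boost disabled
FMA_UNITS = 2            # assumption: two 256-bit FMA pipes per core
DOUBLES_PER_VECTOR = 4   # 256 bits / 64 bits per double
FLOPS_PER_FMA = 2        # one multiply plus one add

# Theoretical single-core peak in GFLOPS under the assumptions above.
peak_gflops = CLOCK_GHZ * FMA_UNITS * DOUBLES_PER_VECTOR * FLOPS_PER_FMA


def efficiency(measured_gflops):
    """Fraction of the assumed theoretical peak that was reached."""
    return measured_gflops / peak_gflops


for label, measured in [("before", 40.0), ("before", 41.0),
                        ("after", 51.0), ("after", 52.0)]:
    print(f"{label}: {measured:.0f} GFLOPS = {efficiency(measured):.1%} of peak")
```

Under these assumptions the peak is about 57.6 GFLOPS, so 40-41 GFLOPS is roughly 70% of peak and 51-52 GFLOPS is roughly 90%.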
test_results.zip