Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update "dgemm_kernel_4x8_haswell.S" for improving performance on zen2 chips #2186

Merged
merged 7 commits into from
Jul 18, 2019

Conversation

wjc404
Copy link
Contributor

@wjc404 wjc404 commented Jul 16, 2019

I tested dgemm performance of OpenBLAS previously on a ryzen 7 3700X cpu (see my comment on issue #2180) and got only 70% theoretical performance, in contrast to 90% on i9-9900K.
Through controlled modification of kernel codes, I found the performance loss was mainly due to the high latency and low throughput of vpermpd instruction on zen2 (see the comments on issue #2180 for details).
As a result I replaced most vpermpd instructions in the file "dgemm_kernel_4x8_haswell.S", with much cheaper instructions like vpermilpd and vperm2f128.
After the modification of "dgemm_kernel_4x8_haswell.S" and recompilation, the single-thread performance of dgemm on r7-3700x (turbo boost disabled, 3.6 GHz) improved substantially, from 40-41 GFLOPS to 51-52 GFLOPS (90% theoretical), without loss of the quality of results (test program and terminal outputs are in the attachment).
test_results.zip

wjc404 added 2 commits July 17, 2019 00:46
replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128
@wjc404
Copy link
Contributor Author

wjc404 commented Jul 17, 2019

The performance on i9-9900K (1 thread, 4.4 GHz) slightly improved after this update.
Screenshot from 2019-07-17 16-36-32

@martin-frbg
Copy link
Collaborator

Thank you - unfortunately I did not get around to running any benchmarks on my hardware yet, but this looks great. (And if necessary at any time, it would be no problem to create Zen-specific kernels - the TARGET is already there with few differences in the parameters, it just happens to reuse Haswell kernels for now.)

@wjc404
Copy link
Contributor Author

wjc404 commented Jul 17, 2019

In addition, I modified some prefetcht0 instructions in the dgemm kernel code to reduce LLC misses, observed an additional 0.4GFLOPS improvement on 9900K (1thread) after this change.

@wjc404
Copy link
Contributor Author

wjc404 commented Jul 17, 2019

Adjusting the value of macro B_PRI can further reduce L1D cache misses (CYCLES_STALL_L1D_MISS decreased by 20-30Mcyc/s on 9900K as indicated by linux perf).

@martin-frbg
Copy link
Collaborator

Pity there is no documentation why the original values were chosen (though it could be as simple as something copied from an earlier generation cpu kernel)

@wjc404
Copy link
Contributor Author

wjc404 commented Jul 17, 2019

Some prefetcht0 instructions were added to the macro SAVE4x12 to prefetch the head part of b block matrix buffer as the access to the buffer will jump to its head (hardware prefetcher may not be cleaver enough to predict that) after SAVE4x12 completes. The single-thread speed improved by 0.2-0.3GFLOPS after the change.

@wjc404
Copy link
Contributor Author

wjc404 commented Jul 17, 2019

The prefetch code dealing with elements of matrix C was further optimized, gained another 0.2GFLOPS improvement.

@TiborGY
Copy link
Contributor

TiborGY commented Jul 18, 2019

In addition, I modified some prefetcht0 instructions in the dgemm kernel code to reduce LLC misses, observed an additional 0.4GFLOPS improvement on 9900K (1thread) after this change.

Have you tested the impact of this change on Zen2?

@wjc404
Copy link
Contributor Author

wjc404 commented Jul 18, 2019

After those optimizations of prefetch , the performance on zen2 increased by 0.5 GFLOPS (1 thread).

@TiborGY
Copy link
Contributor

TiborGY commented Jul 18, 2019

Sounds great, thank you for your contribution!

@martin-frbg martin-frbg added this to the 0.3.7 milestone Jul 18, 2019
@martin-frbg martin-frbg merged commit b0b7600 into OpenMathLib:develop Jul 18, 2019
@stevenwong
Copy link

stevenwong commented Jul 23, 2019

Sorry don't mean to hijack this thread. How did you get it to compile?

I'm following instructions in (https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio), on Ryzen 3700x and I get the following error:

  C:\Users\Steven\Documents\OpenBLAS\getarch_2nd.c(12,35): error: use of
  undeclared identifier 'SGEMM_DEFAULT_UNROLL_M'

      printf("SGEMM_UNROLL_M=%d\n", SGEMM_DEFAULT_UNROLL_M);
                                    ^

  C:\Users\Steven\Documents\OpenBLAS\getarch_2nd.c(13,35): error: use of
  undeclared identifier 'SGEMM_DEFAULT_UNROLL_N'

      printf("SGEMM_UNROLL_N=%d\n", SGEMM_DEFAULT_UNROLL_N);
                                    ^

After a bit of googling, this error seems to be related to new CPUs #1018

[Edit] I'm building off a git checkout of v0.3.6

[Edit 2] Ok found TargetList.txt, seems like I need to set TARGET=ZEN

[Edit 3] Nevermind, looks like it's in the works Commit 0ba29fd

@martin-frbg
Copy link
Collaborator

@stevenwong I think 0.3.6 should be able to autodetect your cpu (family 15, extended family 8, model 1 or 8 for any generation ZEN if https://en.wikichip.org/wiki/amd/cpuid is to be trusted). Possibly something went wrong with the cpuid call in getarch, which variant of the build methods described on the wiki page did you use ?

@stevenwong
Copy link

stevenwong commented Jul 23, 2019

@martin-frbg Method 1 Native (MSVC) ABI, WIndows 10 and VC++ 2015 build tools.

[Edit]
Ok, following the instructions in #1459 and removing config.h in the root directory fixed it.
[\Edit]

Rather than building it in miniconda/base, I've created an empty environment in my regular Anaconda install and done everything in there. I've added the environment to LIB and CPATH.

Actually, the output suggest there may have been an issue with the compiler. Output and error logs here:

https://gist.github.com/stevenwong/56fbd1fe33ebc6f58e301b4602c07bea
https://gist.github.com/stevenwong/4ba1f89c6d86b3bc9a03e5df44441c29

@martin-frbg
Copy link
Collaborator

The logs look normal, just that cmake is trying different option dialects to identify the compiler and assembler before it even starts to process the OpenBLAS files. I wonder where the config.h in your root directory came from ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants