Update "dgemm_kernel_4x8_haswell.S" for improving performance on zen2 chips #2186

wjc404 · 2019-07-16T17:17:07Z

I tested dgemm performance of OpenBLAS previously on a ryzen 7 3700X cpu (see my comment on issue #2180) and got only 70% theoretical performance, in contrast to 90% on i9-9900K.
Through controlled modification of kernel codes, I found the performance loss was mainly due to the high latency and low throughput of vpermpd instruction on zen2 (see the comments on issue #2180 for details).
As a result I replaced most vpermpd instructions in the file "dgemm_kernel_4x8_haswell.S", with much cheaper instructions like vpermilpd and vperm2f128.
After the modification of "dgemm_kernel_4x8_haswell.S" and recompilation, the single-thread performance of dgemm on r7-3700x (turbo boost disabled, 3.6 GHz) improved substantially, from 40-41 GFLOPS to 51-52 GFLOPS (90% theoretical), without loss of the quality of results (test program and terminal outputs are in the attachment).
test_results.zip

replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128

wjc404 · 2019-07-17T08:38:03Z

The performance on i9-9900K (1 thread, 4.4 GHz) slightly improved after this update.

martin-frbg · 2019-07-17T08:49:34Z

Thank you - unfortunately I did not get around to running any benchmarks on my hardware yet, but this looks great. (And if necessary at any time, it would be no problem to create Zen-specific kernels - the TARGET is already there with few differences in the parameters, it just happens to reuse Haswell kernels for now.)

wjc404 · 2019-07-17T09:16:58Z

In addition, I modified some prefetcht0 instructions in the dgemm kernel code to reduce LLC misses, observed an additional 0.4GFLOPS improvement on 9900K (1thread) after this change.

wjc404 · 2019-07-17T13:41:28Z

Adjusting the value of macro B_PRI can further reduce L1D cache misses (CYCLES_STALL_L1D_MISS decreased by 20-30Mcyc/s on 9900K as indicated by linux perf).

martin-frbg · 2019-07-17T13:51:58Z

Pity there is no documentation why the original values were chosen (though it could be as simple as something copied from an earlier generation cpu kernel)

wjc404 · 2019-07-17T14:42:16Z

Some prefetcht0 instructions were added to the macro SAVE4x12 to prefetch the head part of b block matrix buffer as the access to the buffer will jump to its head (hardware prefetcher may not be cleaver enough to predict that) after SAVE4x12 completes. The single-thread speed improved by 0.2-0.3GFLOPS after the change.

wjc404 · 2019-07-17T16:00:22Z

The prefetch code dealing with elements of matrix C was further optimized, gained another 0.2GFLOPS improvement.

TiborGY · 2019-07-18T08:14:59Z

In addition, I modified some prefetcht0 instructions in the dgemm kernel code to reduce LLC misses, observed an additional 0.4GFLOPS improvement on 9900K (1thread) after this change.

Have you tested the impact of this change on Zen2?

wjc404 · 2019-07-18T08:45:58Z

After those optimizations of prefetch , the performance on zen2 increased by 0.5 GFLOPS (1 thread).

TiborGY · 2019-07-18T09:40:23Z

Sounds great, thank you for your contribution!

stevenwong · 2019-07-23T11:53:40Z

Sorry don't mean to hijack this thread. How did you get it to compile?

I'm following instructions in (https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio), on Ryzen 3700x and I get the following error:

  C:\Users\Steven\Documents\OpenBLAS\getarch_2nd.c(12,35): error: use of
  undeclared identifier 'SGEMM_DEFAULT_UNROLL_M'

      printf("SGEMM_UNROLL_M=%d\n", SGEMM_DEFAULT_UNROLL_M);
                                    ^

  C:\Users\Steven\Documents\OpenBLAS\getarch_2nd.c(13,35): error: use of
  undeclared identifier 'SGEMM_DEFAULT_UNROLL_N'

      printf("SGEMM_UNROLL_N=%d\n", SGEMM_DEFAULT_UNROLL_N);
                                    ^

After a bit of googling, this error seems to be related to new CPUs #1018

[Edit] I'm building off a git checkout of v0.3.6

[Edit 2] Ok found TargetList.txt, seems like I need to set TARGET=ZEN

[Edit 3] Nevermind, looks like it's in the works Commit 0ba29fd

martin-frbg · 2019-07-23T12:55:42Z

@stevenwong I think 0.3.6 should be able to autodetect your cpu (family 15, extended family 8, model 1 or 8 for any generation ZEN if https://en.wikichip.org/wiki/amd/cpuid is to be trusted). Possibly something went wrong with the cpuid call in getarch, which variant of the build methods described on the wiki page did you use ?

stevenwong · 2019-07-23T13:14:10Z

@martin-frbg Method 1 Native (MSVC) ABI, WIndows 10 and VC++ 2015 build tools.

[Edit]
Ok, following the instructions in #1459 and removing config.h in the root directory fixed it.
[\Edit]

Rather than building it in miniconda/base, I've created an empty environment in my regular Anaconda install and done everything in there. I've added the environment to LIB and CPATH.

Actually, the output suggest there may have been an issue with the compiler. Output and error logs here:

https://gist.github.com/stevenwong/56fbd1fe33ebc6f58e301b4602c07bea
https://gist.github.com/stevenwong/4ba1f89c6d86b3bc9a03e5df44441c29

martin-frbg · 2019-07-23T15:25:18Z

The logs look normal, just that cmake is trying different option dialects to identify the compiler and assembler before it even starts to process the OpenBLAS files. I wonder where the config.h in your root directory came from ?

wjc404 added 2 commits July 17, 2019 00:46

Update dgemm_kernel_4x8_haswell.S for zen2

0ba29fd

replaced a bunch of vpermpd instructions with vpermilpd and vperm2f128

Update dgemm_kernel_4x8_haswell.S

7a9050d

Update dgemm_kernel_4x8_haswell.S

182b06d

Update dgemm_kernel_4x8_haswell.S

1733f92

Update dgemm_kernel_4x8_haswell.S

211ab03

wjc404 added 2 commits July 17, 2019 23:47

Update dgemm_kernel_4x8_haswell.S

8a074b3

Update dgemm_kernel_4x8_haswell.S

9b04bae

martin-frbg added this to the 0.3.7 milestone Jul 18, 2019

martin-frbg merged commit b0b7600 into OpenMathLib:develop Jul 18, 2019

martin-frbg mentioned this pull request Aug 7, 2019

Replace vpermpd with vpermilpd in the Haswell DTRMM kernel #2206

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update "dgemm_kernel_4x8_haswell.S" for improving performance on zen2 chips #2186

Update "dgemm_kernel_4x8_haswell.S" for improving performance on zen2 chips #2186

wjc404 commented Jul 16, 2019 •

edited

Loading

wjc404 commented Jul 17, 2019 •

edited

Loading

martin-frbg commented Jul 17, 2019

wjc404 commented Jul 17, 2019 •

edited

Loading

wjc404 commented Jul 17, 2019 •

edited

Loading

martin-frbg commented Jul 17, 2019

wjc404 commented Jul 17, 2019 •

edited

Loading

wjc404 commented Jul 17, 2019

TiborGY commented Jul 18, 2019

wjc404 commented Jul 18, 2019

TiborGY commented Jul 18, 2019

stevenwong commented Jul 23, 2019 •

edited

Loading

martin-frbg commented Jul 23, 2019

stevenwong commented Jul 23, 2019 •

edited

Loading

martin-frbg commented Jul 23, 2019

Update "dgemm_kernel_4x8_haswell.S" for improving performance on zen2 chips #2186

Update "dgemm_kernel_4x8_haswell.S" for improving performance on zen2 chips #2186

Conversation

wjc404 commented Jul 16, 2019 • edited Loading

wjc404 commented Jul 17, 2019 • edited Loading

martin-frbg commented Jul 17, 2019

wjc404 commented Jul 17, 2019 • edited Loading

wjc404 commented Jul 17, 2019 • edited Loading

martin-frbg commented Jul 17, 2019

wjc404 commented Jul 17, 2019 • edited Loading

wjc404 commented Jul 17, 2019

TiborGY commented Jul 18, 2019

wjc404 commented Jul 18, 2019

TiborGY commented Jul 18, 2019

stevenwong commented Jul 23, 2019 • edited Loading

martin-frbg commented Jul 23, 2019

stevenwong commented Jul 23, 2019 • edited Loading

martin-frbg commented Jul 23, 2019

wjc404 commented Jul 16, 2019 •

edited

Loading

wjc404 commented Jul 17, 2019 •

edited

Loading

wjc404 commented Jul 17, 2019 •

edited

Loading

wjc404 commented Jul 17, 2019 •

edited

Loading

wjc404 commented Jul 17, 2019 •

edited

Loading

stevenwong commented Jul 23, 2019 •

edited

Loading

stevenwong commented Jul 23, 2019 •

edited

Loading