A question about the relation between "openblas_set_num_threads", CPU usage, and time cost #1544

Closed
sunbinbin1991 opened this issue May 4, 2018 · 13 comments

@sunbinbin1991 commented May 4, 2018

First, some background: I need to accelerate a matrix calculation. My hardware is an RK3288, which has a quad-core Cortex-A17 CPU. With the default settings the calculation takes much more time and uses much more CPU, as shown below.

CPU cost: [screenshot of CPU usage]

However, when I set the number of OpenBLAS threads like this:

openblas_set_num_threads(1);

it takes much less time and the CPU usage is much lower:

CPU cost: [screenshot of CPU usage]

Clearly, both the time cost and the CPU usage went down. What explains this behavior?

@martin-frbg (Collaborator)

I suspect that without setting the number of threads it will try to use all four cores, causing conflicts with the other processes on your system (leading to rescheduling and invalidating caches). How is the performance with 2 or 3 threads (leaving one core for everybody else)? It could also be that one of the OpenBLAS functions you use is needlessly trying to parallelize its workload, so it would be useful to know which functions you are calling, and how big your matrix sizes are.

@sunbinbin1991 (Author) commented May 4, 2018

@martin-frbg Thanks for your quick reply. The default openblas_num_threads prints as 4. I gave it a try: with 2 or 3 threads the performance is similar to 4 threads, and the CPU and time cost are still much higher than with 1 thread. Here is the function I use:

const int inc_x = 1;
const int inc_y = 1;
const float alpha = 1.0f;
const float beta = 0.0f;
/* scores = alpha * features * feature_to_compare + beta * scores */
cblas_sgemv(CblasRowMajor, CblasNoTrans, features_in_lib, featurelen, alpha,
            (float *) features.data(), featurelen, feature_to_compare, inc_x,
            beta, scores, inc_y);

The matrix sizes here are: features is 6000 x 128 and feature_to_compare is 128 x 1.
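For reference, here is a self-contained sketch of that call with the stated sizes (6000 x 128 matrix, 128 x 1 vector), timed once per thread count. The buffer names mirror the snippet above, the data values are dummies, and openblas_set_num_threads() is assumed to be declared by OpenBLAS's cblas.h (build with something like gcc repro.c -lopenblas):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    const int features_in_lib = 6000;  /* rows (M) */
    const int featurelen = 128;        /* columns (N), also the leading dimension */
    float *features = malloc(sizeof(float) * features_in_lib * featurelen);
    float *feature_to_compare = malloc(sizeof(float) * featurelen);
    float *scores = malloc(sizeof(float) * features_in_lib);
    for (int i = 0; i < features_in_lib * featurelen; i++) features[i] = 1.0f;
    for (int i = 0; i < featurelen; i++) feature_to_compare[i] = 1.0f;

    for (int t = 1; t <= 4; t++) {     /* sweep the thread counts under discussion */
        openblas_set_num_threads(t);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_sgemv(CblasRowMajor, CblasNoTrans, features_in_lib, featurelen,
                    1.0f, features, featurelen, feature_to_compare, 1,
                    0.0f, scores, 1);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("threads=%d  %.6f s\n", t,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    }
    free(features); free(feature_to_compare); free(scores);
    return 0;
}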

@martin-frbg (Collaborator) commented May 4, 2018

There are two old tickets on sgemv performance compared to other BLAS implementations that were never conclusively solved (mostly from lack of data, it seems), #532 and #821. On the other hand, the GEMM threading algorithm was changed since the last release (PRs #1316 and #1320). Which version of OpenBLAS are you using - the 0.2.20 release or the current develop branch from git?

@sunbinbin1991 (Author)

I just pulled a version of OpenBLAS from around Dec 2017, not a release version.

@martin-frbg (Collaborator)

Actually my recollection was wrong; the two PRs I mentioned did not affect GEMV at all, sorry.

@brada4 (Contributor) commented May 5, 2018

Could you try to record the source of the waste with 'perf record' and 'perf report', and paste the top page of the latter?
Also note that some Spectre patches rewrite user-mode code and can rob half of the speed; it is best to recompile with a Spectre-aware gcc when running on a patched kernel, so that nop sleds are present in the places to be live-patched.

@martin-frbg (Collaborator) commented May 20, 2018

Running sgemv.goto 5000 6000 1000 from the benchmarks directory on an Asus Tinkerboard, I get:

THREADS       1                       2                       3                       4
 5000x 5000   344.05 MFlops  0.145s   417.69 MFlops  0.119s   558.78 MFlops  0.089s   624.05 MFlops  0.080s
 6000x 6000   312.73 MFlops  0.230s   432.10 MFlops  0.167s   532.67 MFlops  0.135s   576.67 MFlops  0.125s
10000x10000   197.60 MFlops  1.012s   321.13 MFlops  0.622s   431.11 MFlops  0.463s   514.34 MFlops  0.389s
15000x15000   147.25 MFlops  3.055s   257.74 MFlops  1.745s   356.05 MFlops  1.263s   437.21 MFlops  1.030s

The perf report for this benchmark is dominated by the calls to random() used to set up the matrix, but at least the mutex lock/unlock calls seem to make an insignificant contribution compared to the gemv kernel operations. Due to the fixed 2GB memory of the tinkerboard, the upper limit for the matrix size is around 21000x21000. perf stat does not provide much additional insight; I need to rerun this with a matrix of ones to eliminate the calls to random() that may be driving the number of context switches.
"cpu migrations" as another cost factor turned out not to be influenced by building without the default NO_AFFINITY=1; I will need to try taskset for this.
Hardware issues may play a role as well, assuming that insufficient cooling causes the cpus to throttle down. At least I noticed that I could easily crash the tinkerboard (with just its included small heatsink) by trying to do the default parallel build... (Edit: tinkerboard problem solved by a better power supply)
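For reading the numbers above: sgemv performs roughly 2*M*N floating-point operations, so e.g. 2*5000*5000 / 0.145 s ≈ 345 MFlops, consistent with the first cell of the table. As for the "matrix of ones" idea, the point is simply to make the benchmark's setup deterministic so random() no longer dominates the profile; a hypothetical helper sketching the change (fill_ones is not part of the benchmark source):

#include <stddef.h>

/* Fill an m x n float matrix with ones instead of values drawn from
 * random(), so the perf profile reflects the gemv kernel rather than
 * the matrix setup. */
static void fill_ones(float *a, size_t m, size_t n) {
    for (size_t i = 0; i < m * n; i++)
        a[i] = 1.0f;
}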

@sunbinbin1991 (Author)

Thanks for your kind reply @martin-frbg. Actually, the time cost really matters a lot for me. I want to keep the CPU usage as low as possible because of the limits of my hardware, so are there some parameters I can use to reduce CPU usage? Do I need to rebuild for that?
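(Note: OpenBLAS also honors the OPENBLAS_NUM_THREADS environment variable at library startup, so the thread count can be capped at launch time without rebuilding, e.g. OPENBLAS_NUM_THREADS=1 ./your_program, where your_program stands in for the actual binary.)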

@martin-frbg (Collaborator)

I did not find a problem with the test, so perhaps there is something special about your program that makes it behave differently from the gemv.goto benchmark.
Do you see the same problem when you try sgemv.goto 21000 21000 1 like I did? (You need to run make in the benchmark directory first to get it.)

@martin-frbg (Collaborator)

I do not see anything suspicious with xianyi's dgemv test case from #532 (comment) either (running with either "6000 128" or "128 6000" for 1000 iterations each). It is unlikely that using cblas_sgemv instead of sgemv adds enough overhead to cause your problem. Perhaps retry with the current 0.3.0 release?

@xsacha (Contributor) commented Jun 16, 2018

I think the difference comes from cross-compiling. Did you cross-compile for the test on the Tinker Board?

@martin-frbg (Collaborator)

Good point - I compile locally. (I do see an improvement now with oon3m0oo's recent commit that cleaned up the memory initialization on thread startup, so sunbinbin1991 may want to try the current develop tree if he is still interested.)

@sunbinbin1991 (Author) commented Jun 27, 2018

As for my problem, the most likely reason is the limitation of my hardware. In my project OpenBLAS is not the only thing running; some other applications are active as well, which is probably why the "multi-thread" runs cost much more time. Actually, even with openblas_set_num_threads(1), my matrix calculation takes less than 1 ms, and that is enough for me.
I made a new build and test on different hardware; there, for "sgemv", openblas_set_num_threads(4) is better than openblas_set_num_threads(1).
Here is my hardware information:

CPU: 88 x Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz

Here are my results:

10000x512x1 0.008620 s 1187.935059 MFLOPS thread = 1
10000x512x1 0.005167 s 1981.807617 MFLOPS thread = 2
10000x512x1 0.003739 s 2738.700195 MFLOPS thread = 4
10000x512x1 0.002862 s 3577.917480 MFLOPS thread = 8
10000x512x1 0.002209 s 4635.581543 MFLOPS thread = 16
