
scale! is overly slow for arrays of small or moderate size #6951

Closed · lindahua opened this issue May 24, 2014 · 9 comments
Labels: performance (Must go faster)

Comments

@lindahua (Contributor)

Recently, I studied the performance of some basic functions. To my surprise, I found that scale! is two orders of magnitude slower than a for-loop for small arrays. It is still slower than the for-loop for arrays of 1000 elements.

The script for benchmarking is here: https://gist.github.com/lindahua/30fa06f62a6a13c89007
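For reference, the two variants being compared are presumably of this shape (a sketch only; the actual script is in the gist above, and the names scal1/scal2 follow the output below):

```julia
# Sketch of the two variants (see the linked gist for the real benchmark).
# scal1: a plain Julia for-loop
function scal1!(x::Vector{Float64}, c::Float64)
    for i = 1:length(x)
        x[i] *= c
    end
    x
end

# scal2: delegate to OpenBLAS dscal via BLAS.scal!(n, alpha, x, incx)
scal2!(x::Vector{Float64}, c::Float64) = BLAS.scal!(length(x), c, x, 1)
```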

Below are the results I got on a MacBook Pro:

length = 2^ 1: scal1 ->   519.46 MPS, scal2 ->     3.95 MPS, gain = 0.0076
length = 2^ 2: scal1 ->   689.08 MPS, scal2 ->     5.71 MPS, gain = 0.0083
length = 2^ 3: scal1 ->  1075.97 MPS, scal2 ->    11.40 MPS, gain = 0.0106
length = 2^ 4: scal1 ->  1320.94 MPS, scal2 ->    22.36 MPS, gain = 0.0169
length = 2^ 5: scal1 ->  1245.93 MPS, scal2 ->    44.69 MPS, gain = 0.0359
length = 2^ 6: scal1 ->  1601.80 MPS, scal2 ->    88.94 MPS, gain = 0.0555
length = 2^ 7: scal1 ->  1959.22 MPS, scal2 ->   162.11 MPS, gain = 0.0827
length = 2^ 8: scal1 ->  2193.51 MPS, scal2 ->   310.83 MPS, gain = 0.1417
length = 2^ 9: scal1 ->  2344.08 MPS, scal2 ->   512.61 MPS, gain = 0.2187
length = 2^10: scal1 ->  2377.80 MPS, scal2 ->   906.05 MPS, gain = 0.3810
length = 2^11: scal1 ->  2340.63 MPS, scal2 ->  1469.17 MPS, gain = 0.6277
length = 2^12: scal1 ->  1722.84 MPS, scal2 ->  4297.47 MPS, gain = 2.4944
length = 2^13: scal1 ->  2082.73 MPS, scal2 ->  4058.47 MPS, gain = 1.9486
length = 2^14: scal1 ->  1714.95 MPS, scal2 ->  7186.35 MPS, gain = 4.1904
length = 2^15: scal1 ->  1771.31 MPS, scal2 ->  8947.81 MPS, gain = 5.0515
length = 2^16: scal1 ->  1622.49 MPS, scal2 ->  8086.58 MPS, gain = 4.9840
length = 2^17: scal1 ->  1733.88 MPS, scal2 -> 12604.37 MPS, gain = 7.2695
length = 2^18: scal1 ->  1647.16 MPS, scal2 ->  5671.81 MPS, gain = 3.4434
length = 2^19: scal1 ->  1338.98 MPS, scal2 -> 10321.70 MPS, gain = 7.7086
length = 2^20: scal1 ->   945.82 MPS, scal2 ->  2080.75 MPS, gain = 2.1999
length = 2^21: scal1 ->  1181.88 MPS, scal2 ->  1314.00 MPS, gain = 1.1118
length = 2^22: scal1 ->  1144.39 MPS, scal2 ->  1185.08 MPS, gain = 1.0356
length = 2^23: scal1 ->  1102.79 MPS, scal2 ->  1151.58 MPS, gain = 1.0442
length = 2^24: scal1 ->  1131.30 MPS, scal2 ->  1159.73 MPS, gain = 1.0251

Here, length is the length of the array, and performance is measured in MPS, i.e. millions of numbers processed per second.

When length(x) < 16, the for-loop is over 100x faster than scale!, which relies on BLAS.scal!.

Seriously, I think we should implement scale! as follows:

function scale!{T<:BlasFloat}(x::Array{T}, c::Number)
    n = length(x)
    if n < 2048   # this is a threshold based on my empirical finding
        for i = 1:n
            x[i] *= c
        end
    else
        BLAS.scal!(n, c, x, 1)
    end
    x
end
@stevengj (Member)

Any idea why the BLAS.scal! function has so much overhead? 2048 is a lot for just function-call overhead.

Is it possible that this is over-aggressive threading (similar to #6923), i.e. OpenBLAS is trying to use multiple threads for small sizes? Maybe try turning off multi-threading in OpenBLAS?
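(For reference, two ways to test this hypothesis — the environment variable must be set before Julia starts, and blas_set_num_threads is, I believe, available in Base in this era of Julia:)

```julia
# Option 1: disable OpenBLAS threading before launching Julia, from the shell:
#   $ OPENBLAS_NUM_THREADS=1 julia benchmark.jl
#
# Option 2: limit the BLAS thread count at runtime from within Julia:
blas_set_num_threads(1)
```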

cc: @staticfloat, @xianyi

@ViralBShah (Member)

Even the OpenBLAS copy routine is slower than you'd expect for small inputs. We should try the threading fixes and also other BLAS libraries. For now, we should certainly update the scale! implementation.

@stevengj (Member)

This is the sort of thing that I would prefer to fix upstream; we aren't the only ones this would affect. An if-statement threshold that switches to a simple loop would be a trivial patch to the OpenBLAS sources.

@tknopp (Contributor) commented May 24, 2014

If feasible, this would of course be great to have fixed upstream. Still, it might be good to make it easier to switch between a pure-Julia implementation and the BLAS implementation.

@lindahua (Contributor, Author)

With OPENBLAS_NUM_THREADS set to 1, the situation is a bit better:

length = 2^ 1: scal1 ->   530.85 MPS, scal2 ->    33.61 MPS, gain = 0.0633
length = 2^ 2: scal1 ->   813.55 MPS, scal2 ->    66.43 MPS, gain = 0.0817
length = 2^ 3: scal1 ->  1133.24 MPS, scal2 ->   131.60 MPS, gain = 0.1161
length = 2^ 4: scal1 ->  1386.27 MPS, scal2 ->   267.45 MPS, gain = 0.1929
length = 2^ 5: scal1 ->  1280.36 MPS, scal2 ->   519.74 MPS, gain = 0.4059
length = 2^ 6: scal1 ->  1687.69 MPS, scal2 ->   942.07 MPS, gain = 0.5582
length = 2^ 7: scal1 ->  1957.29 MPS, scal2 ->  1655.19 MPS, gain = 0.8457
length = 2^ 8: scal1 ->  2280.11 MPS, scal2 ->  2573.58 MPS, gain = 1.1287
length = 2^ 9: scal1 ->  2288.25 MPS, scal2 ->  3563.78 MPS, gain = 1.5574
length = 2^10: scal1 ->  2381.39 MPS, scal2 ->  4308.79 MPS, gain = 1.8094
length = 2^11: scal1 ->  2551.94 MPS, scal2 ->  5139.26 MPS, gain = 2.0139
length = 2^12: scal1 ->  2517.86 MPS, scal2 ->  5354.23 MPS, gain = 2.1265
length = 2^13: scal1 ->  2102.25 MPS, scal2 ->  4205.37 MPS, gain = 2.0004
length = 2^14: scal1 ->  2121.25 MPS, scal2 ->  4103.78 MPS, gain = 1.9346
length = 2^15: scal1 ->  2044.96 MPS, scal2 ->  3859.78 MPS, gain = 1.8875
length = 2^16: scal1 ->  1807.59 MPS, scal2 ->  3181.80 MPS, gain = 1.7602
length = 2^17: scal1 ->  1825.50 MPS, scal2 ->  3046.71 MPS, gain = 1.6690
length = 2^18: scal1 ->  1887.04 MPS, scal2 ->  3060.18 MPS, gain = 1.6217
length = 2^19: scal1 ->  1574.67 MPS, scal2 ->  2899.85 MPS, gain = 1.8416
length = 2^20: scal1 ->  1266.03 MPS, scal2 ->  1917.38 MPS, gain = 1.5145
length = 2^21: scal1 ->  1207.17 MPS, scal2 ->  1277.67 MPS, gain = 1.0584
length = 2^22: scal1 ->  1137.79 MPS, scal2 ->  1184.19 MPS, gain = 1.0408
length = 2^23: scal1 ->  1220.54 MPS, scal2 ->  1257.46 MPS, gain = 1.0302
length = 2^24: scal1 ->  1199.89 MPS, scal2 ->  1225.58 MPS, gain = 1.0214

Still, the BLAS routine is way too slow for smaller arrays. The crossover point is now about 256.

Even if the threading problem is fixed upstream, we would still need a cutoff so that scale! switches to the simple for-loop implementation when the array length is below some threshold.

@lindahua (Contributor, Author)

I did more tests with BLAS level-1 routines, with and without OPENBLAS_NUM_THREADS set to 1. Results are summarized below:

| function | threading off | threading on |
| -------- | ------------- | ------------ |
| asum     | 16            | 16           |
| nrm2     | 8             | 8            |
| dot      | 128           | 128          |
| scal     | 16            | 4096         |
| axpy     | 128           | 256          |

The numbers in the table are threshold lengths: when the array length is at or above the threshold, OpenBLAS performs better than a native loop.
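If cutoffs like these were adopted in Base, the dispatch pattern from the scale! proposal above would generalize to the other level-1 routines. A hypothetical sketch for axpy, using the single-threaded cutoff of 128 measured above:

```julia
# Hypothetical: dispatch on length, falling back to BLAS only for larger arrays.
function axpy!(a::Float64, x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    if n < 128                       # empirical cutoff from the table (threading off)
        for i = 1:n
            y[i] += a * x[i]
        end
    else
        BLAS.axpy!(n, a, x, 1, y, 1) # OpenBLAS daxpy
    end
    y
end
```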

The testing script and detailed results are on the gist: https://gist.github.com/lindahua/7bcb03acca22da4e87b9

It looks like scal is the one that is particularly problematic when threading is turned on; for the others, the numbers are more reasonable.

@IainNZ (Member) commented May 25, 2014

Why does the performance gap shrink again at large sizes?

@lindahua (Contributor, Author)

@IainNZ When the array is large enough, the function is memory bound, i.e. dominated by the cost of memory access (cache misses, page faults, etc.). In such cases, SIMD or thread parallelism has little additional impact.
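A quick back-of-the-envelope check of this: an in-place Float64 scale touches 16 bytes per element (an 8-byte load plus an 8-byte store), so the large-array plateau of roughly 1150 MPS in the tables above corresponds to:

```julia
mps = 1150e6            # elements per second at the large-array plateau
bytes_per_elem = 16     # 8-byte read + 8-byte write per Float64
println(mps * bytes_per_elem / 1e9, " GB/s")   # ≈ 18.4 GB/s
```

which is in DRAM-bandwidth territory for a laptop of that era, consistent with both variants converging to the same rate at large sizes.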

@andreasnoack (Member)

I have filed an issue at OpenMathLib/OpenBLAS#375
