
scale! is overly slow for arrays of small or moderate size #6951

Closed · lindahua opened this issue May 24, 2014 · 9 comments
Labels: performance (Must go faster)

Comments

@lindahua (Contributor)

Recently, I studied the performance of some basic functions. To my surprise, I found that scale! is two orders of magnitude slower than a for-loop for small arrays. It is still slower than the for-loop for arrays of 1000 elements.

The script for benchmarking is here: https://gist.github.com/lindahua/30fa06f62a6a13c89007
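For reference, the two variants being compared are presumably of this shape (a sketch only; the actual script is in the gist above, and the names scal1/scal2 follow the output below):

```julia
# Sketch of the two variants (see the linked gist for the real benchmark).
# scal1: a plain Julia for-loop
function scal1!(x::Vector{Float64}, c::Float64)
    for i = 1:length(x)
        x[i] *= c
    end
    x
end

# scal2: delegate to OpenBLAS dscal via BLAS.scal!(n, alpha, x, incx)
scal2!(x::Vector{Float64}, c::Float64) = BLAS.scal!(length(x), c, x, 1)
```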

Below are the results I got on a MacBook Pro:

length = 2^ 1: scal1 ->   519.46 MPS, scal2 ->     3.95 MPS, gain = 0.0076
length = 2^ 2: scal1 ->   689.08 MPS, scal2 ->     5.71 MPS, gain = 0.0083
length = 2^ 3: scal1 ->  1075.97 MPS, scal2 ->    11.40 MPS, gain = 0.0106
length = 2^ 4: scal1 ->  1320.94 MPS, scal2 ->    22.36 MPS, gain = 0.0169
length = 2^ 5: scal1 ->  1245.93 MPS, scal2 ->    44.69 MPS, gain = 0.0359
length = 2^ 6: scal1 ->  1601.80 MPS, scal2 ->    88.94 MPS, gain = 0.0555
length = 2^ 7: scal1 ->  1959.22 MPS, scal2 ->   162.11 MPS, gain = 0.0827
length = 2^ 8: scal1 ->  2193.51 MPS, scal2 ->   310.83 MPS, gain = 0.1417
length = 2^ 9: scal1 ->  2344.08 MPS, scal2 ->   512.61 MPS, gain = 0.2187
length = 2^10: scal1 ->  2377.80 MPS, scal2 ->   906.05 MPS, gain = 0.3810
length = 2^11: scal1 ->  2340.63 MPS, scal2 ->  1469.17 MPS, gain = 0.6277
length = 2^12: scal1 ->  1722.84 MPS, scal2 ->  4297.47 MPS, gain = 2.4944
length = 2^13: scal1 ->  2082.73 MPS, scal2 ->  4058.47 MPS, gain = 1.9486
length = 2^14: scal1 ->  1714.95 MPS, scal2 ->  7186.35 MPS, gain = 4.1904
length = 2^15: scal1 ->  1771.31 MPS, scal2 ->  8947.81 MPS, gain = 5.0515
length = 2^16: scal1 ->  1622.49 MPS, scal2 ->  8086.58 MPS, gain = 4.9840
length = 2^17: scal1 ->  1733.88 MPS, scal2 -> 12604.37 MPS, gain = 7.2695
length = 2^18: scal1 ->  1647.16 MPS, scal2 ->  5671.81 MPS, gain = 3.4434
length = 2^19: scal1 ->  1338.98 MPS, scal2 -> 10321.70 MPS, gain = 7.7086
length = 2^20: scal1 ->   945.82 MPS, scal2 ->  2080.75 MPS, gain = 2.1999
length = 2^21: scal1 ->  1181.88 MPS, scal2 ->  1314.00 MPS, gain = 1.1118
length = 2^22: scal1 ->  1144.39 MPS, scal2 ->  1185.08 MPS, gain = 1.0356
length = 2^23: scal1 ->  1102.79 MPS, scal2 ->  1151.58 MPS, gain = 1.0442
length = 2^24: scal1 ->  1131.30 MPS, scal2 ->  1159.73 MPS, gain = 1.0251

Here, length is the length of the array, and performance is measured in MPS, i.e. millions of numbers processed per second.

When length(x) < 16, the for-loop is over 100x faster than scale!, which relies on BLAS.scal!.

Seriously, I think we should implement scale! as follows:

function scale!{T<:BlasFloat}(x::Array{T}, c::Number)
    n = length(x)
    if n < 2048   # this is a threshold based on my empirical finding
        for i = 1:n
            x[i] *= c
        end
    else
        BLAS.scal!(n, c, x, 1)
    end
    x
end
@stevengj (Member)

Any idea why the BLAS.scal! function has so much overhead? 2048 is a lot for just function-call overhead.

Is it possible that this is over-aggressive threading (similar to #6923), i.e. OpenBLAS is trying to use multiple threads for small sizes? Maybe try turning off multi-threading in OpenBLAS?
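(For reference, two ways to test this hypothesis — the environment variable must be set before Julia starts, and blas_set_num_threads is, I believe, available in Base in this era of Julia:)

```julia
# Option 1: disable OpenBLAS threading before launching Julia, from the shell:
#   $ OPENBLAS_NUM_THREADS=1 julia benchmark.jl
#
# Option 2: limit the BLAS thread count at runtime from within Julia:
blas_set_num_threads(1)
```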

cc: @staticfloat, @xianyi

@ViralBShah (Member)

Even the OpenBLAS copy routine is slower than you'd expect for small inputs. We should try the threading fixes and also other BLAS libraries. For now, we should certainly update the scale! implementation.

@stevengj (Member)

This is the sort of thing that I would prefer to fix upstream; we aren't the only ones this would affect. An if-statement threshold that switches to a simple loop would be a trivial patch to the OpenBLAS sources.

@tknopp (Contributor) commented May 24, 2014

If feasible, this would of course be great to have fixed upstream. Still, it might be good to make it easier to switch between a pure-Julia implementation and the BLAS implementation.

@lindahua (Contributor, Author)

With OPENBLAS_NUM_THREADS set to 1, the situation is a bit better:

length = 2^ 1: scal1 ->   530.85 MPS, scal2 ->    33.61 MPS, gain = 0.0633
length = 2^ 2: scal1 ->   813.55 MPS, scal2 ->    66.43 MPS, gain = 0.0817
length = 2^ 3: scal1 ->  1133.24 MPS, scal2 ->   131.60 MPS, gain = 0.1161
length = 2^ 4: scal1 ->  1386.27 MPS, scal2 ->   267.45 MPS, gain = 0.1929
length = 2^ 5: scal1 ->  1280.36 MPS, scal2 ->   519.74 MPS, gain = 0.4059
length = 2^ 6: scal1 ->  1687.69 MPS, scal2 ->   942.07 MPS, gain = 0.5582
length = 2^ 7: scal1 ->  1957.29 MPS, scal2 ->  1655.19 MPS, gain = 0.8457
length = 2^ 8: scal1 ->  2280.11 MPS, scal2 ->  2573.58 MPS, gain = 1.1287
length = 2^ 9: scal1 ->  2288.25 MPS, scal2 ->  3563.78 MPS, gain = 1.5574
length = 2^10: scal1 ->  2381.39 MPS, scal2 ->  4308.79 MPS, gain = 1.8094
length = 2^11: scal1 ->  2551.94 MPS, scal2 ->  5139.26 MPS, gain = 2.0139
length = 2^12: scal1 ->  2517.86 MPS, scal2 ->  5354.23 MPS, gain = 2.1265
length = 2^13: scal1 ->  2102.25 MPS, scal2 ->  4205.37 MPS, gain = 2.0004
length = 2^14: scal1 ->  2121.25 MPS, scal2 ->  4103.78 MPS, gain = 1.9346
length = 2^15: scal1 ->  2044.96 MPS, scal2 ->  3859.78 MPS, gain = 1.8875
length = 2^16: scal1 ->  1807.59 MPS, scal2 ->  3181.80 MPS, gain = 1.7602
length = 2^17: scal1 ->  1825.50 MPS, scal2 ->  3046.71 MPS, gain = 1.6690
length = 2^18: scal1 ->  1887.04 MPS, scal2 ->  3060.18 MPS, gain = 1.6217
length = 2^19: scal1 ->  1574.67 MPS, scal2 ->  2899.85 MPS, gain = 1.8416
length = 2^20: scal1 ->  1266.03 MPS, scal2 ->  1917.38 MPS, gain = 1.5145
length = 2^21: scal1 ->  1207.17 MPS, scal2 ->  1277.67 MPS, gain = 1.0584
length = 2^22: scal1 ->  1137.79 MPS, scal2 ->  1184.19 MPS, gain = 1.0408
length = 2^23: scal1 ->  1220.54 MPS, scal2 ->  1257.46 MPS, gain = 1.0302
length = 2^24: scal1 ->  1199.89 MPS, scal2 ->  1225.58 MPS, gain = 1.0214

Still, the BLAS routine is way too slow for smaller arrays. The crossover point is now about 256.

Even if the threading problem is fixed upstream, we would still need a cutoff so that scale! switches to the simple for-loop implementation when the array length is below some threshold.

@lindahua (Contributor, Author)

I did more tests with BLAS level-1 routines, with and without OPENBLAS_NUM_THREADS set to 1. Results are summarized below:

| function | threading off | threading on |
| -------- | ------------- | ------------ |
| asum     | 16            | 16           |
| nrm2     | 8             | 8            |
| dot      | 128           | 128          |
| scal     | 16            | 4096         |
| axpy     | 128           | 256          |

The numbers in the table are threshold lengths: when the array length is at or above the threshold, OpenBLAS performs better than a native loop.
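If cutoffs like these were adopted in Base, the dispatch pattern from the scale! proposal above would generalize to the other level-1 routines. A hypothetical sketch for axpy, using the single-threaded cutoff of 128 measured above:

```julia
# Hypothetical: dispatch on length, falling back to BLAS only for larger arrays.
function axpy!(a::Float64, x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    if n < 128                       # empirical cutoff from the table (threading off)
        for i = 1:n
            y[i] += a * x[i]
        end
    else
        BLAS.axpy!(n, a, x, 1, y, 1) # OpenBLAS daxpy
    end
    y
end
```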

The testing script and detailed results are on the gist: https://gist.github.com/lindahua/7bcb03acca22da4e87b9

It looks like scal is the one that is particularly problematic when threading is turned on; for the others, the numbers are more reasonable.

@IainNZ (Member) commented May 25, 2014

Why does the performance gap shrink again at large sizes?

@lindahua (Contributor, Author)

@IainNZ When the array is large enough, the function is memory bound, i.e. dominated by the cost of memory access (cache misses, page faults, etc.). In such cases, SIMD or thread parallelism has little additional impact.
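A quick back-of-the-envelope check of this: an in-place Float64 scale touches 16 bytes per element (an 8-byte load plus an 8-byte store), so the large-array plateau of roughly 1150 MPS in the tables above corresponds to:

```julia
mps = 1150e6            # elements per second at the large-array plateau
bytes_per_elem = 16     # 8-byte read + 8-byte write per Float64
println(mps * bytes_per_elem / 1e9, " GB/s")   # ≈ 18.4 GB/s
```

which is in DRAM-bandwidth territory for a laptop of that era, consistent with both variants converging to the same rate at large sizes.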

@andreasnoack (Member)

I have filed an issue at OpenMathLib/OpenBLAS#375
