random sampling from an (abstract)array is slow #20582

CarloLucibello · 2017-02-12T10:10:34Z

It is about three times slower than this alternative implementation:

julia> myrand(v) = (i = ceil(Int,rand()*length(v));  v[i])

julia> @benchmark rand(1:100)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.048 ns (0.00% GC)
  median time:      36.996 ns (0.00% GC)
  mean time:        37.104 ns (0.00% GC)
  maximum time:     66.210 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark myrand(1:100)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.824 ns (0.00% GC)
  median time:      12.332 ns (0.00% GC)
  mean time:        12.286 ns (0.00% GC)
  maximum time:     35.627 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999
  time tolerance:   5.00%
  memory tolerance: 1.00%

StefanKarpinski · 2017-02-13T20:00:38Z

cc @rfourquet, who made sure this stuff was all very fast at one point. We should add BaseBenchmarks for this so that it can't regress again. Thanks for reporting!

martinholters · 2017-02-14T12:13:05Z

Probably ceil(Int,rand()*length(v)) is a bad way to get a uniform distribution between 1 and length(v)? It might be faster than what we have, but might be biased?

CarloLucibello · 2017-02-14T14:19:06Z

> Probably ceil(Int,rand()*length(v)) is a bad way to get a uniform distribution between 1 and length(v)? It might be faster than what we have, but might be biased?

Yes, I have to delve deeper into this, but it could be slightly biased, with a negligible bias for small length(v) which gets higher for bigger vectors.

StefanKarpinski · 2017-02-15T17:23:25Z

That version is also slightly buggy since there's a small chance of getting a zero index and then getting a bounds error upon indexing.
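To make the failure mode concrete, here is a minimal sketch of the two index formulas discussed in this thread. It assumes the uniform draw u lies in [0, 1), so u can be exactly 0.0; unsafe_index and safe_index are hypothetical names used only for illustration.

```julia
# Hypothetical helpers; u stands for a uniform draw assumed to lie in [0, 1).
unsafe_index(u, n) = ceil(Int, u * n)        # formula from the original report
safe_index(u, n)   = 1 + floor(Int, u * n)   # floor-based variant, never 0 for u in [0, 1)

n = 100
@show unsafe_index(0.0, n)            # index produced when the draw is exactly 0.0
@show safe_index(0.0, n)              # smallest index from the floor-based variant
@show safe_index(prevfloat(1.0), n)   # index for the largest Float64 below 1.0
```

The floor-based variant avoids the zero index, but it does not address the bias question raised above.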

bkamins · 2018-08-31T09:26:03Z

This issue keeps recurring on Discourse (https://discourse.julialang.org/t/rand-1-10-vs-int-round-10-rand/14339/9), so I ran some benchmarks.

Here is the code that shows where we have a problem:

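using Random, Statistics  # Statistics provides mean, used below

# Tweak the function so that rand(range) returns the number of loop iterations.
# Random.LessThan is an internal type, so this override is version-specific.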
function Random.rand(rng::AbstractRNG, sp::Random.LessThan)
    i = 0
    while true
        i += 1
        x = rand(rng, sp.s)
        x <= sp.sup && return i-1
    end
end

And now running it for small ranges gives:

julia> [mean(rand(1:j) for i in 1:10^6) for j in 1:16]
16-element Array{Float64,1}:
 1.0
 1.0
 1.333166
 1.0
 1.600809
 1.333687
 1.142155
 1.0
 1.777679
 1.599978
 1.455112
 1.333272
 1.231023
 1.143564
 1.067084
 1.0

I guess it cannot be helped without using div/mod which is more expensive than generating an additional pseudorandom number (@rfourquet did I get your idea in this design right?).
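The measured means above are consistent with a sampler that masks a random word down to the next power of two at or above the range length and rejects out-of-range values. Under that assumption the expected number of draws is nextpow(2, n) / n, for example 4/3 for n = 3 and 16/9 for n = 9. A minimal sketch of that model follows; masked_rand and expected_draws are hypothetical names, not functions from Random.

```julia
using Random

# Hypothetical helper: draw uniformly from 1:n by masking and rejecting.
# Returns the sampled value together with the number of draws it took.
function masked_rand(rng::AbstractRNG, n::Integer)
    n >= 1 || throw(ArgumentError("n must be positive"))
    mask = UInt64(nextpow(2, n) - 1)      # all-ones mask covering 0:(nextpow(2, n) - 1)
    draws = 0
    while true
        draws += 1
        x = rand(rng, UInt64) & mask      # uniform on 0:mask
        x < n && return (Int(x) + 1, draws)
    end
end

# Expected number of draws under this model (mean of a geometric distribution).
expected_draws(n) = nextpow(2, n) / n

rng = MersenneTwister(1)
value, draws = masked_rand(rng, 5)
@show value draws expected_draws(5)
```

Ranges whose length is a power of two never reject, which matches the 1.0 entries at j = 1, 2, 4, 8, 16.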

And now the issue is that for a small n the formula 1+floor(Int,n*rand()) has a small bias. For example, for n=10^6 it is around 10^-10. Detecting this difference statistically would require an astronomical number of samples, so:

Maybe we want to allow for some small bias in exchange for speed for small n? If not in rand, then maybe in something like fastrand? OTOH maybe it is better to do such things in packages. Not sure.
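The size of that bias can be estimated with a simple counting model. The sketch below assumes rand() returns one of 2^bits equally likely values k / 2^bits (bits = 52 is an assumption about the generator, not something stated in this thread) and counts how many of them fall into each of the n buckets using exact integer arithmetic. Bucket counts differ by at most one, so the relative deviation from 1/n is bounded by n / 2^bits. The names bucket_counts and relative_bias_bound are hypothetical.

```julia
# Count how many of the 2^bits equally likely inputs k land in each bucket
# floor(n * k / 2^bits), using exact integer arithmetic (small bits only).
function bucket_counts(n::Integer, bits::Integer)
    counts = zeros(Int, n)
    for k in 0:(2^bits - 1)
        counts[((k * n) >> bits) + 1] += 1
    end
    return counts
end

# Bound on the relative deviation from the ideal probability 1/n in this model.
relative_bias_bound(n, bits) = n / 2.0^bits

counts = bucket_counts(10, 8)   # 256 inputs spread over 10 buckets
@show counts
@show relative_bias_bound(10^6, 52)
```

With 256 inputs and 10 buckets no exact split is possible, so some buckets receive one more input than others; the same effect at 52 bits gives the order of magnitude quoted for n = 10^6.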

rfourquet · 2018-08-31T09:48:01Z

> @rfourquet did I get your idea in this design right?

Indeed, that was the idea of the change in #27560, as generation with MersenneTwister is fast enough that this approach is usually more performant than using div/mod. That said, the default Sampler for other RNGs is still using div/mod (SamplerRangeInt).

rfourquet · 2018-08-31T09:50:44Z

I think that rather than fastrand, we could create an object Biased (or Fast), so that a call would look like rand(Biased(1:10)). That way the rand infrastructure is still available (e.g. we don't have to reimplement array generation for fastrand).
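As an illustration of how such a wrapper could plug into the existing machinery, here is a minimal sketch using the Random.SamplerTrivial hook on a recent Julia 1.x. The Biased type below is a hypothetical stand-in for the proposal, not the code from any commit, and it uses the floor-based formula discussed above, so it carries the same small bias.

```julia
using Random

# Hypothetical wrapper type marking a range for fast, slightly biased sampling.
struct Biased{T<:Integer}
    r::UnitRange{T}
    function Biased(r::UnitRange{T}) where {T<:Integer}
        isempty(r) && throw(ArgumentError("range must be non-empty"))
        new{T}(r)
    end
end

# Tell the rand infrastructure which element type array generation should use.
Base.eltype(::Type{Biased{T}}) where {T} = T

# Scalar generation; sp[] unwraps the Biased object held by the trivial sampler.
function Random.rand(rng::AbstractRNG, sp::Random.SamplerTrivial{Biased{T}}) where {T}
    r = sp[].r
    return first(r) + floor(T, rand(rng) * length(r))
end

rng = MersenneTwister(42)
@show rand(rng, Biased(1:10))       # a single value
@show rand(rng, Biased(1:10), 5)    # array generation comes for free
```

Only the scalar method is defined here; array generation is handled by the generic fallbacks in Random, which is the benefit described above.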

bkamins · 2018-08-31T10:04:55Z

Nice idea. I think that Fast would be nice, and we could avoid exporting it, so users would call it as Random.Fast (as with seed!), since I guess this operation will not be so common.

* implement "nearly division less" algorithm for rand(a:b). Cf. https://arxiv.org/abs/1805.10941. Closes #20582, #29004.
* fix overflow error in tests
* make NDL the default algo
* update NEWS.md
* try make tests pass on 32-bits machines
* add a comment for mod(-s, s)
* remove vestigial transient `fast` function, and update comments
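For readers unfamiliar with the referenced paper, the following is a minimal sketch of the "nearly division less" idea for 32-bit ranges. It is written from the description in https://arxiv.org/abs/1805.10941 and is not the code merged in #29240; ndl_rand is a hypothetical name. The mod(-s, s) threshold is the one expensive operation, and it is only evaluated when the low half of the product falls below s.

```julia
using Random

# Hypothetical sketch: unbiased uniform draw from 0:(s - 1), for s >= 1.
function ndl_rand(rng::AbstractRNG, s::UInt32)
    s == 0 && throw(ArgumentError("s must be non-zero"))
    m = UInt64(rand(rng, UInt32)) * UInt64(s)   # 64-bit product of draw and range size
    l = m % UInt32                              # low 32 bits of the product
    if l < s                                    # rare path: a rejection may be needed
        t = mod(-s, s)                          # equals (2^32 - s) mod s via wraparound
        while l < t                             # reject draws that would cause bias
            m = UInt64(rand(rng, UInt32)) * UInt64(s)
            l = m % UInt32
        end
    end
    return (m >> 32) % UInt32                   # high 32 bits are the result
end

rng = MersenneTwister(7)
@show ndl_rand(rng, UInt32(6))
```

To sample from a:b with this sketch, one would add the result to a after converting the range length to UInt32.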

ararslan added the collections (Data structures holding multiple items, e.g. sets) and performance (Must go faster) labels Feb 13, 2017

tkelman added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label Feb 13, 2017

rfourquet added a commit that referenced this issue Aug 31, 2018

implement rand(fast(a:b)) (fix #20582)

f102213

This uses a faster method than in rand(a:b), which can be biased, depending on the length of a:b.

rfourquet mentioned this issue Aug 31, 2018

RFC: implement a biased rand(a:b) (fix #20582) #28987

Closed

affans mentioned this issue Sep 2, 2018

improve performance of rand(n:m) #29004

Closed

rfourquet added a commit that referenced this issue Sep 18, 2018

implement "nearly division less" algorithm for rand(a:b)

ca8a771

Cf. https://arxiv.org/abs/1805.10941. Closes #20582, #29004.

rfourquet added a commit that referenced this issue Sep 18, 2018

implement "nearly division less" algorithm for rand(a:b)

d778dde

Cf. https://arxiv.org/abs/1805.10941. Closes #20582, #29004.

rfourquet added a commit that referenced this issue Sep 18, 2018

implement "nearly division less" algorithm for rand(a:b)

40ba678

Cf. https://arxiv.org/abs/1805.10941. Closes #20582, #29004.

rfourquet mentioned this issue Sep 18, 2018

implement "nearly division less" algorithm for rand(a:b) #29240

Merged

rfourquet added a commit that referenced this issue Apr 11, 2020

implement "nearly division less" algorithm for rand(a:b)

42fb808

Cf. https://arxiv.org/abs/1805.10941. Closes #20582, #29004.

rfourquet added a commit that referenced this issue Apr 21, 2020

implement "nearly division less" algorithm for rand(a:b)

f17e77d

Cf. https://arxiv.org/abs/1805.10941. Closes #20582, #29004.

rfourquet added a commit that referenced this issue Apr 27, 2020

implement "nearly division less" algorithm for rand(a:b)

68ac8e5

Cf. https://arxiv.org/abs/1805.10941. Closes #20582, #29004.

rfourquet added a commit that referenced this issue May 1, 2020

implement "nearly division less" algorithm for rand(a:b)

5facdd1

Cf. https://arxiv.org/abs/1805.10941. Closes #20582, #29004.

StefanKarpinski closed this as completed in #29240 May 1, 2020

rfourquet added the randomness (Random number generation and the Random stdlib) label May 2, 2020
