RFC: Speed up sorting along a dimension of an array (fixes part of #9832) #12823
Conversation
pre_dims = dimsA[1:dim-1]
post_dims = dimsA[dim+1:end]

@inbounds for post_idx in CartesianRange(post_dims)
The dimensionality of CartesianRange(post_dims) cannot be inferred, since dim is a value. It will be faster with a function barrier:

post_range = CartesianRange(post_dims)
pre_range = CartesianRange(pre_dims)
_sortdim!(A, pre_range, post_range; kws...)

and put @noinline in front of _sortdim!.
That said, it seems likely that most uses will be dominated by the actual sort! operation.
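For readers unfamiliar with the pattern, here is a minimal sketch of the suggested function barrier; the sortdim!/_sortdim! split, the keyword handling, and the empty loop body are illustrative, not this PR's actual code:

function sortdim!(A::AbstractArray, dim::Integer; kws...)
    dimsA = size(A)
    # dim is a runtime value, so the dimensionality of these ranges
    # cannot be inferred at this point...
    pre_range  = CartesianRange(dimsA[1:dim-1])
    post_range = CartesianRange(dimsA[dim+1:end])
    # ...but once they are passed as arguments, the helper below is
    # compiled for their concrete types.
    _sortdim!(A, pre_range, post_range; kws...)
end

@noinline function _sortdim!(A, pre_range, post_range; kws...)
    @inbounds for post_idx in post_range, pre_idx in pre_range
        # sort the 1-D fiber selected by (pre_idx, :, post_idx) here
    end
    return A
end

The @noinline keeps the helper as a separately compiled, specialized method instead of being folded back into the type-unstable caller.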
Makes sense--thanks!
First of all, this is clearly a step forward---nice work! Definitely want this. But we can do even better. For great performance, CartesianIndex-splatting is not ideal:

Base.start(index::CartesianIndex) = 1
Base.next(index::CartesianIndex, i) = (index[i], i+1)
Base.done(index::CartesianIndex, i) = i > length(index)
function iter1(A, pre, post)
    s = 0.0
    dim = length(pre)+1
    for ipost in post
        for i = 1:size(A, dim)
            for ipre in pre
                s += A[ipre, i, ipost]
            end
        end
    end
    s
end
function iter2(A, pre, post)
    s = 0.0
    dim = length(pre)+1
    for ipost in post
        for i = 1:size(A, dim)
            for ipre in pre
                s += A[ipre..., i, ipost...]
            end
        end
    end
    s
end
A = rand(10,10,10,10,10);
pre = CartesianRange(size(A)[1:2])
post = CartesianRange(size(A)[4:5])
# After JITting
julia> @time iter1(A, pre, post)
0.000123 seconds (5 allocations: 176 bytes)
5016.451040513574
julia> @time iter2(A, pre, post)
0.048879 seconds (200.00 k allocations: 8.545 MB, 22.05% gc time)
5016.451040513574

EDIT: since this is such a big performance hit, I'd rather not define the "splatting" versions (by making CartesianIndex iterable).

To do better, I see two possible paths:
R = CartesianRange(CartesianIndex((7,3,1,4)), CartesianIndex((7,3,500,4)))

this is conceptually equivalent to passing …
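A small illustration of this first path, with made-up array dimensions, using the 0.4-era CartesianRange/CartesianIndex API from this thread:

A = rand(10, 5, 500, 6)

# Start and stop agree in every dimension except the third:
R = CartesianRange(CartesianIndex((7,3,1,4)), CartesianIndex((7,3,500,4)))

# Iterating R visits A[7,3,k,4] for k = 1:500 -- the same elements as a
# view along dimension 3 -- without any splatting.
s = 0.0
for I in R
    s += A[I]
end
isapprox(s, sum(slice(A, 7, 3, :, 4)))   # true (up to summation order)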
Okay. I agree this makes the most sense.
This is great. I imagine it's still a good deal faster (or will be, once the loops are hidden behind a function barrier), but I'd be curious how it compares to simply using cartesian indexing.

👍 for adding cartesian indexing to sub and slice. They'll need it in any case for the big indexing-returns-views change.
So, the real gain here is actually in the version specialized for Matrices. I wasn't timing the …. Profiling shows that about 2/3 of the time is spent in ….

So, I'll look at adding cartesian indexing to sub and slice.
Yikes, …
You know, I think we're being silly here: sorting is very sensitive to cache misses. How about this implementation:

function sort(A, dim; alg=?, order=?)
    pdims = (dim, setdiff(1:ndims(A), dim)...)  # put the selected dimension first
    Ap = permutedims(A, pdims)                  # note Ap is an Array, no matter what A is
    n = size(Ap, 1)
    for s = 1:n:length(A)
        sort!(Ap, s, s+n-1, alg, order)
    end
    ipermutedims(Ap, pdims)
end

Just need to generalize the first input of all those sort! methods. I'll wager you that's many times faster than what we have now. Compared to sorting, two calls to permutedims should be cheap.
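To make the cache argument concrete, a tiny sketch (the 4x3 array is made up; it only shows the memory layout that the permute buys):

A  = rand(4, 3)
Ap = permutedims(A, (2, 1))   # dim 2 of A becomes dim 1 of Ap

# Sorting A along dim 2 touches A[i,1], A[i,2], A[i,3], which sit
# size(A,1) elements apart in column-major memory. After the permute the
# same values are Ap[1,i], Ap[2,i], Ap[3,i] -- adjacent, so each chunk
# handed to sort! is a contiguous, stride-1 run.
strides(A)    # (1, 4)
strides(Ap)   # (1, 3)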
Fleshed out version of @timholy's idea:

function sort(A::AbstractArray, dim::Integer;
              alg::Algorithm=DEFAULT_UNSTABLE,
              lt=isless,
              by=identity,
              rev::Bool=false,
              order::Ordering=Forward,
              initialized::Bool=false)
    pdims = (dim, setdiff(1:ndims(A), dim)...)  # put the selected dimension first
    Ap = permutedims(A, pdims)                  # note Ap is an Array, no matter what A is
    n = size(Ap, 1)
    order = ord(lt,by,rev,order)
    for s = 1:n:length(Ap)
        sort!(vec(Ap), s, s+n-1, alg, order)
    end
    ipermutedims(Ap, pdims)
end

For the 30 million x 2 array (and disabling the 2D version from this PR), it's not bad:

julia> @time sort(a,2);
  9.896694 seconds (210.00 M allocations: 5.811 GB, 5.39% gc time)

I don't have time to do any more exploration right now.
Using a tweaked version (see below), I get this:

julia> A = rand(10,10,10,10,10,10);

# After JITting
julia> @time sort(A, 4);
  1.241920 seconds (4.83 M allocations: 206.155 MB, 4.47% gc time)

Tweaked version:

julia> @time sort(A, 4);
  0.050624 seconds (141 allocations: 15.264 MB, 21.65% gc time)

A 25x speedup is not bad... Here's my tweaked version (not quite 2x faster than yours):

function sort(A::AbstractArray, dim::Integer;
              alg::Algorithm=DEFAULT_UNSTABLE,
              lt=isless,
              by=identity,
              rev::Bool=false,
              order::Ordering=Forward,
              initialized::Bool=false)
    order = ord(lt,by,rev,order)
    if dim != 1
        pdims = (dim, setdiff(1:ndims(A), dim)...)  # put the selected dimension first
        Ap = permutedims(A, pdims)                  # note Ap is an Array, no matter what A is
        n = size(Ap, 1)
        Av = vec(Ap)
        sort_chunks!(Av, n, alg, order)
        ipermutedims(Ap, pdims)
    else
        Av = A[:]
        sort_chunks!(Av, size(A,1), alg, order)
        reshape(Av, size(A))
    end
end

@noinline function sort_chunks!(Av, n, alg, order)
    for s = 1:n:length(Av)
        sort!(Av, s, s+n-1, alg, order)
    end
    Av
end
Cool!
@timholy, your version is clearly better than what I was proposing above (even the original version I specialized for Matrix).

At some point, I'd like to add an in-place version of sort(A, dim).
Thanks for the credit, but certainly no worries had it gone otherwise. And you deserve a lot of the credit, as you're the sort guru and know best how to integrate all this.

As far as implementing …. The only reason I can think of to need a ….
julia> a=rand(Int64,30000000,2);
julia> @time sort(a,2); # after warmup
  0.941762 seconds (63 allocations: 915.530 MB, 6.85% gc time)

Which is over 3.5 times as fast as my original version, and >45x the original base version. For the record, current master is at:

julia> @time sort(a,2);
 34.939812 seconds (630.46 M allocations: 21.479 GB, 5.82% gc time)

The change is probably because of the update to ….

@StefanKarpinski, does a 35x improvement in performance for #9832 count as a bug fix, or should this wait until v0.4.1?
RFC: Speed up sorting along a dimension of an array (fixes part of #9832)
This speeds up sorting along a dimension of an array using CartesianRanges and CartesianIndexes.

Before:
After:

Notes:

- The "before" time is already 10x faster than the time reported in #9832 (sortslices(a; dims=1) is slow for numerical arrays). It's not clear to me when this happened. This is because the array size is 10x smaller. Whoops. Numbers updated to match those in #9832.
- Because slice doesn't actually work directly with CartesianIndexes (at least in the way that I needed), I also added methods for iterating over a CartesianIndex, so that I could splat it inside of a call to slice (see the sketch after these notes). But I'm unsure if this is desirable--perhaps it wasn't there to discourage users from splatting? (Cc: @mbauman @timholy)
- Most of the functionality is in sortdim!(...), which is unexported. sort(a, dim; kws...) calls sort!(a, dim; kws...) (new), which calls sortdim!.
- Defining a new function isn't strictly necessary, but "sorting" vs "sorting along a dimension" seem different enough to me that I'd like to suggest deprecating sort(a, dim) for sortdim(a, dim) (in a future PR). Thoughts?
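As mentioned in the notes above, here is a minimal sketch of the splatting that the added CartesianIndex iteration enables; the variable names are illustrative, not the PR's actual sortdim! code:

A = rand(3, 4, 5)
dim = 2
pre  = CartesianRange(size(A)[1:dim-1])
post = CartesianRange(size(A)[dim+1:end])

for post_idx in post, pre_idx in pre
    # With start/next/done defined on CartesianIndex, the index can be
    # splatted into slice to select one 1-D fiber along dim:
    v = slice(A, pre_idx..., :, post_idx...)
    sort!(v)
end

As discussed in the conversation, this splatting carries a significant performance cost, which is why the permutedims-based approach above ended up looking more attractive.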