RFC: Speed up sorting along a dimension of an array (fixes part of #9832) #12823

Merged
merged 1 commit into master from kms/sortdim on Sep 7, 2015

Conversation

kmsquire
Member

  • Sorts 1D array slices along the specified dimension
  • n-D case uses CartesianRanges
  • Also adds (and uses) iteration over components of a CartesianIndex

Before:

julia> a=rand(Int64,30000000,2);

julia> @time sort(a,2);
 45.318352 seconds (600.35 M allocations: 22.369 GB, 4.56% gc time)

After:

julia> a=rand(Int64,30000000,2);

julia> @time sort(a,2);
  3.586818 seconds (120.20 M allocations: 5.374 GB, 5.80% gc time)

Notes:

  1. The "before" time is already 10x faster than the time reported in sortslices(a; dims=1) is slow for numerical arrays #9832. It's not clear to me when this happened. This is because the array size is 10x smaller. Whoops. Numbers updated to match those in sortslices(a; dims=1) is slow for numerical arrays #9832.

  2. Because slice doesn't actually work directly with CartesianIndexes (at least in the way that I needed), I also added methods for iterating over a CartesianIndex, so that I could splat it inside of a call to slice. But I'm unsure if this is desirable--perhaps it wasn't there to discourage users from splatting? (Cc: @mbauman @timholy)

  3. Most of the functionality is in sortdim!(...), which is unexported. sort(a, dim; kws...) calls sort!(a, dim; kws...) (new), which calls sortdim!. (A rough sketch of this structure is shown after these notes.)

    Defining a new function isn't strictly necessary, but "sorting" vs "sorting along a dimension" seem different enough to me that I'd like to suggest deprecating sort(a, dim) for sortdim(a, dim) (in a future PR). Thoughts?
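
A rough sketch of the structure described in notes 2 and 3 (simplified; the body of sortdim! below is illustrative, not the exact code in this PR):

# Iterate CartesianRanges over the dimensions before and after `dim`, splat each
# CartesianIndex into `slice` (which is why iteration over CartesianIndex was added),
# and sort each 1-D slice in place.
function sortdim!(A::AbstractArray, dim::Integer; kws...)
    dimsA = size(A)
    pre_dims  = dimsA[1:dim-1]
    post_dims = dimsA[dim+1:end]
    for post_idx in CartesianRange(post_dims)
        for pre_idx in CartesianRange(pre_dims)
            v = slice(A, pre_idx..., :, post_idx...)  # 1-D view along dimension `dim`
            sort!(v; kws...)
        end
    end
    return A
end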

@kmsquire
Member Author

In #9832, the array size was actually 10x larger. I've updated the timings inline to match #9832.

@kshyatt added the performance (Must go faster) label on Aug 27, 2015
pre_dims = dimsA[1:dim-1]
post_dims = dimsA[dim+1:end]

@inbounds for post_idx in CartesianRange(post_dims)
Member

The dimensionality of CartesianRange(post_dims) cannot be inferred, since dim is a value. It will be faster with a function barrier:

post_range = CartesianRange(post_dims)
pre_range = CartesianRange(pre_dims)
_sortdim!(A, pre_range, post_range; kws...)

and put @noinline in front of _sortdim!.
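
Spelled out, an untested sketch (reusing the illustrative sortdim! from the PR description above):

# The outer method builds the ranges, whose types depend on the runtime value of
# `dim`; the @noinline inner method is compiled for their concrete types, so the
# loops themselves are type-stable.
function sortdim!(A::AbstractArray, dim::Integer; kws...)
    dimsA = size(A)
    pre_range  = CartesianRange(dimsA[1:dim-1])
    post_range = CartesianRange(dimsA[dim+1:end])
    _sortdim!(A, pre_range, post_range; kws...)
end

@noinline function _sortdim!(A, pre_range, post_range; kws...)
    for post_idx in post_range, pre_idx in pre_range
        sort!(slice(A, pre_idx..., :, post_idx...); kws...)
    end
    return A
end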

That said, it seems likely that most uses will be dominated by the actual sort! operation.

Member Author

Makes sense--thanks!

@timholy
Member

timholy commented Aug 27, 2015

First of all, this is clearly a step forward---nice work! Definitely want this.

But we can do even better. For great performance, CartesianIndex-splatting is not ideal:

Base.start(index::CartesianIndex) = 1
Base.next(index::CartesianIndex, i) = (index[i], i+1)
Base.done(index::CartesianIndex, i) = i > length(index)

function iter1(A, pre, post)
    s = 0.0
    dim = length(pre)+1
    for ipost in post
        for i = 1:size(A, dim)
            for ipre in pre
                s += A[ipre, i, ipost]
            end
        end
    end
    s
end

function iter2(A, pre, post)
    s = 0.0
    dim = length(pre)+1
    for ipost in post
        for i = 1:size(A, dim)
            for ipre in pre
                s += A[ipre..., i, ipost...]
            end
        end
    end
    s
end

A = rand(10,10,10,10,10);
pre = CartesianRange(size(A)[1:2])
post = CartesianRange(size(A)[4:5])

# After JITting
julia> @time iter1(A, pre, post)
  0.000123 seconds (5 allocations: 176 bytes)
5016.451040513574

julia> @time iter2(A, pre, post)
  0.048879 seconds (200.00 k allocations: 8.545 MB, 22.05% gc time)
5016.451040513574

EDIT: since this is such a big performance hit, I'd rather not define the "splatting" versions (by making CartesianIndex iterable) at all.

To do better, I see two possible paths:

  • allow sub and slice to accept CartesianIndex directly. This is surely the easiest route: you'd just need @generated functions similar to
    @generated function _getindex{T,N}(l::LinearIndexing, A::AbstractArray{T,N}, I::Union{Real,AbstractArray,Colon,CartesianIndex}...)
        :($(Expr(:meta, :inline)); getindex(A, $(cartindex_exprs(I, :I)...)))
    end
    @generated function _unsafe_getindex{T,N}(l::LinearIndexing, A::AbstractArray{T,N}, I::Union{Real,AbstractArray,Colon,CartesianIndex}...)
        :($(Expr(:meta, :inline)); unsafe_getindex(A, $(cartindex_exprs(I, :I)...)))
    end
    @generated function _setindex!{T,N}(l::LinearIndexing, A::AbstractArray{T,N}, v, I::Union{Real,AbstractArray,Colon,CartesianIndex}...)
        :($(Expr(:meta, :inline)); setindex!(A, v, $(cartindex_exprs(I, :I)...)))
    end
    @generated function _unsafe_setindex!{T,N}(l::LinearIndexing, A::AbstractArray{T,N}, v, I::Union{Real,AbstractArray,Colon,CartesianIndex}...)
        :($(Expr(:meta, :inline)); unsafe_setindex!(A, v, $(cartindex_exprs(I, :I)...)))
    end
  • In the longer term, it might make sense to rethink this whole sub/slice thing yet again: rather than focusing on slicing the container, perhaps we need to make algorithms "Cartesian slice"-aware. For example, if you passed a CartesianRange to sort! like this:
R = CartesianRange(CartesianIndex((7,3,1,4)),CartesianIndex((7,3,500,4)))

this is conceptually equivalent to passing slice(A, 7, 3, 1:500, 4). But that's for another day.

@kmsquire
Member Author

EDIT: since this is such a big performance hit, I'd rather not define the "splatting" versions (by making CartesianIndex iterable) at all.

Okay.

allow sub and slice to accept CartesianIndex directly. This is surely the easiest route: you'd just need @generated functions similar to

@generated function _getindex{T,N}(l::LinearIndexing, A::AbstractArray{T,N}, I::Union{Real,AbstractArray,Colon,CartesianIndex}...)
    :($(Expr(:meta, :inline)); getindex(A, $(cartindex_exprs(I, :I)...)))
end
@generated function _unsafe_getindex{T,N}(l::LinearIndexing, A::AbstractArray{T,N}, I::Union{Real,AbstractArray,Colon,CartesianIndex}...)
    :($(Expr(:meta, :inline)); unsafe_getindex(A, $(cartindex_exprs(I, :I)...)))
end
@generated function _setindex!{T,N}(l::LinearIndexing, A::AbstractArray{T,N}, v, I::Union{Real,AbstractArray,Colon,CartesianIndex}...)
    :($(Expr(:meta, :inline)); setindex!(A, v, $(cartindex_exprs(I, :I)...)))
end
@generated function _unsafe_setindex!{T,N}(l::LinearIndexing, A::AbstractArray{T,N}, v, I::Union{Real,AbstractArray,Colon,CartesianIndex}...)
    :($(Expr(:meta, :inline)); unsafe_setindex!(A, v, $(cartindex_exprs(I, :I)...)))
end

I agree this makes the most sense. subarray.jl seems to contain a number of generated functions already, although none which work with CartesianIndex--hopefully the change is straightforward.

@mbauman
Member

mbauman commented Aug 27, 2015

This is great. I imagine it's still a good deal faster (or will be, once the loops are hidden behind a function barrier), but I'd be curious how it compares to simply using cartesian indexing in mapslices: #12716 (edit: I suppose that could benefit from a function barrier, too).

👍 for adding cartesian indexing to sub and slice. They'll need it in any case for the big indexing-returns-views change.

@kmsquire
Member Author

So, the real gain here is actually in the version specialized for Matrices. I wasn't timing the CartesianIndex version before--it turns out it's actually a bit slower than (the current) mapslices, even with the function barrier (71 seconds when sorting the rows of the 30 million x 2 element array, vs. 45 seconds for the current system). So the CartesianRanges version is definitely not (yet) a win.

Profiling shows that about 2/3 of the time is spent in to_indexes and append_any, which seems to suggest (?) that indexing/splatting are the time culprits.

So, I'll look at adding cartesian indexing to slice (and sub), but I won't get to it until at least this evening, possibly until this weekend. (No pressure, but if someone else wants to work on that first, feel free.)

@mbauman
Member

mbauman commented Aug 27, 2015

Yikes, to_indexes should always be inlined and never show up in a profiling backtrace. How many dimensions were you using in your test?

@timholy
Member

timholy commented Aug 27, 2015

You know, I think we're being silly here: sorting is very sensitive to cache misses. How about this implementation:

function sort(A, dim; alg=?, order=?)
    pdims = (dim, setdiff(1:ndims(A), dim)...)  # put the selected dimension first
    Ap = permutedims(A, pdims)    # note Ap is an Array, no matter what A is
    n = size(Ap, 1)
    for s = 1:n:length(A)
        sort!(Ap, s, s+n-1, alg, order)
    end
    ipermutedims(Ap, pdims)
end

Just need to generalize the first input of all those sort! methods from AbstractVector to AbstractArray (or if you prefer, to Union(AbstractVector,Array)).

I'll wager you that's many times faster than what we have now. Compared to sorting, two calls to permutedims should be free. (EDIT: and they will make everything else much faster.)

@kmsquire changed the title from "RFC: Speed up sorting along a dimension of an array (fixes part of #9832)" to "WIP: Speed up sorting along a dimension of an array (fixes part of #9832)" on Aug 28, 2015
@kmsquire
Member Author

Fleshed out version of @timholy's idea:

function sort(A::AbstractArray, dim::Integer;
              alg::Algorithm=DEFAULT_UNSTABLE,
              lt=isless,
              by=identity,
              rev::Bool=false,
              order::Ordering=Forward,
              initialized::Bool=false)
    pdims = (dim, setdiff(1:ndims(A), dim)...)  # put the selected dimension first
    Ap = permutedims(A, pdims)    # note Ap is an Array, no matter what A is
    n = size(Ap, 1)
    order = ord(lt,by,rev,order)
    for s = 1:n:length(Ap)
        sort!(vec(Ap), s, s+n-1, alg, order)
    end
    ipermutedims(Ap, pdims)
end

For the 30 million x 2 array (and disabling the 2D version from this PR), it's not bad:

julia> @time sort(a,2);
  9.896694 seconds (210.00 M allocations: 5.811 GB, 5.39% gc time)

I don't have time to do any more exploration right now.

@timholy
Member

timholy commented Aug 28, 2015

Using a tweaked version (see below), I get this:
Master:

julia> A = rand(10,10,10,10,10,10);

# After JITting
julia> @time sort(A, 4);
  1.241920 seconds (4.83 M allocations: 206.155 MB, 4.47% gc time)

Tweaked version:

julia> @time sort(A, 4);
  0.050624 seconds (141 allocations: 15.264 MB, 21.65% gc time)

A 25x speedup is not bad...

Here's my tweaked version (not quite 2x faster than yours):

function sort(A::AbstractArray, dim::Integer;
              alg::Algorithm=DEFAULT_UNSTABLE,
              lt=isless,
              by=identity,
              rev::Bool=false,
              order::Ordering=Forward,
              initialized::Bool=false)
    order = ord(lt,by,rev,order)
    if dim != 1
        pdims = (dim, setdiff(1:ndims(A), dim)...)  # put the selected dimension first
        Ap = permutedims(A, pdims)    # note Ap is an Array, no matter what A is
        n = size(Ap, 1)
        Av = vec(Ap)
        sort_chunks!(Av, n, alg, order)
        ipermutedims(Ap, pdims)
    else
        Av = A[:]
        sort_chunks!(Av, size(A,1), alg, order)
        reshape(Av, size(A))
    end
end

@noinline function sort_chunks!(Av, n, alg, order)
    for s = 1:n:length(Av)
        sort!(Av, s, s+n-1, alg, order)
    end
    Av
end

@kmsquire
Member Author

Cool!

@kmsquire
Member Author

@timholy, your version is clearly better than what I was proposing above (even the original version I specialized for Matrix types), so I replaced my PR with yours.

At some point, I'd like to add an in-place version of sort! based on what you wrote, but my first attempt was really slow (I feel like I'm losing my Julia mojo), so it'll take some more looking into.

Not sure if you care, but I attempted to change the commit author to you (since it's almost entirely your code). Locally that's what it showed, but I couldn't see evidence of that in the version I first pushed here... I had to finish rebasing; it now shows as your commit.

@timholy
Member

timholy commented Aug 29, 2015

Thanks for the credit, but certainly no worries had it gone otherwise. And you deserve a lot of the credit, as you're the sort guru and know best how to integrate all this.

As far as implementing sort!, I wouldn't be at all surprised if it were faster in this case to base sort! on sort and just copy the result back on top of the original. When you permute the array, you access each element once; when you sort, you access each element many times. Consequently, it's better to invest in ordering the data for cache friendliness, even if it seems like you're taking an extra step.
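
E.g., a minimal sketch (untested, and not part of this PR):

# Build the in-place version on top of the cache-friendly sort(A, dim) and copy
# the sorted result back over the original array.
function sort!(A::AbstractArray, dim::Integer; kws...)
    copy!(A, sort(A, dim; kws...))
end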

The only reason I can think of to need a sort!(A, dim) would be memory pressure, but to me that seems like poor motivation (we're talking about a factor of 3 here...).

@kmsquire
Member Author

julia> a=rand(Int64,30000000,2);

julia> @time sort(a,2); # after warmup
  0.941762 seconds (63 allocations: 915.530 MB, 6.85% gc time)

That's over 3.5 times as fast as my original version, and more than 45x faster than the original base version. For the record, current master is at

julia> @time sort(a,2);
 34.939812 seconds (630.46 M allocations: 21.479 GB, 5.82% gc time)

The change on master is probably because of the update to mapslices in #12716 (as pointed out by @mbauman), but obviously that still doesn't hold a candle to Tim Holy's specialized version here.

@StefanKarpinski, does a 35x improvement in performance for #9832 count as a bug fix, or should this wait until v0.4.1?

@kmsquire changed the title from "WIP: Speed up sorting along a dimension of an array (fixes part of #9832)" to "RFC: Speed up sorting along a dimension of an array (fixes part of #9832)" on Aug 29, 2015
@kmsquire
Member Author

One of the Travis builds had an OOM problem and a test error (but it seems the two are related). Is anyone (@Keno?, ref: #11553) looking into this type of problem, or should I just restart CI?

JeffBezanson added a commit that referenced this pull request Sep 7, 2015
RFC: Speed up sorting along a dimension of an array (fixes part of #9832)
@JeffBezanson merged commit 47f7b74 into master on Sep 7, 2015
@tkelman deleted the kms/sortdim branch on September 8, 2015 at 12:15
@LilithHafner added the sorting (Put things in order) label on Apr 17, 2023