Faster, indices-aware circshift (and non-allocating circshift!) #17861

Merged
timholy merged 1 commit into master from teh/circshift on Aug 9, 2016


timholy
Member

@timholy timholy commented Aug 6, 2016

On the benchmark in #17581, this is about 6x faster. On large vectors, this is also about 25% faster than the truly inplace 1d implementation of circshift! in #16032.

Fixes #16032, fixes #17581.

With regard to backporting to 0.5, perhaps the only question is whether we should export circshift! (if we're serious about "feature freeze").

@tkelman
Contributor

tkelman commented Aug 6, 2016

(if we're serious about "feature freeze").

No new exports on the release branch.

@tkelman tkelman added the potential benchmark Could make a good benchmark in BaseBenchmarks label Aug 6, 2016
@timholy timholy force-pushed the teh/circshift branch 2 times, most recently from 808c2cb to 942dc59 Compare August 7, 2016 20:44
@timholy
Member Author

timholy commented Aug 7, 2016

Updated to guard against aliasing.

While I state that this closes #16032, I should clarify that it's a bit different from what @musm was requesting. The reason I implemented this with separate src and dest arrays is that, even if the 2k->2 fusion that @stevengj mentioned is doable (and I bet it is), using such an implementation as the foundation for the out-of-place algorithm would require 3 full passes through the array: one to make a copy and then 2 (at least) for the inplace algorithm. In contrast, this implementation accomplishes it with a single pass, and is (unsurprisingly) faster. Those who want a nonallocating algorithm can preallocate the output, and swap buffers if they are doing this repeatedly in a loop. (Not that anything prevents anyone from also implementing the inplace version if they want to.)
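
The preallocate-and-swap pattern described above might look roughly like the following. This is a minimal sketch, assuming a current Julia in which `circshift!(dest, src, shifts)` is available from Base; `shift_repeatedly` is a hypothetical helper name used only for illustration and is not part of this PR.

```julia
# Hypothetical helper: apply the same circular shift `nsteps` times while
# allocating only two buffers up front (assumption: Base exports circshift!).
function shift_repeatedly(a::AbstractArray, shifts, nsteps::Integer)
    src = copy(a)        # working copy, so the caller's array is untouched
    dest = similar(a)    # preallocated output buffer
    for _ in 1:nsteps
        # dest and src must be distinct arrays (the PR guards against aliasing)
        circshift!(dest, src, shifts)
        src, dest = dest, src   # swap the bindings; no data is copied here
    end
    return src           # after the swap, `src` holds the latest result
end
```

Each iteration makes a single pass over the data, and the swap only exchanges which name refers to which buffer.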

@musm
Contributor

musm commented Aug 7, 2016

Thanks for this. I'm trying to figure out the code. To clarify, is this using the "chase-the-cycles" approach? And is your suggestion that a truly in-place version is possible following stevengj's "two reversal passes" algorithm, but that your point here, "3 full passes through the array: one to make a copy and then 2 (at least) for the inplace algorithm", means there is no benefit to the truly in-place algorithm unless the arrays are so large that creating another copy should be avoided at all costs?

@timholy
Member Author

timholy commented Aug 7, 2016

"chase the cycles" means something different: it's one implementation (a less efficient one) of a truly inplace version. https://en.wikipedia.org/wiki/In-place_matrix_transposition describes a chase-the-cycles algorithm for inplace transposition, and that should give you the basic idea. chase-the-cycles algorithms seem elegant at first blush, but they're both tricky and slow, because they interact very badly with the CPU's cache.

Your second point is exactly correct: in my opinion, it's much better to allocate the memory. For truly huge arrays you'd start mmapping the arrays to swap, and in that case this algorithm would be a lot better than the one based on reversals. So the only situation in which the reversal one would be better is when your array is between 0.501 and 0.99 of your available RAM. In that case, just go buy more RAM 😄
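
For comparison, the reversal-based in-place approach mentioned in this thread can be sketched in 1d as below. This is only an illustration of the general technique, not code from this PR or from #16032; the name `circshift_reversal!` is hypothetical. It shows why the in-place route costs two passes: one full reversal, then two partial reversals that together touch every element again.

```julia
# Hypothetical 1d sketch of an in-place circular shift via reversals.
# Assumes a 1-based Vector; uses only Base's reverse!(v, start, stop).
function circshift_reversal!(v::Vector, k::Integer)
    n = length(v)
    n == 0 && return v
    s = mod(k, n)          # normalize the shift to 0:n-1
    s == 0 && return v
    reverse!(v)            # pass 1: reverse the whole vector
    reverse!(v, 1, s)      # pass 2, part a: restore order of the wrapped-around head
    reverse!(v, s + 1, n)  # pass 2, part b: restore order of the remaining tail
    return v
end
```

Using this as the basis for an out-of-place `circshift` would add a third pass for the initial copy, which is the cost discussed above.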

I'm trying to figure out the code.

The way it works is like this: if you're doing a circshift! on

A B
C D

to

D C
B A

it does it using a recursive copy! algorithm. Conceptually, it's like this (callcopy! is essentially _circshift!):

callcopy!(dest, src, ("first half of dim1",)) -->
    callcopy!(dest, src, ("first half of dim1", "first half of dim2"))
    callcopy!(dest, src, ("first half of dim1", "second half of dim2"))
callcopy!(dest, src, ("second half of dim1",))-->
    callcopy!(dest, src, ("second half of dim1", "first half of dim2"))
    callcopy!(dest, src, ("second half of dim1", "second half of dim2"))

The "tuples" here are index ranges. Once the indices are specified for all the dimensions, callcopy! just calls copy!. So it copies the array in 2^k blocks (where k is the number of dimensions), using a twofold bifurcation for each dimension of the array.
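
The recursion outlined above could be sketched along these lines. This is an illustrative reconstruction written for current Julia syntax (`where` clauses, `copyto!` instead of `copy!`), not the code merged in this PR; `blockshift!` and `_blockshift!` are hypothetical names.

```julia
# Hypothetical sketch: circular shift by splitting each dimension into two ranges.
function blockshift!(dest::AbstractArray{T,N}, src::AbstractArray{T,N},
                     shifts::NTuple{N,Int}) where {T,N}
    dest === src && throw(ArgumentError("dest and src must be separate arrays"))
    size(dest) == size(src) || throw(DimensionMismatch("dest and src must have the same size"))
    _blockshift!(dest, (), src, (), size(src), shifts)
    return dest
end

# Base case: every dimension has a range assigned, so copy one rectangular block.
function _blockshift!(dest, rdest, src, rsrc, ::Tuple{}, ::Tuple{})
    copyto!(dest, CartesianIndices(rdest), src, CartesianIndices(rsrc))
    return nothing
end

# Recursive case: bifurcate the current dimension, then recurse into the rest.
function _blockshift!(dest, rdest, src, rsrc, sz::Tuple, shifts::Tuple)
    n = first(sz)
    n == 0 && return nothing
    s = mod(first(shifts), n)   # one mod per dimension, not one per element
    # src[1:n-s] lands in dest[s+1:n]; src[n-s+1:n] wraps around to dest[1:s]
    for (rd, rs) in ((s+1:n, 1:n-s), (1:s, n-s+1:n))
        isempty(rd) && continue  # a zero shift leaves the wrap-around block empty
        _blockshift!(dest, (rdest..., rd), src, (rsrc..., rs),
                     Base.tail(sz), Base.tail(shifts))
    end
    return nothing
end
```

For a 2d array this issues at most four block copies, matching the A/B/C/D picture above.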

The old implementation of circshift wasn't type-stable, and that's part of why it was slow. It also allocated one index vector per dimension, using mod to handle the wrap-around, and that turns out to be considerably slower (mod is slow, and the allocation is bad for performance). By splitting the full range into two ranges, we avoid both of these problems.
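
To make the contrast concrete, the index-vector style described above might be approximated as follows. This is a reconstruction from the description, not the actual pre-PR code; `circshift_indexvec` is a hypothetical name.

```julia
# Hypothetical reconstruction of the older style: build one wrapped index
# vector per dimension, then index the array with them.
function circshift_indexvec(a::AbstractArray{T,N}, shifts::NTuple{N,Int}) where {T,N}
    # One allocation per dimension and one mod1 call per index along it
    I = ntuple(d -> [mod1(i - shifts[d], size(a, d)) for i in 1:size(a, d)], N)
    return a[I...]   # indexing with vectors allocates the result as well
end
```

It produces the same result as `circshift`, but pays for the per-dimension allocations and the per-index `mod1`, which the two-range split avoids.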

See also `circshift`.
"""
@noinline function circshift!{T,N}(dest::AbstractArray{T,N}, src, shiftamt::DimsInteger)
dest === src && throw(ArgumentError("dest and src must be separate arrays"))
Member

The docs should probably note that dest should not alias src.

Member Author

Fixed.

@timholy
Member Author

timholy commented Aug 8, 2016

I'll merge this by the end of the day; it will be useful for my upcoming FFT fixes (#17896).

@timholy timholy merged commit b378ece into master Aug 9, 2016
@timholy timholy deleted the teh/circshift branch August 9, 2016 09:46
@timholy
Member Author

timholy commented Aug 9, 2016

@tkelman, do you want me to prepare a backport version that leaves out the new export, or will you just delete that line yourself?

@tkelman
Contributor

tkelman commented Aug 9, 2016

I'll delete that line. Would tests or anything else need to change?

@timholy
Member Author

timholy commented Aug 9, 2016

Thanks. No changes needed; I anticipated this and added scoping to the ones that would otherwise have had to change.

end
a[(I::NTuple{N,Vector{Int}})...]
function circshift(a::AbstractArray, shiftamt)
circshift!(similar(a), a, map(Integer, (shiftamt...,)))
Member

The last time I tried this (in 1d), it was faster to do the out-of-place circshift directly. The in-place variant requires an extra pass over the array.

Member

Oh, never mind, I see that circshift! is not actually in-place, because you pass both the source and destination arrays.

Member

map(Int, (shiftamt...,))? There doesn't seem much point in using Integer here.

Member Author

extra pass

2 extra passes, actually, since the truly inplace version needs two passes and then there's the copy for the original array.

There doesn't seem much point in using Integer here.

I was just keeping consistency with the previous version, but I'm happy to change it---perhaps as part of a final resolution to the inconsistencies noted in #17567?

Member

The function can still accept Integer, but there seems no point in passing Integer when casting to Int should be fine. So this seems independent of #17567
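
As a small illustration of the suggestion above, the shift amounts can be accepted as any `Integer` values yet normalized to a tuple of `Int` before being passed on. `normalize_shifts` is a hypothetical helper name, used here only to isolate the expression under discussion.

```julia
# Hypothetical helper isolating the expression discussed in this thread:
# splat the shift amounts into a tuple, then convert each entry to Int.
normalize_shifts(shiftamt) = map(Int, (shiftamt...,))

normalize_shifts(3)                  # a single integer becomes a 1-tuple
normalize_shifts((1, Int8(-1)))      # mixed integer types become Ints
normalize_shifts(UInt8[2, 4])        # vectors of integers work as well
```

The caller-facing signature stays permissive, while everything downstream sees a concrete `NTuple{N,Int}`.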

@tkelman
Contributor

tkelman commented Aug 9, 2016

Ah, I'll also want to delete it from the rst manual.

edit: or add Base. maybe

tkelman pushed a commit that referenced this pull request Aug 11, 2016
Labels
potential benchmark Could make a good benchmark in BaseBenchmarks
Development

Successfully merging this pull request may close these issues.

circshift is not performant
Implement inplace circshift
5 participants