Faster, indices-aware circshift (and non-allocating circshift!) #17861

timholy · 2016-08-06T03:23:01Z

On the benchmark in #17581, this is about 6x faster. On large vectors, this is also about 25% faster than the truly inplace 1d implementation of circshift! in #16032.

Fixes #16032, fixes #17581.

With regard to backporting to 0.5, perhaps the only question is whether we should export circshift! (if we're serious about "feature freeze").

tkelman · 2016-08-06T07:02:35Z

> (if we're serious about "feature freeze").

No new exports on the release branch.

timholy · 2016-08-07T20:48:45Z

Updated to guard against aliasing.

While I state that this closes #16032, I should clarify that it's a bit different from what @musm was requesting. The reason I implemented this with separate src and dest arrays is that, even if the 2k->2 fusion that @stevengj mentioned is doable (and I bet it is), using such an implementation as the foundation for the out-of-place algorithm would require 3 full passes through the array: one to make a copy and then 2 (at least) for the inplace algorithm. In contrast, this implementation accomplishes it with a single pass, and is (unsurprisingly) faster. Those who want a nonallocating algorithm can preallocate the output, and swap buffers if they are doing this repeatedly in a loop. (Not that anything prevents anyone from also implementing the inplace version if they want to.)
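A minimal sketch of the preallocate-and-swap pattern described above. It assumes a recent Julia (1.x), where `circshift!(dest, src, shifts)` is exported from Base (on the 0.5 release branch it would need to be called as `Base.circshift!`). The function name `shift_repeatedly` is hypothetical and is not part of this PR.

```julia
# Hypothetical helper: apply the same circular shift `nsteps` times
# using two buffers that trade roles on every iteration.
function shift_repeatedly(A::AbstractMatrix, shifts::Tuple{Int,Int}, nsteps::Int)
    src = copy(A)       # working copy, so the caller's array is left untouched
    dest = similar(A)   # allocated once, outside the loop
    for _ in 1:nsteps
        circshift!(dest, src, shifts)  # one pass from src into dest
        src, dest = dest, src          # swap the bindings; no data is copied
    end
    return src          # after the swap, `src` holds the most recent result
end

A = reshape(collect(1:12), 3, 4)
B = shift_repeatedly(A, (1, 1), 3)
```

Because `dest` and `src` are always distinct arrays here, the aliasing guard is never triggered.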

musm · 2016-08-07T21:26:42Z

Thanks for this. I'm trying to figure out the code. To clarify, is this using the "chase-the-cycles" approach? And your suggestion is that a truly in-place version is possible following stevengj's suggestion of the "two reversal passes" algorithm, but your point here, "3 full passes through the array: one to make a copy and then 2 (at least) for the inplace algorithm", means there is no benefit to the truly in-place algorithm unless the arrays are so large that creating another copy should be avoided at all costs?

timholy · 2016-08-07T22:11:45Z

"chase the cycles" means something different: it's one implementation (a less efficient one) of a truly inplace version. https://en.wikipedia.org/wiki/In-place_matrix_transposition describes a chase-the-cycles algorithm for inplace transposition, and that should give you the basic idea. chase-the-cycles algorithms seem elegant at first blush, but they're both tricky and slow, because they interact very badly with the CPU's cache.

Your second point is exactly correct: in my opinion, it's much better to allocate the memory. For truly huge arrays you'd start mmapping the arrays to swap, and in that case this algorithm would be a lot better than the one based on reversals. So the only situation in which the reversal one would be better is when your array is between 0.501 and 0.99 of your available RAM. In that case, just go buy more RAM 😄
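For reference, here is a sketch of the classic reversal-based rotation for the 1-d case, which may be what the "two reversal passes" algorithm refers to. This is an assumption on my part and is not code from this PR or from #16032. The name `reversal_circshift!` is hypothetical. Reversing the whole vector touches every element once, and the two partial reversals together touch every element once more, hence two passes.

```julia
# Hypothetical in-place 1-d circular shift built from reversals.
# Assumes a 1-based Vector.
function reversal_circshift!(v::Vector, shift::Integer)
    n = length(v)
    n == 0 && return v
    s = mod(shift, n)       # normalize the shift into 0:n-1
    s == 0 && return v
    reverse!(v)             # pass 1: reverse the whole vector
    reverse!(v, 1, s)       # pass 2a: restore the order of the first s elements
    reverse!(v, s + 1, n)   # pass 2b: restore the order of the remaining elements
    return v
end

v = collect(1:7)
reversal_circshift!(v, 3)
```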

> I'm trying to figure out the code.

The way it works is like this: if you're doing a circshift! on

A B
C D

to

D C
B A

it does it using a recursive copy! algorithm. Conceptually, it's like this (callcopy! is essentially _circshift!):

callcopy!(dest, src, ("first half of dim1",)) -->
    callcopy!(dest, src, ("first half of dim1", "first half of dim2"))
    callcopy!(dest, src, ("first half of dim1", "second half of dim2"))
callcopy!(dest, src, ("second half of dim1",))-->
    callcopy!(dest, src, ("second half of dim1", "first half of dim2"))
    callcopy!(dest, src, ("second half of dim1", "second half of dim2"))

The "tuples" here are index ranges. Once the indices are specified for all the dimensions, callcopy! just calls copy!. So it copies the array in 2^k blocks (where k is the number of dimensions), using a twofold bifurcation for each dimension of the array.
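The block decomposition can be sketched as follows. This is an illustrative, non-recursive version written for a recent Julia (1.x), not the `_circshift!` from this PR: the name `blockshift!` is hypothetical, it assumes 1-based arrays of equal size, and it makes no attempt at the type-stability or performance of the real implementation.

```julia
# Hypothetical sketch: split each dimension into two contiguous pieces and
# copy every combination of pieces as one block (2^N blocks in total).
function blockshift!(dest::AbstractArray{T,N}, src::AbstractArray{T,N},
                     shifts::NTuple{N,Integer}) where {T,N}
    dest === src && throw(ArgumentError("dest and src must be separate arrays"))
    size(dest) == size(src) || throw(DimensionMismatch("dest and src must have the same size"))
    # Per dimension: two (source range, destination range) pairs.
    pieces = map(size(src), shifts) do n, shift
        s = n == 0 ? 0 : mod(Int(shift), n)
        ((1:n-s, s+1:n), (n-s+1:n, 1:s))
    end
    for block in Iterators.product(pieces...)
        srcranges = map(first, block)   # one source range per dimension
        dstranges = map(last, block)    # the matching destination ranges
        view(dest, dstranges...) .= view(src, srcranges...)
    end
    return dest
end

A = reshape(collect(1:24), 2, 3, 4)
B = blockshift!(similar(A), A, (1, -1, 6))
```

When a shift is zero, one of the two pieces in that dimension is empty and the corresponding block copies are no-ops.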

The old implementation of circshift wasn't type-stable, and that's part of why it was slow. It also allocated one index vector per dimension, using mod to handle the wrap-around, and that turns out to be considerably slower (mod is slow, and the allocation is bad for performance). By splitting the full range into two ranges, we avoid both of these problems.
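For contrast, the index-vector strategy described above looks roughly like this. It is a reconstruction from the description and the removed `a[(I::NTuple{N,Vector{Int}})...]` line, not the old Base code verbatim. The name `modindex_circshift` is hypothetical, and the sketch assumes a recent Julia (1.x) and 1-based arrays.

```julia
# Hypothetical sketch of the older approach: build one wrapped index vector
# per dimension with `mod`, then index the array with all of them at once.
function modindex_circshift(a::AbstractArray{T,N}, shifts::NTuple{N,Integer}) where {T,N}
    I = ntuple(N) do d
        n = size(a, d)
        # Output position i reads from source position i - shift, wrapped into 1:n.
        [mod(i - 1 - shifts[d], n) + 1 for i in 1:n]
    end
    return a[I...]   # allocates the index vectors as well as the result
end

A = reshape(collect(1:12), 3, 4)
B = modindex_circshift(A, (1, -2))
```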

simonster · 2016-08-08T02:05:57Z

base/multidimensional.jl

+See also `circshift`.
+"""
+@noinline function circshift!{T,N}(dest::AbstractArray{T,N}, src, shiftamt::DimsInteger)
+    dest === src && throw(ArgumentError("dest and src must be separate arrays"))


The docs should probably note that dest should not alias src.


timholy · 2016-08-08T17:34:36Z

I'll merge this by the end of the day; it will be useful for my upcoming FFT fixes (#17896).

timholy · 2016-08-09T09:47:48Z

@tkelman, do you want me to prepare a backport version that leaves out the new export, or will you just delete that line yourself?

tkelman · 2016-08-09T09:55:01Z

I'll delete that line. Would tests or anything else need to change?

timholy · 2016-08-09T14:56:57Z

Thanks. No changes needed; I anticipated this and added scoping to the ones that would otherwise have had to change.

stevengj · 2016-08-09T15:23:47Z

base/abstractarraymath.jl

-    end
-    a[(I::NTuple{N,Vector{Int}})...]
+function circshift(a::AbstractArray, shiftamt)
+    circshift!(similar(a), a, map(Integer, (shiftamt...,)))


The last time I tried this (in 1d), it was faster to do the out-of-place circshift directly. The in-place variant requires an extra pass over the array.

Oh, nevermind, I see that circshift! is not actually in-place, because you pass both the source and destination arrays.

map(Int, (shiftamt...,))? There doesn't seem much point in using Integer here.

> extra pass

2 extra passes, actually, since the truly inplace version needs two passes and then there's the copy for the original array.

> There doesn't seem much point in using Integer here.

I was just keeping consistency with the previous version, but I'm happy to change it---perhaps as part of a final resolution to the inconsistencies noted in #17567?

The function can still accept Integer, but there seems no point in passing Integer when casting to Int should be fine. So this seems independent of #17567
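A small illustration of the difference between the two conversions discussed in this review thread, written for a recent Julia (1.x). The helper names `normalize_int` and `normalize_integer` are hypothetical. `Integer(x)` leaves an integer argument's concrete type unchanged, whereas `Int(x)` converts every element to the same type.

```julia
# Hypothetical helpers mirroring the two spellings under discussion.
normalize_int(shiftamt) = map(Int, (shiftamt...,))
normalize_integer(shiftamt) = map(Integer, (shiftamt...,))

mixed = (0x01, 2)                  # a UInt8 and an Int
as_int = normalize_int(mixed)      # every element becomes an Int
as_integer = normalize_integer(mixed)  # element types are preserved
```

Both spellings accept a scalar, a tuple, or a vector as `shiftamt`, because the splat `(shiftamt...,)` turns any of them into a tuple.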

tkelman · 2016-08-09T21:47:00Z

Ah, I'll also want to delete it from the rst manual.

edit: or add Base. maybe


timholy added the backport pending 0.5 label Aug 6, 2016

tkelman added the potential benchmark Could make a good benchmark in BaseBenchmarks label Aug 6, 2016

timholy force-pushed the teh/circshift branch 2 times, most recently from 808c2cb to 942dc59 Compare August 7, 2016 20:44

timholy force-pushed the teh/circshift branch from 942dc59 to 6125381 Compare August 7, 2016 22:07

simonster reviewed Aug 8, 2016

Faster, indices-aware circshift (and non-allocating circshift!)

60660b5

Fixes #16032, fixes #17581

timholy force-pushed the teh/circshift branch from 6125381 to 60660b5 Compare August 8, 2016 10:10

timholy merged commit b378ece into master Aug 9, 2016

timholy deleted the teh/circshift branch August 9, 2016 09:46

stevengj reviewed Aug 9, 2016

tkelman pushed a commit that referenced this pull request Aug 11, 2016

Faster, indices-aware circshift (and non-allocating circshift!)

8a1cf9b

Fixes #16032, fixes #17581 (cherry picked from commit 60660b5) ref #17861

tkelman added a commit that referenced this pull request Aug 11, 2016

Remove export and rst docs for circshift! on release-0.5

d02d911

ref #17861 (comment)

tkelman removed the backport pending 0.5 label Aug 12, 2016
