Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Adding unique! functionality #20576

Closed
wants to merge 8 commits into from

Conversation

kvmanohar22
Copy link
Contributor

@kvmanohar22 kvmanohar22 commented Feb 11, 2017

In Ref #20549
Adding a naive implementation to unique! function, will be optimized

@ararslan ararslan added collections Data structures holding multiple items, e.g. sets needs tests Unit tests are required for this change labels Feb 11, 2017
base/set.jl Outdated
Removes recurrences of items and returns the modified collection
"""
function unique!(itr)
seen = Set{eltype(itr)}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use 4 spaces rather than tabs for indentation for consistency with the rest of the repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have noted that, will update it in the next commit

@kvmanohar22
Copy link
Contributor Author

Tweaked the method a bit, directly deleting the elements from itr (as in first implementation) was giving BoundsError. So, currently I'm storing all the duplicateIndices in a vector and finally deleting the duplicates from original itr
Changed the tabs to spaces.

base/set.jl Outdated
end
end
return itr
T = eltype(itr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use 4 spaces for indentation instead of 3 for consistency with the rest of the repo.

Copy link
Contributor Author

@kvmanohar22 kvmanohar22 Feb 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I need to make any changes to the method I have implemented ?
I will add test cases soon. Will update the indentation in the next commit

Copy link
Contributor

@Tetralux Tetralux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use some optimization.

julia> A = [rand(1:10) for _ in 1:100_000];

julia> unique!(Int[]); unique(Int[]); @time(1) # JIT warmup

julia> @time unique(A);
  0.002214 seconds (13 allocations: 1008 bytes)

julia> @time unique!(A); # from this PR
  2.650326 seconds (26 allocations: 2.001 MiB)

One reason for this is probably since deleteat! shifts the remaining elements by one to overwrite the element you wish to delete.
You don't want this shift to occur a lot, so it should be called as little as possible, and provide it a range to delete as the second parameter: deleteat!(itr, 1:5), for example.

@ararslan
Copy link
Member

ararslan commented Feb 12, 2017

Note that functions that modify their arguments only make sense for mutable objects. so a check on isimmutable may be in order. Or the signature could be restricted to Arrays, which are guaranteed to be mutable.

This should be somewhat more efficient. Note that the deletion happens all at once at the end using a single call to deleteat!.

function unique!(itr)
    seen = Set{eltype(itr)}()
    index_iterator = eachindex(itr)
    inds = Set{eltype(index_iterator)}()
    @inbounds for i in index_iterator
        x = itr[i]
        x in seen ? push!(inds, i) : push!(seen, x)
    end
    deleteat!(itr, inds)
    return itr
end

Edit: Updated after a discussion with @mbauman.

@fredrikekre
Copy link
Member

fredrikekre commented Feb 12, 2017

Unfortunately deleteat! does not accept a Set that gives an error:

julia> unique!(A)
ERROR: ArgumentError: indices must be unique and sorted
Stacktrace:
 [1] _deleteat!(::Array{Int64,1}, ::Set{Int64}) at ./array.jl:834
 [2] unique!(::Array{Int64,1}) at ./REPL[1]:9

@ararslan
Copy link
Member

Yeah, just realized that. That's why I originally had IntSet, but that won't handle OffsetArrays since the set of indices may not be strictly positive.

inds = Vector{eltype(index_iterator)}()

should do it, then.

@kvmanohar22
Copy link
Contributor Author

kvmanohar22 commented Feb 12, 2017

Yes, that works !
Here's the comparison

with the earlier implementation

julia> A = [rand(1:10) for _ in 1:100_000];
julia> @time unique!(A);
  0.037155 seconds (6.46 k allocations: 2.266 MB)

with the proposed implementation (deleting all the indices at once)

julia> A = [rand(1:10) for _ in 1:100_000];
julia> @time unique!(A);
  0.007427 seconds (26 allocations: 2.001 MB)

@Tetralux
Copy link
Contributor

Updated stats:

julia> A = [rand(1:10) for _ in 1:100_000];

julia> function unique!(itr)
           seen = Set{eltype(itr)}()
           index_iterator = eachindex(itr)
           inds = Vector{eltype(index_iterator)}()
           @inbounds for i in index_iterator
               x = itr[i]
               x in seen ? push!(inds, i) : push!(seen, x)
           end
           deleteat!(itr, inds)
           return itr
       end
unique! (generic function with 1 method)

julia> @time(1); unique!(Int[]); unique(Int[]); # JIT warmup
  0.000004 seconds (107 allocations: 7.266 KiB)

julia> @time unique(A); @time unique!(A);
  0.002006 seconds (13 allocations: 1008 bytes)
  0.003708 seconds (26 allocations: 2.001 MiB)

Allocations are a little high still. :-)

@fredrikekre
Copy link
Member

Allocations are a little high still. :-)

Isn't that inevitable? For unique! all the duplicate indices need to be stored, but for unique they can just be forgotten.

@kvmanohar22
Copy link
Contributor Author

kvmanohar22 commented Feb 12, 2017

How about we raise an exception when we encounter immutable objects ?
Something like this

...
 function unique!(itr)
    if isimmutable(itr)
        throw("Collection contains immutable objects")
    T = eltype(itr)
    seen = Set{T}()
    index_iterator = eachindex(itr)
    ...

@ararslan
Copy link
Member

I originally had that in my proposed implementation as well, but that doesn't quite do what we want.

julia> immutable A{T} <: AbstractVector{Int}
           data::T
       end
       Base.size(X::A) = size(X.data)
       Base.getindex(X::A, i::Int) = X.data[i]
       Base.setindex!(X::A, v, i::Int) = setindex!(X.data, v, i)

julia> x = A([1,2,3])
3-element A{Array{Int64,1}}:
 1
 2
 3

julia> isimmutable(x)
true

julia> x[2] = 0
0

julia> x
3-element A{Array{Int64,1}}:
 1
 0
 3

It's probably better to just let deleteat! fail on actually immutable inputs.

@StefanKarpinski
Copy link
Member

I would write this for arrays first, make it fast and then figure out how to generalize it by looking at the kinds of operations one needs. The array version should not need to allocate except for the seen set.

@StefanKarpinski
Copy link
Member

As extra credit, if the values are all in order or reverse order, no allocation is necessary at all. Since that's a fairly common case, it would be well worth having a fast path for it. It's a bit tricky to implement that switch from the fast path to the slow path though.

@JackDevine
Copy link
Contributor

Hi all,

I am a little late to the party I know, but I managed to implement the function without needing to allocate besides the seen set as described by @StefanKarpinski . Here is my code:

function uniquejd!(itr)  # I used my initials to distinguish the function.
    seen = Set{eltype(itr)}()
    count = 0  # Number of unique values seen so far.
    for x in itr
        @inbounds if (x, seen)
            count += 1            
            push!(seen, x)
            itr[count] = x
        end
    end
    resize!(itr, count)
end

function uniqueh225!(itr)
    seen = Set{eltype(itr)}()
    index_iterator = eachindex(itr)
    inds = Vector{eltype(index_iterator)}()
    @inbounds for i in index_iterator
       x = itr[i]
       x in seen ? push!(inds, i) : push!(seen, x)
       end
    deleteat!(itr, inds)
    return itr
end

If I then do the performance tests that @h-225 did, I get the following:

julia> srand(42); A = [rand(1:10) for _ in 1:100_000];
julia> uniquejd!(Int[])
julia> @time uniquejd!(A);
 0.001507 seconds (14 allocations: 1024 bytes)
julia> srand(42); A = [rand(1:10) for _ in 1:100_000];
julia> uniquejd!(Int[])
julia> @time uniqueh225!(A);
  0.002825 seconds (31 allocations: 2.002 MB)

So roughly a factor of two speed up and much less allocation.
I am using a 2013 MBP with Julia v0.5.0

@KristofferC
Copy link
Member

KristofferC commented Feb 13, 2017

Some more (maybe a bit more thorough) benchmarking:

using BenchmarkTools
# The copy is insignificant
fh22d(A) = (B = copy(A); uniqueh225!(B))
fjd(A) = (B = copy(A); uniquejd!(B))

julia> @btime fh22d($A)
  1.915 ms (24 allocations: 2.76 MiB)

julia> @btime fjd($A)
  1.276 ms (7 allocations: 781.81 KiB)

julia> @btime unique($A)
  1.128 ms (9 allocations: 848 bytes)

@Tetralux
Copy link
Contributor

@JackDevine
Funnily enough, I was at a loose end earlier and was wondering about this.
So I also had a go. 😄

function unique!(A::DenseArray)
    seen        = Set{eltype(A)}()
    orig_length = length(A)
    offset      = 0

    for i in eachindex(A)
        if i + offset > orig_length
            break
        end

        # Copy the next unique element into the current position, and
        # keep track of how many indices we have skipped.
        while A[i + offset] in seen
            offset += 1
            if i + offset > orig_length
                @goto break_outer_loop
            end
        end
        if offset > 0
            A[i] = A[i + offset]
        end

        push!(seen, A[i])
    end
    @label break_outer_loop
    return resize!(A, orig_length - offset)
end
julia> @time Int[]
  0.000003 seconds (5 allocations: 240 bytes)
0-element Array{Int64,1}

julia> unique!(Int[]);

julia> A = [rand(1:10) for _ in 1:100_000];

julia> @time unique!(A)
  0.002092 seconds (9 allocations: 656 bytes)

@pabloferz
Copy link
Contributor

pabloferz commented Feb 13, 2017

@JackDevine, that's a nice proposal. For completeness, an equally fast version that handles more general iterables would be

function unique!(itr)
    seen = Set{eltype(itr)}()
    idxs = eachindex(itr)
    m = n = start(idxs)
    count = 0
    @inbounds while !done(idxs, n)
        i, n = next(idxs, n)
        x = itr[i]
        if x  seen
            count += 1
            push!(seen, x)
            j, m = next(idxs, m)
            itr[j] = x
        end
    end
    resize!(itr, count)
end

base/set.jl Outdated
return itr
T = eltype(itr)
seen = Set{T}()
duplicateIndex = Vector{Int64}()
Copy link
Member

@nalimilan nalimilan Feb 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style guide in CONTRIBUTING.md says not to use camelCase. You can just call this e.g. dups.

EDIT: Woops, that comment is outdated, carry on.

base/set.jl Outdated
end
end
return itr
T = eltype(itr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to define T since you use it only once.

@@ -191,6 +191,24 @@ function unique(f::Callable, C)
end

"""
unique!(itr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing elements is generally not possible on iterators, so better replace itr with c (for collection).

Copy link
Member

@ararslan ararslan Feb 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature has been changed, but not the signature in the docstring. The docstring should also be updated.

base/set.jl Outdated
@@ -191,6 +191,24 @@ function unique(f::Callable, C)
end

"""
unique!(itr)

Removes recurrences of items and returns the modified collection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use imperative "Remove" and add ending dot. Also, better say "duplicated items" rather than "recurrence". Finally, would be worth saying a word about the fact that isequal is used (like for unique), and adding one or two examples (see how to do so in the Documentation section of the manual).

@JackDevine
Copy link
Contributor

I made a pull request based on the suggestion from @pabloferz , I did some tests and his version was very fast, but more importantly it should generalize quite well.

On the topic of generalization, there is one thing that I noticed

function unique!(c)
    seen = Set{eltype(c)}()
    idxs = eachindex(c)
    m = n = start(idxs)
    count = 0
    @inbounds while !done(c, n)
        i, n = next(idxs, n)
        x = c[i]
        if x  seen
            count += 1
            push!(seen, x)
            j, m = next(idxs, m)
            c[j] = x
        end
    end
    resize!(c, count)
end

s1 = "The quick brown fox jumps over the lazy dog α,β,γ"
r = eachmatch(r"[\w]{4,}", s1)

julia> unique(r)
5-element Array{RegexMatch,1}:
 RegexMatch("quick")
 RegexMatch("brown")
 RegexMatch("jumps")
 RegexMatch("over") 
 RegexMatch("lazy")
julia> unique!(r)
LoadError: MethodError: no method matching eachindex(::Base.RegexMatchIterator)
Closest candidates are:
  eachindex(!Matched::Tuple) at tuple.jl:19
  eachindex(!Matched::Tuple, !Matched::Tuple...) at tuple.jl:22
  eachindex(!Matched::AbstractArray{T,1}) at abstractarray.jl:679
  ...
while loading In[7], in expression starting on line 5

 in uniquejd!(::Base.RegexMatchIterator) at .\In[2]:3

It looks like eachindex doesn't work on the regex match iterator, I don't know how much of a big deal this is, but I thought it was worth letting people know.

@nalimilan
Copy link
Member

Anyway you cannot modify in-place a RegexMatchIterator, so you would have obtained an error later.

@JackDevine
Copy link
Contributor

Ok thanks for that clarification, I guess that unique has a broader scope than unique!.

test/sets.jl Outdated
@test in(1,u)
@test in(2,u)
@test in(3,u)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just needs to test the actual value of u. That ensures that the length and order of occurrence is correct. In doing so I recommend swapping 3 and 2 in u to make it clearer. Perhaps something like

@testset "unique!" begin
    u = [1, 1, 3, 2, 1]
    unique!(u)
    @test u == [1, 3, 2]
end

Currently these tests don't test that the order of occurrence of the unique values is preserved.

test/sets.jl Outdated
@@ -216,6 +216,13 @@ u = unique([1,1,2])
@test @inferred(unique(x for x in 1:1)) == [1]
@test unique(x for x in Any[1,1.0])::Vector{Real} == [1]

# unique!
@testset "unique!" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a trailing space after begin, which is causing make check-whitespace to fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also the comment above is redundant with the name of the test set)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already updated the commit :P

base/set.jl Outdated
1-element Array{Int64,1}:
1

julia> unique!([7, 3, 2, 3, 7, 5])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really pedantic, but there's an extra space here between julia> and unique!.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, thanks for noticing that, I just copied some of the examples from before.

Copy link
Member

@fredrikekre fredrikekre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use the implementation suggested by @JackDevine/@pabloferz which proved to be more efficient?

base/set.jl Outdated
1-element Array{Int64,1}:
1

julia> unique!([7, 3, 2, 3, 7, 5])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better example would be:

julia>  A = [7, 3, 2, 3, 7, 5];

julia> unique!(A);

julia> A
4-element Array{Int64,1}:
 7
 3
 2
 5

To show that it is changing A. Otherwise this example would look exactly like unique behaves

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I thought that we were using the version with less allocation, that was what my pull request was for. This is my first time contributing so I think that I may have done some things wrong with my pull request and stepped slightly outside my skill range.

Would you say that the right thing to do is to just make a new pull request with the Jack/Pablo version of unique! with the updated documentation and adding unique! to base/exports.jl?

@ararslan
Copy link
Member

unique! also needs to be added to base/exports.jl. The tests are failing because the function isn't exported.

@StefanKarpinski
Copy link
Member

I don't think there's a strong case to be made for supporting unique! on non-arrays. You basically need sequential integer indexing to make this work. Let's please start with that and generalize it when we have actual cases where it makes sense beyond that scope.

@JackDevine
Copy link
Contributor

I made a new pull request with an updated version of unique! as well as adding unique! to base/exports.jl. Once I had made the changes I built Julia on my machine and it didn't complain, so I think that things are hunki dori. The latest version also adds the fast route for sorted data that @StefanKarpinski mentioned. The particular way that I dealt with sorted data does not actually require that the data is sorted, only that it is grouped. What I mean is that you don't need the data to look like
[1, 1, 2, 2, 2, 3, 3]
to do things efficiently, you only need something like
[1, 1, 3, 3, 2, 2, 2].
This is a slightly weaker condition, so the fast track may be available to more collections than we had thought. There seems to be some contention over the version of unique! that we are using and whether or not we should apply it to general collections straight away. My pull request does deal with collections as long as they can be sorted.

@nalimilan
Copy link
Member

Closing in favor of #20619.

@nalimilan nalimilan closed this Jun 9, 2017
@JackDevine JackDevine mentioned this pull request Jun 9, 2017
@kvmanohar22 kvmanohar22 deleted the Unique branch June 9, 2017 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets needs tests Unit tests are required for this change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants