[WIP] Adding `unique!` functionality #20576

kvmanohar22 · 2017-02-11T21:24:37Z

Ref #20549.
Adding a naive implementation of the unique! function; it will be optimized later.

ararslan · 2017-02-11T22:04:36Z

base/set.jl

+Removes recurrences of items and returns the modified collection
+"""
+function unique!(itr)
+	seen = Set{eltype(itr)}()


Please use 4 spaces rather than tabs for indentation for consistency with the rest of the repo.

I should have noticed that; I will update it in the next commit.

kvmanohar22 · 2017-02-12T20:13:29Z

Tweaked the method a bit. Deleting elements directly from itr (as in the first implementation) was giving a BoundsError, so I'm currently storing the indices of all duplicates in a vector and deleting them from the original itr at the end.
Changed the tabs to spaces.

ararslan · 2017-02-12T20:22:55Z

base/set.jl

-		end
-	end
-	return itr
+   T = eltype(itr)


Please use 4 spaces for indentation instead of 3 for consistency with the rest of the repo.

Do I need to make any changes to the method I have implemented?
I will add test cases soon and fix the indentation in the next commit.

Tetralux

This could use some optimization.

julia> A = [rand(1:10) for _ in 1:100_000];

julia> unique!(Int[]); unique(Int[]); @time(1) # JIT warmup

julia> @time unique(A);
  0.002214 seconds (13 allocations: 1008 bytes)

julia> @time unique!(A); # from this PR
  2.650326 seconds (26 allocations: 2.001 MiB)

One reason for this is probably that deleteat! shifts the remaining elements down by one to overwrite the element you want to delete.
You don't want that shift to happen often, so deleteat! should be called as few times as possible, passing it a range of indices to delete as the second argument: deleteat!(itr, 1:5), for example.
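To illustrate the point about shifting, here is a minimal sketch (not code from this PR) contrasting one deleteat! call per index with a single batched call. The helper names `delete_one_by_one!` and `delete_batched!` are hypothetical, and both assume the indices are sorted and unique.

```julia
# Hypothetical helpers, written only to illustrate the difference.

# Deletes each index with its own call; every call shifts the tail of `v`.
function delete_one_by_one!(v::Vector, inds::Vector{Int})
    # Go from the back so earlier indices stay valid after each deletion.
    for i in reverse(inds)
        deleteat!(v, i)
    end
    return v
end

# Deletes all indices in one pass; `inds` must be sorted and unique.
delete_batched!(v::Vector, inds::Vector{Int}) = deleteat!(v, inds)

v1 = collect(1:10)
v2 = collect(1:10)
inds = [2, 3, 7]

delete_one_by_one!(v1, inds)
delete_batched!(v2, inds)

# A contiguous run can also be removed with a range.
v3 = deleteat!(collect(1:10), 1:5)
```

Both helpers leave the same contents behind; the batched form just moves each surviving element at most once.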

ararslan · 2017-02-12T22:02:15Z

Note that functions that modify their arguments only make sense for mutable objects. ~~so a check on isimmutable may be in order.~~ Or the signature could be restricted to Arrays, which are guaranteed to be mutable.

This should be somewhat more efficient. Note that the deletion happens all at once at the end using a single call to deleteat!.

function unique!(itr)
    seen = Set{eltype(itr)}()
    index_iterator = eachindex(itr)
    inds = Set{eltype(index_iterator)}()
    @inbounds for i in index_iterator
        x = itr[i]
        x in seen ? push!(inds, i) : push!(seen, x)
    end
    deleteat!(itr, inds)
    return itr
end

Edit: Updated after a discussion with @mbauman.

fredrikekre · 2017-02-12T22:45:25Z

Unfortunately ~~deleteat! does not accept a Set~~ that gives an error:

julia> unique!(A)
ERROR: ArgumentError: indices must be unique and sorted
Stacktrace:
 [1] _deleteat!(::Array{Int64,1}, ::Set{Int64}) at ./array.jl:834
 [2] unique!(::Array{Int64,1}) at ./REPL[1]:9

ararslan · 2017-02-12T22:51:32Z

Yeah, just realized that. That's why I originally had IntSet, but that won't handle OffsetArrays since the set of indices may not be strictly positive.

inds = Vector{eltype(index_iterator)}()

should do it, then.
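The reason a Vector works where a Set did not: indices pushed while walking eachindex in order come out already sorted and unique, which is what deleteat! asks for. A small sketch with made-up data (the variable names are only for illustration):

```julia
A = [4, 4, 7, 4, 9, 7]

seen = Set{Int}()
dups = Int[]                 # indices of duplicates, collected in increasing order
for i in eachindex(A)
    x = A[i]
    x in seen ? push!(dups, i) : push!(seen, x)
end

sorted_ok = issorted(dups) && allunique(dups)
deleteat!(A, dups)

# Indices given in the wrong order are rejected with an ArgumentError.
rejected = try
    deleteat!([10, 20, 30, 40], [3, 1])
    false
catch err
    err isa ArgumentError
end
```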

kvmanohar22 · 2017-02-12T23:09:16Z

Yes, that works!
Here's the comparison:

with the earlier implementation

julia> A = [rand(1:10) for _ in 1:100_000];
julia> @time unique!(A);
  0.037155 seconds (6.46 k allocations: 2.266 MB)

with the proposed implementation (deleting all the indices at once)

julia> A = [rand(1:10) for _ in 1:100_000];
julia> @time unique!(A);
  0.007427 seconds (26 allocations: 2.001 MB)

Tetralux · 2017-02-12T23:23:18Z

Updated stats:

julia> A = [rand(1:10) for _ in 1:100_000];

julia> function unique!(itr)
           seen = Set{eltype(itr)}()
           index_iterator = eachindex(itr)
           inds = Vector{eltype(index_iterator)}()
           @inbounds for i in index_iterator
               x = itr[i]
               x in seen ? push!(inds, i) : push!(seen, x)
           end
           deleteat!(itr, inds)
           return itr
       end
unique! (generic function with 1 method)

julia> @time(1); unique!(Int[]); unique(Int[]); # JIT warmup
  0.000004 seconds (107 allocations: 7.266 KiB)

julia> @time unique(A); @time unique!(A);
  0.002006 seconds (13 allocations: 1008 bytes)
  0.003708 seconds (26 allocations: 2.001 MiB)

Allocations are a little high still. :-)

fredrikekre · 2017-02-12T23:25:42Z

Allocations are a little high still. :-)

Isn't that inevitable? For unique! all the duplicate indices need to be stored, but for unique they can just be forgotten.

kvmanohar22 · 2017-02-12T23:26:49Z

How about we raise an exception when we encounter immutable objects?
Something like this:

...
function unique!(itr)
    if isimmutable(itr)
        throw("Collection contains immutable objects")
    end
    T = eltype(itr)
    seen = Set{T}()
    index_iterator = eachindex(itr)
    ...

ararslan · 2017-02-12T23:30:31Z

I originally had that in my proposed implementation as well, but that doesn't quite do what we want.

julia> immutable A{T} <: AbstractVector{Int}
           data::T
       end
       Base.size(X::A) = size(X.data)
       Base.getindex(X::A, i::Int) = X.data[i]
       Base.setindex!(X::A, v, i::Int) = setindex!(X.data, v, i)

julia> x = A([1,2,3])
3-element A{Array{Int64,1}}:
 1
 2
 3

julia> isimmutable(x)
true

julia> x[2] = 0
0

julia> x
3-element A{Array{Int64,1}}:
 1
 0
 3

It's probably better to just let deleteat! fail on actually immutable inputs.

StefanKarpinski · 2017-02-12T23:37:42Z

I would write this for arrays first, make it fast and then figure out how to generalize it by looking at the kinds of operations one needs. The array version should not need to allocate except for the seen set.

StefanKarpinski · 2017-02-12T23:41:42Z

As extra credit, if the values are all in order or reverse order, no allocation is necessary at all. Since that's a fairly common case, it would be well worth having a fast path for it. It's a bit tricky to implement that switch from the fast path to the slow path though.
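One way to picture the switch between the two paths, as a rough sketch rather than the implementation that was eventually merged: check for sortedness up front, then either compact by comparing neighbours (no Set needed) or fall back to the Set-based loop. The names `unique_sketch!`, `compact_adjacent!` and `compact_with_set!` are hypothetical, and the sketch assumes a 1-based Vector whose elements can be ordered with isless.

```julia
# Fast path: in sorted input equal values are adjacent, so comparing each
# element with the last kept one is enough.
function compact_adjacent!(A::Vector)
    isempty(A) && return A
    count = 1
    for i in 2:length(A)
        if !isequal(A[i], A[count])
            count += 1
            A[count] = A[i]
        end
    end
    return resize!(A, count)
end

# Slow path: remember every value seen so far.
function compact_with_set!(A::Vector)
    seen = Set{eltype(A)}()
    count = 0
    for x in A
        if x ∉ seen
            push!(seen, x)
            count += 1
            A[count] = x
        end
    end
    return resize!(A, count)
end

# Decide once, up front, which path to take.
function unique_sketch!(A::Vector)
    if issorted(A) || issorted(A, rev=true)
        return compact_adjacent!(A)
    end
    return compact_with_set!(A)
end
```

The up-front issorted check costs an extra pass over the data, which is the price of avoiding the trickier mid-loop switch mentioned above.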

JackDevine · 2017-02-13T09:58:56Z

Hi all,

I am a little late to the party I know, but I managed to implement the function without needing to allocate besides the seen set as described by @StefanKarpinski . Here is my code:

function uniquejd!(itr)  # I used my initials to distinguish the function.
    seen = Set{eltype(itr)}()
    count = 0  # Number of unique values seen so far.
    for x in itr
        @inbounds if ∉(x, seen)
            count += 1
            push!(seen, x)
            itr[count] = x
        end
    end
    resize!(itr, count)
end

function uniqueh225!(itr)
    seen = Set{eltype(itr)}()
    index_iterator = eachindex(itr)
    inds = Vector{eltype(index_iterator)}()
    @inbounds for i in index_iterator
        x = itr[i]
        x in seen ? push!(inds, i) : push!(seen, x)
    end
    deleteat!(itr, inds)
    return itr
end

If I then do the performance tests that @h-225 did, I get the following:

julia> srand(42); A = [rand(1:10) for _ in 1:100_000];
julia> uniquejd!(Int[])
julia> @time uniquejd!(A);
 0.001507 seconds (14 allocations: 1024 bytes)
julia> srand(42); A = [rand(1:10) for _ in 1:100_000];
julia> uniquejd!(Int[])
julia> @time uniqueh225!(A);
  0.002825 seconds (31 allocations: 2.002 MB)

So that is roughly a factor-of-two speedup with far less allocation.
I am using a 2013 MBP with Julia v0.5.0.

KristofferC · 2017-02-13T10:12:17Z

Some more (maybe a bit more thorough) benchmarking:

using BenchmarkTools
# The copy is insignificant
fh22d(A) = (B = copy(A); uniqueh225!(B))
fjd(A) = (B = copy(A); uniquejd!(B))

julia> @btime fh22d($A)
  1.915 ms (24 allocations: 2.76 MiB)

julia> @btime fjd($A)
  1.276 ms (7 allocations: 781.81 KiB)

julia> @btime unique($A)
  1.128 ms (9 allocations: 848 bytes)

Tetralux · 2017-02-13T10:17:41Z

@JackDevine
Funnily enough, I was at a loose end earlier and was wondering about this.
So I also had a go. 😄

function unique!(A::DenseArray)
    seen        = Set{eltype(A)}()
    orig_length = length(A)
    offset      = 0

    for i in eachindex(A)
        if i + offset > orig_length
            break
        end

        # Copy the next unique element into the current position, and
        # keep track of how many indices we have skipped.
        while A[i + offset] in seen
            offset += 1
            if i + offset > orig_length
                @goto break_outer_loop
            end
        end
        if offset > 0
            A[i] = A[i + offset]
        end

        push!(seen, A[i])
    end
    @label break_outer_loop
    return resize!(A, orig_length - offset)
end

julia> @time Int[]
  0.000003 seconds (5 allocations: 240 bytes)
0-element Array{Int64,1}

julia> unique!(Int[]);

julia> A = [rand(1:10) for _ in 1:100_000];

julia> @time unique!(A)
  0.002092 seconds (9 allocations: 656 bytes)

pabloferz · 2017-02-13T10:44:23Z

@JackDevine, that's a nice proposal. For completeness, an equally fast version that handles more general iterables would be

function unique!(itr)
    seen = Set{eltype(itr)}()
    idxs = eachindex(itr)
    m = n = start(idxs)
    count = 0
    @inbounds while !done(idxs, n)
        i, n = next(idxs, n)
        x = itr[i]
        if x ∉ seen
            count += 1
            push!(seen, x)
            j, m = next(idxs, m)
            itr[j] = x
        end
    end
    resize!(itr, count)
end

nalimilan · 2017-02-13T15:03:16Z

base/set.jl

-	return itr
+   T = eltype(itr)
+   seen = Set{T}()
+   duplicateIndex = Vector{Int64}()


The style guide in CONTRIBUTING.md says not to use camelCase. You can just call this e.g. dups.

EDIT: Woops, that comment is outdated, carry on.

nalimilan · 2017-02-13T15:03:45Z

base/set.jl

-		end
-	end
-	return itr
+   T = eltype(itr)


No need to define T since you use it only once.

nalimilan · 2017-02-13T15:05:50Z

base/set.jl

@@ -191,6 +191,24 @@ function unique(f::Callable, C)
 end

 """
+	unique!(itr)


Removing elements is generally not possible on iterators, so better replace itr with c (for collection).

The signature has been changed, but not the signature in the docstring. The docstring should also be updated.

nalimilan · 2017-02-13T15:09:04Z

base/set.jl

@@ -191,6 +191,24 @@ function unique(f::Callable, C)
 end

 """
+	unique!(itr)
+
+Removes recurrences of items and returns the modified collection


Use imperative "Remove" and add ending dot. Also, better say "duplicated items" rather than "recurrence". Finally, would be worth saying a word about the fact that isequal is used (like for unique), and adding one or two examples (see how to do so in the Documentation section of the manual).
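A short illustration of why the isequal remark is worth a sentence in the docstring: isequal and == disagree for NaN and for signed zeros, and unique follows isequal. The values below are only examples.

```julia
# `==` says NaN differs from itself; `isequal` treats it as the same value.
nan_eq      = NaN == NaN            # false
nan_isequal = isequal(NaN, NaN)     # true

# `==` treats the two zeros as equal; `isequal` tells them apart.
zero_eq      = 0.0 == -0.0          # true
zero_isequal = isequal(0.0, -0.0)   # false

# `unique` follows `isequal`, so the NaNs collapse and the zeros do not.
n_nan   = length(unique([NaN, NaN, 1.0]))   # 2
n_zeros = length(unique([0.0, -0.0]))       # 2
```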

JackDevine · 2017-02-13T21:37:01Z

I made a pull request based on the suggestion from @pabloferz , I did some tests and his version was very fast, but more importantly it should generalize quite well.

On the topic of generalization, there is one thing that I noticed

function unique!(c)
    seen = Set{eltype(c)}()
    idxs = eachindex(c)
    m = n = start(idxs)
    count = 0
    @inbounds while !done(idxs, n)
        i, n = next(idxs, n)
        x = c[i]
        if x ∉ seen
            count += 1
            push!(seen, x)
            j, m = next(idxs, m)
            c[j] = x
        end
    end
    resize!(c, count)
end

s1 = "The quick brown fox jumps over the lazy dog α,β,γ"
r = eachmatch(r"[\w]{4,}", s1)

julia> unique(r)
5-element Array{RegexMatch,1}:
 RegexMatch("quick")
 RegexMatch("brown")
 RegexMatch("jumps")
 RegexMatch("over") 
 RegexMatch("lazy")
julia> unique!(r)
LoadError: MethodError: no method matching eachindex(::Base.RegexMatchIterator)
Closest candidates are:
  eachindex(!Matched::Tuple) at tuple.jl:19
  eachindex(!Matched::Tuple, !Matched::Tuple...) at tuple.jl:22
  eachindex(!Matched::AbstractArray{T,1}) at abstractarray.jl:679
  ...
while loading In[7], in expression starting on line 5

 in uniquejd!(::Base.RegexMatchIterator) at .\In[2]:3

It looks like eachindex doesn't work on the regex match iterator. I don't know how big a deal this is, but I thought it was worth letting people know.

nalimilan · 2017-02-13T22:01:29Z

In any case, you cannot modify a RegexMatchIterator in place, so you would have hit an error later anyway.

JackDevine · 2017-02-13T22:03:46Z

OK, thanks for the clarification. I guess unique has a broader scope than unique!.
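For lazy iterators such as the one returned by eachmatch, a workable pattern is to materialize the results into a Vector first and deduplicate that in place. A minimal sketch, assuming a Julia version where unique! is available in Base (the sample sentence is made up):

```julia
s = "quick brown quick lazy brown fox"

# Materialize the matched substrings; the iterator itself cannot be mutated.
words = [m.match for m in eachmatch(r"\w{4,}", s)]

unique!(words)   # works because `words` is a resizable Vector
```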

ararslan · 2017-02-14T15:44:03Z

test/sets.jl

+@test in(1,u)
+@test in(2,u)
+@test in(3,u)
+


This just needs to test the actual value of u. That ensures that the length and the order of occurrence are correct. In doing so I recommend swapping 3 and 2 in u to make it clearer. Perhaps something like

@testset "unique!" begin
    u = [1, 1, 3, 2, 1]
    unique!(u)
    @test u == [1, 3, 2]
end

Currently these tests don't test that the order of occurrence of the unique values is preserved.

ararslan · 2017-02-14T16:32:08Z

test/sets.jl

@@ -216,6 +216,13 @@ u = unique([1,1,2])
 @test @inferred(unique(x for x in 1:1)) == [1]
 @test unique(x for x in Any[1,1.0])::Vector{Real} == [1]

+# unique!
+@testset "unique!" begin 


There's a trailing space after begin, which is causing make check-whitespace to fail.

(also the comment above is redundant with the name of the test set)

I already updated the commit :P

ararslan · 2017-02-14T16:50:58Z

base/set.jl

+1-element Array{Int64,1}:
+ 1
+
+julia>  unique!([7, 3, 2, 3, 7, 5])


Really pedantic, but there's an extra space here between julia> and unique!.

Yeah, thanks for noticing that, I just copied some of the examples from before.

fredrikekre

Shouldn't we use the implementation suggested by @JackDevine/@pabloferz which proved to be more efficient?

fredrikekre · 2017-02-14T16:54:45Z

base/set.jl

+1-element Array{Int64,1}:
+ 1
+
+julia>  unique!([7, 3, 2, 3, 7, 5])


I think a better example would be:

julia> A = [7, 3, 2, 3, 7, 5];

julia> unique!(A);

julia> A
4-element Array{Int64,1}:
 7
 3
 2
 5

This shows that A itself is changed. Otherwise the example would look exactly like what unique does.

Right, I thought that we were using the version with less allocation, that was what my pull request was for. This is my first time contributing so I think that I may have done some things wrong with my pull request and stepped slightly outside my skill range.

Would you say that the right thing to do is to just make a new pull request with the Jack/Pablo version of unique! with the updated documentation and adding unique! to base/exports.jl?

ararslan · 2017-02-14T20:49:08Z

unique! also needs to be added to base/exports.jl. The tests are failing because the function isn't exported.

StefanKarpinski · 2017-02-15T20:39:49Z

I don't think there's a strong case to be made for supporting unique! on non-arrays. You basically need sequential integer indexing to make this work. Let's please start with that and generalize it when we have actual cases where it makes sense beyond that scope.

JackDevine · 2017-02-15T22:54:38Z

I made a new pull request with an updated version of unique! that also adds unique! to base/exports.jl. After making the changes I built Julia on my machine and it didn't complain, so I think things are hunky-dory. The latest version also adds the fast path for sorted data that @StefanKarpinski mentioned. The way I handle sorted data does not actually require the data to be sorted, only grouped. What I mean is that you don't need the data to look like
[1, 1, 2, 2, 2, 3, 3]
to do things efficiently; you only need something like
[1, 1, 3, 3, 2, 2, 2].
This is a slightly weaker condition, so the fast path may be available to more collections than we had thought. There seems to be some contention over which version of unique! we are using and whether we should apply it to general collections straight away. My pull request does deal with collections as long as they can be sorted.
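To make the grouped-versus-sorted distinction concrete, here is a small sketch of neighbour-based compaction (the helper name `dedupe_adjacent!` is hypothetical, and this is not the code from the follow-up pull request). It gives the right answer whenever equal values sit next to each other, sorted or not, and it is only valid under that assumption.

```julia
function dedupe_adjacent!(A::Vector)
    isempty(A) && return A
    count = 1                       # position of the last kept element
    for i in 2:length(A)
        if !isequal(A[i], A[count])
            count += 1
            A[count] = A[i]
        end
    end
    return resize!(A, count)
end

sorted_input    = dedupe_adjacent!([1, 1, 2, 2, 2, 3, 3])
grouped_input   = dedupe_adjacent!([1, 1, 3, 3, 2, 2, 2])
ungrouped_input = dedupe_adjacent!([1, 2, 1])   # the repeated 1 survives
```

The last call shows the failure mode: when duplicates are not adjacent, this approach silently keeps them, so a caller needs some guarantee (or a check) that the input is grouped.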

nalimilan · 2017-06-09T08:15:33Z

Closing in favor of #20619.

Adding unique! functionality

efa33b7

ararslan added the "collections" and "needs tests" labels Feb 11, 2017

ararslan reviewed Feb 11, 2017

modified the method a bit

e973e8b

ararslan reviewed Feb 12, 2017

Tetralux suggested changes Feb 12, 2017

More efficient method to delete duplicate indices

97b4797

nalimilan reviewed Feb 13, 2017

Adding doc examples and small changes

905e877

Adding test cases

892552b

ararslan requested changes Feb 14, 2017

update tests

95e4c19

ararslan reviewed Feb 14, 2017

remove trailing whitespace

1183329

ararslan reviewed Feb 14, 2017

fredrikekre reviewed Feb 14, 2017

adding unique! to exports

9ff810c

fredrikekre mentioned this pull request Feb 19, 2017

Fix sparse setindex issue #20677

Merged

nalimilan closed this Jun 9, 2017

JackDevine mentioned this pull request Jun 9, 2017

Add unique! #20619

Merged

kvmanohar22 deleted the Unique branch June 9, 2017 23:00
