-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Adding unique!
functionality
#20576
Conversation
base/set.jl
Outdated
Removes recurrences of items and returns the modified collection | ||
""" | ||
function unique!(itr) | ||
seen = Set{eltype(itr)}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use 4 spaces rather than tabs for indentation for consistency with the rest of the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should have noted that, will update it in the next commit
Tweaked the method a bit, directly deleting the elements from |
base/set.jl
Outdated
end | ||
end | ||
return itr | ||
T = eltype(itr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use 4 spaces for indentation instead of 3 for consistency with the rest of the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I need to make any changes to the method I have implemented ?
I will add test cases soon. Will update the indentation in the next commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could use some optimization.
julia> A = [rand(1:10) for _ in 1:100_000];
julia> unique!(Int[]); unique(Int[]); @time(1) # JIT warmup
julia> @time unique(A);
0.002214 seconds (13 allocations: 1008 bytes)
julia> @time unique!(A); # from this PR
2.650326 seconds (26 allocations: 2.001 MiB)
One reason for this is probably since deleteat!
shifts the remaining elements by one to overwrite the element you wish to delete.
You don't want this shift to occur a lot, so it should be called as little as possible, and provide it a range to delete as the second parameter: deleteat!(itr, 1:5)
, for example.
Note that functions that modify their arguments only make sense for mutable objects. This should be somewhat more efficient. Note that the deletion happens all at once at the end using a single call to function unique!(itr)
seen = Set{eltype(itr)}()
index_iterator = eachindex(itr)
inds = Set{eltype(index_iterator)}()
@inbounds for i in index_iterator
x = itr[i]
x in seen ? push!(inds, i) : push!(seen, x)
end
deleteat!(itr, inds)
return itr
end Edit: Updated after a discussion with @mbauman. |
Unfortunately julia> unique!(A)
ERROR: ArgumentError: indices must be unique and sorted
Stacktrace:
[1] _deleteat!(::Array{Int64,1}, ::Set{Int64}) at ./array.jl:834
[2] unique!(::Array{Int64,1}) at ./REPL[1]:9 |
Yeah, just realized that. That's why I originally had inds = Vector{eltype(index_iterator)}() should do it, then. |
Yes, that works ! with the earlier implementation julia> A = [rand(1:10) for _ in 1:100_000];
julia> @time unique!(A);
0.037155 seconds (6.46 k allocations: 2.266 MB) with the proposed implementation (deleting all the indices at once) julia> A = [rand(1:10) for _ in 1:100_000];
julia> @time unique!(A);
0.007427 seconds (26 allocations: 2.001 MB)
|
Updated stats: julia> A = [rand(1:10) for _ in 1:100_000];
julia> function unique!(itr)
seen = Set{eltype(itr)}()
index_iterator = eachindex(itr)
inds = Vector{eltype(index_iterator)}()
@inbounds for i in index_iterator
x = itr[i]
x in seen ? push!(inds, i) : push!(seen, x)
end
deleteat!(itr, inds)
return itr
end
unique! (generic function with 1 method)
julia> @time(1); unique!(Int[]); unique(Int[]); # JIT warmup
0.000004 seconds (107 allocations: 7.266 KiB)
julia> @time unique(A); @time unique!(A);
0.002006 seconds (13 allocations: 1008 bytes)
0.003708 seconds (26 allocations: 2.001 MiB) Allocations are a little high still. :-) |
Isn't that inevitable? For |
How about we raise an exception when we encounter immutable objects ? ...
function unique!(itr)
if isimmutable(itr)
throw("Collection contains immutable objects")
T = eltype(itr)
seen = Set{T}()
index_iterator = eachindex(itr)
... |
I originally had that in my proposed implementation as well, but that doesn't quite do what we want. julia> immutable A{T} <: AbstractVector{Int}
data::T
end
Base.size(X::A) = size(X.data)
Base.getindex(X::A, i::Int) = X.data[i]
Base.setindex!(X::A, v, i::Int) = setindex!(X.data, v, i)
julia> x = A([1,2,3])
3-element A{Array{Int64,1}}:
1
2
3
julia> isimmutable(x)
true
julia> x[2] = 0
0
julia> x
3-element A{Array{Int64,1}}:
1
0
3 It's probably better to just let |
I would write this for arrays first, make it fast and then figure out how to generalize it by looking at the kinds of operations one needs. The array version should not need to allocate except for the seen set. |
As extra credit, if the values are all in order or reverse order, no allocation is necessary at all. Since that's a fairly common case, it would be well worth having a fast path for it. It's a bit tricky to implement that switch from the fast path to the slow path though. |
Hi all, I am a little late to the party I know, but I managed to implement the function without needing to allocate besides the function uniquejd!(itr) # I used my initials to distinguish the function.
seen = Set{eltype(itr)}()
count = 0 # Number of unique values seen so far.
for x in itr
@inbounds if ∉(x, seen)
count += 1
push!(seen, x)
itr[count] = x
end
end
resize!(itr, count)
end
function uniqueh225!(itr)
seen = Set{eltype(itr)}()
index_iterator = eachindex(itr)
inds = Vector{eltype(index_iterator)}()
@inbounds for i in index_iterator
x = itr[i]
x in seen ? push!(inds, i) : push!(seen, x)
end
deleteat!(itr, inds)
return itr
end If I then do the performance tests that @h-225 did, I get the following: julia> srand(42); A = [rand(1:10) for _ in 1:100_000];
julia> uniquejd!(Int[])
julia> @time uniquejd!(A);
0.001507 seconds (14 allocations: 1024 bytes)
julia> srand(42); A = [rand(1:10) for _ in 1:100_000];
julia> uniquejd!(Int[])
julia> @time uniqueh225!(A);
0.002825 seconds (31 allocations: 2.002 MB) So roughly a factor of two speed up and much less allocation. |
Some more (maybe a bit more thorough) benchmarking: using BenchmarkTools
# The copy is insignificant
fh22d(A) = (B = copy(A); uniqueh225!(B))
fjd(A) = (B = copy(A); uniquejd!(B))
julia> @btime fh22d($A)
1.915 ms (24 allocations: 2.76 MiB)
julia> @btime fjd($A)
1.276 ms (7 allocations: 781.81 KiB)
julia> @btime unique($A)
1.128 ms (9 allocations: 848 bytes) |
@JackDevine function unique!(A::DenseArray)
seen = Set{eltype(A)}()
orig_length = length(A)
offset = 0
for i in eachindex(A)
if i + offset > orig_length
break
end
# Copy the next unique element into the current position, and
# keep track of how many indices we have skipped.
while A[i + offset] in seen
offset += 1
if i + offset > orig_length
@goto break_outer_loop
end
end
if offset > 0
A[i] = A[i + offset]
end
push!(seen, A[i])
end
@label break_outer_loop
return resize!(A, orig_length - offset)
end julia> @time Int[]
0.000003 seconds (5 allocations: 240 bytes)
0-element Array{Int64,1}
julia> unique!(Int[]);
julia> A = [rand(1:10) for _ in 1:100_000];
julia> @time unique!(A)
0.002092 seconds (9 allocations: 656 bytes) |
@JackDevine, that's a nice proposal. For completeness, an equally fast version that handles more general iterables would be function unique!(itr)
seen = Set{eltype(itr)}()
idxs = eachindex(itr)
m = n = start(idxs)
count = 0
@inbounds while !done(idxs, n)
i, n = next(idxs, n)
x = itr[i]
if x ∉ seen
count += 1
push!(seen, x)
j, m = next(idxs, m)
itr[j] = x
end
end
resize!(itr, count)
end |
base/set.jl
Outdated
return itr | ||
T = eltype(itr) | ||
seen = Set{T}() | ||
duplicateIndex = Vector{Int64}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style guide in CONTRIBUTING.md says not to use camelCase. You can just call this e.g. dups
.
EDIT: Woops, that comment is outdated, carry on.
base/set.jl
Outdated
end | ||
end | ||
return itr | ||
T = eltype(itr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to define T
since you use it only once.
@@ -191,6 +191,24 @@ function unique(f::Callable, C) | |||
end | |||
|
|||
""" | |||
unique!(itr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing elements is generally not possible on iterators, so better replace itr
with c
(for collection).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The signature has been changed, but not the signature in the docstring. The docstring should also be updated.
base/set.jl
Outdated
@@ -191,6 +191,24 @@ function unique(f::Callable, C) | |||
end | |||
|
|||
""" | |||
unique!(itr) | |||
|
|||
Removes recurrences of items and returns the modified collection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use imperative "Remove" and add ending dot. Also, better say "duplicated items" rather than "recurrence". Finally, would be worth saying a word about the fact that isequal
is used (like for unique
), and adding one or two examples (see how to do so in the Documentation section of the manual).
I made a pull request based on the suggestion from @pabloferz , I did some tests and his version was very fast, but more importantly it should generalize quite well. On the topic of generalization, there is one thing that I noticed function unique!(c)
seen = Set{eltype(c)}()
idxs = eachindex(c)
m = n = start(idxs)
count = 0
@inbounds while !done(c, n)
i, n = next(idxs, n)
x = c[i]
if x ∉ seen
count += 1
push!(seen, x)
j, m = next(idxs, m)
c[j] = x
end
end
resize!(c, count)
end
s1 = "The quick brown fox jumps over the lazy dog α,β,γ"
r = eachmatch(r"[\w]{4,}", s1)
julia> unique(r)
5-element Array{RegexMatch,1}:
RegexMatch("quick")
RegexMatch("brown")
RegexMatch("jumps")
RegexMatch("over")
RegexMatch("lazy")
julia> unique!(r)
LoadError: MethodError: no method matching eachindex(::Base.RegexMatchIterator)
Closest candidates are:
eachindex(!Matched::Tuple) at tuple.jl:19
eachindex(!Matched::Tuple, !Matched::Tuple...) at tuple.jl:22
eachindex(!Matched::AbstractArray{T,1}) at abstractarray.jl:679
...
while loading In[7], in expression starting on line 5
in uniquejd!(::Base.RegexMatchIterator) at .\In[2]:3 It looks like eachindex doesn't work on the regex match iterator, I don't know how much of a big deal this is, but I thought it was worth letting people know. |
Anyway you cannot modify in-place a |
Ok thanks for that clarification, I guess that |
test/sets.jl
Outdated
@test in(1,u) | ||
@test in(2,u) | ||
@test in(3,u) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just needs to test the actual value of u
. That ensures that the length and order of occurrence is correct. In doing so I recommend swapping 3 and 2 in u
to make it clearer. Perhaps something like
@testset "unique!" begin
u = [1, 1, 3, 2, 1]
unique!(u)
@test u == [1, 3, 2]
end
Currently these tests don't test that the order of occurrence of the unique values is preserved.
test/sets.jl
Outdated
@@ -216,6 +216,13 @@ u = unique([1,1,2]) | |||
@test @inferred(unique(x for x in 1:1)) == [1] | |||
@test unique(x for x in Any[1,1.0])::Vector{Real} == [1] | |||
|
|||
# unique! | |||
@testset "unique!" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a trailing space after begin
, which is causing make check-whitespace
to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also the comment above is redundant with the name of the test set)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already updated the commit :P
base/set.jl
Outdated
1-element Array{Int64,1}: | ||
1 | ||
|
||
julia> unique!([7, 3, 2, 3, 7, 5]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really pedantic, but there's an extra space here between julia>
and unique!
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, thanks for noticing that, I just copied some of the examples from before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we use the implementation suggested by @JackDevine
/@pabloferz
which proved to be more efficient?
base/set.jl
Outdated
1-element Array{Int64,1}: | ||
1 | ||
|
||
julia> unique!([7, 3, 2, 3, 7, 5]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better example would be:
julia> A = [7, 3, 2, 3, 7, 5];
julia> unique!(A);
julia> A
4-element Array{Int64,1}:
7
3
2
5
To show that it is changing A
. Otherwise this example would look exactly like unique
behaves
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I thought that we were using the version with less allocation, that was what my pull request was for. This is my first time contributing so I think that I may have done some things wrong with my pull request and stepped slightly outside my skill range.
Would you say that the right thing to do is to just make a new pull request with the Jack/Pablo version of unique!
with the updated documentation and adding unique!
to base/exports.jl?
|
I don't think there's a strong case to be made for supporting |
I made a new pull request with an updated version of |
Closing in favor of #20619. |
In Ref #20549
Adding a naive implementation to
unique!
function, will be optimized