-
Notifications
You must be signed in to change notification settings - Fork 252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SortedVector type and tests #290
base: master
Are you sure you want to change the base?
Conversation
This functionality is already available in the SortedSet container. Furthermore, SortedSet will in general be more efficient because insertion and deletion to SortedSet require O(log n) operations, whereas insertion and deletion in SortedVector require O(n) operations. In addition, Base already contains the functions: issorted, searchsorted_first, and searchsorted_last to deal with sorted vectors. If you believe that, despite what is already available. there is a use-case for the new SortedVector, then possibly you could document the use-case. Also, you might consider following the same interface provided by SortedSet and the other sorted containers. For SortedSet, the second type parameter is an order object rather than a function. |
Thanks, I see your point. One possible answer is the overhead for storing the set, e.g. julia> s = SortedSet([7, 5, 1])
SortedSet([1, 5, 7],
Base.Order.ForwardOrdering())
julia> Base.summarysize(s)
424
julia> v = SortedVector([7, 5, 1])
SortedVector([1, 5, 7])
julia> Base.summarysize(v)
32 I understand that the rotations etc. needed to balance the set can be expensive, but it's not very clear to me how to benchmark things like that which modify the structure being tested. |
Also, in my application, I will need to have multiple values that compare as being equal (with a non-standard comparison that only compares the first element of a tuple). This doesn't seem to be possible using |
My previous statement that SortedSet implements the same functionality as SortedVector is incorrect-- SortedSet{K} does not allow multiple copies of the same data item whereas SortedVector{K} allows multiplicities. Rather, SortedVector{K} could be implemented as SortedDict{K,Int}, where the Int value field contains the multiplicity. If you'd like a reasonable benchmark for a sorted set that includes insertion and/or deletion, you could code up the Sieve of Eratosthenes algorithm for finding all the prime numbers between 1 and, say, 10^7 to compare SortedSet to SortedVector. |
I can't see from the docs how to use a non-standard ordering, e.g. |
If you want a structure that will preserve multiple insertions of keys that are distinct under == but test as equals under the sort order, then your best bet would be SortedMultiDict{K,Void}. The SortedMultiDict container allows multiple keys that test as equal by the sort order. You can implement a test based on the first entry either using the Lt type from Base
and then use |
Thanks. One of the reasons I need a sorted vector or related is that I will frequently need to remove all of those entries that are larger than a certain value. Is that functionality available / easy to implement in |
In principle, it is possible to chop off an entire subtree of the tree and then rebalance, but it would be a lot of code to write. (Have a look at the |
OK thanks. I guess that would be a high complexity operation? (O(number deleted items) or similar?) |
Currently, to delete k items from a sorted multidict would require O(k*log n) operations. If new special-purpose code were written for deleting a block of k consecutive items, then I think this would require O(k + log n) operations. To delete the last k items from a vector requires only O(1) time (since it corresponds to resize). |
Just for curiosity: if you run those tests, but use |
I guess the slow part is moving the data around, which
|
Codecov Report
@@ Coverage Diff @@
## master #290 +/- ##
==========================================
- Coverage 95.9% 95.36% -0.54%
==========================================
Files 31 32 +1
Lines 2244 2265 +21
==========================================
+ Hits 2152 2160 +8
- Misses 92 105 +13
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a few comments about a couple of minor issues.
There has been some discussion about having the output of sort()
be a SortedVector
type, so this is a nice start to testing that idea!
Depending on how it's used, there are probably more efficient representations as the size grows large (e.g., something like the blocked implementation of Deque
s here).
Anyway, after fixing the issues I mentioned, I think this is nice to have. Improvements could come as someone needs them or has time to implement them.
by::F | ||
|
||
function SortedVector(data::Vector{T}, by::F) | ||
new(sort(data), by) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to pass by
to sort()
here.
Should also pass through rev
and alg
as parameters to sort()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And for that matter, rev
could/should probably be a separate field here as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks.
rev
seems to be redundant with by
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rev
isn't redundant with by
: sort
takes both arguments. Would be better to make them keyword arguments, to match sort
.
It would also be useful to provide a way to sort the data in-place, and to declare that the input is already sorted. Not sure what's the best way to do this; maybe just a sorted
keyword argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the constructor shouldn't call sort
. You end up needing SortedVector(v, dont_sort=true)
, instead of simply writing SortedVector(v)
, SortedVector(sort(v))
, SortedVector(sort!(v))
as needed.
To support different orderings, SortedVector
should have a Base.Order.Ordering
type parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the constructor doesn't call sort
, you can end up with a non-sorted SortedVector
. Or am I missing something? Or just call issorted
first to see if it needs sorting? (But is that much faster than just calling sort
if it's already sorted?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I understand the point. But it should definitely not be possible to create the object without any sorting at all.
|
||
|
||
getindex(v::SortedVector, i::Int) = v.data[i] | ||
length(v::SortedVector) = length(v.data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These and a few functions below can be simplified using @delegate
push!(v, (1, "x")) | ||
@test v.data == [(1, "x"), (2, "b"), (3, "z"), (4, "y"), (5, "a"), (6, "z")] | ||
|
||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a testset using
v = SortedVector([1,2,3], x->-x)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also test all methods defined for the type. Can you also test that an error is thrown when calling setindex!
?
Cc: @nalimilan |
Thanks for Ccing me, @kmsquire. Indeed However, I'm not sure it would be OK to deprecate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be useful to define issorted
, IndexStyle
, and maybe other functions.
by::F | ||
|
||
function SortedVector(data::Vector{T}, by::F) | ||
new(sort(data), by) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rev
isn't redundant with by
: sort
takes both arguments. Would be better to make them keyword arguments, to match sort
.
It would also be useful to provide a way to sort the data in-place, and to declare that the input is already sorted. Not sure what's the best way to do this; maybe just a sorted
keyword argument.
A `SortedVector` behaves like a standard Julia `Vector`, except that its elements are stored in sorted order, with an optional function `by` that determines the sorting order in the same way as `Base.searchsorted`. | ||
""" | ||
immutable SortedVector{T, F<:Function} | ||
data::Vector{T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use AbstractVector
everywhere (here, via a type parameter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the use case for using an AbstractVector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, one could use a DataArray
, a CategoricalArray
... Is there any reason not to support it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I was just worried that there are some subtypes of AbstractVector that wouldn't support the necessary operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT you only use functions in the AbstractArray
interface, right? Anyway, even if a method fails for some peculiar types, it cannot be worse than if you only accept Vector
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, there will be a big performance hit for making the inner container an AbstractVector
. The compiler won't be able to ascertain the actual type of the inner container at compile-time so there will be a lot more run-time dispatching. To avoid this problem, you would need to make both the data type and the inner container type parameters of SortedVector
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making the container type a parameter is what @nalimilan was suggesting.
@@ -0,0 +1,52 @@ | |||
""" | |||
A `SortedVector` behaves like a standard Julia `Vector`, except that its elements are stored in sorted order, with an optional function `by` that determines the sorting order in the same way as `Base.searchsorted`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap at 92 chars. Also, better refer to sort
than to searchsorted
. The docstring should also mention explicitly what happens on insert.
length(v::SortedVector) = length(v.data) | ||
|
||
|
||
function insert!{T}(v::SortedVector{T}, i::Int, x::T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't use ::T
, just leave push!
do the conversion automatically if possible. Same below.
Though this method should probably throw an error, since it doesn't respect the contract of inserting x
at position i
.
|
||
|
||
function insert!{T}(v::SortedVector{T}, i::Int, x::T) | ||
push!(v.data, i, x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i
is incorrect here, right?
@@ -102,6 +104,8 @@ module DataStructures | |||
|
|||
include("priorityqueue.jl") | |||
|
|||
include("sorted_vector.jl") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No underscore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have a look at the other file names in the package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny, my comment was based on priorityqueue.jl
, which is the only example shown in the diff, and apparently also the only exception.
push!(v, (1, "x")) | ||
@test v.data == [(1, "x"), (2, "b"), (3, "z"), (4, "y"), (5, "a"), (6, "z")] | ||
|
||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also test all methods defined for the type. Can you also test that an error is thrown when calling setindex!
?
|
||
|
||
SortedVector{T,F}(data::Vector{T}, by::F) = SortedVector{T,F}(data, by) | ||
SortedVector{T}(data::Vector{T}) = SortedVector{T,typeof(identity)}(data, identity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can get rid of this function by adding by=identity
to the previous one.
|
||
function push!{T}(v::SortedVector{T}, x::T) | ||
i = searchsortedfirst(v.data, x, by=v.by) | ||
insert!(v.data, i, x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should just check that the pushed item is greater than the current last item in the vector. Otherwise it doesn't implement the contract that pop!
returns the last thing you push!
ed.
The documentation for pop! says that it returns the last element in the structure, not the last element added to the structure. To me push! just means "add this element to the data structure"; I don't see why it should be in the last place. |
Many thanks for the detailed review @nalimilan. I don't think I'll have time to do all that any time soon though. |
@nalimilan said
I agree. I had a longer response with a more detailed proposal for what specifically to do with |
bump |
Actually, |
Sorry, dropped the ball on this one. |
I am interested in reviving this, since it would be nice to have a vector-like container that simply indicates that the contents are sorted. However, from the long discussion above it seems there was some resistance to the idea. I wonder if the package maintainers would indicate under what conditions they would consider merging something like this, otherwise I can just start another package for it. |
Made a package for this, comments welcome: |
Since I needed a SortedVector type, here it is.
The hardest part was writing the binary search.
A simpler implementation would be to just to push the new element onto the end and then use
sort!
, but that seemed to be slightly slower in my not-extensive-nor-rigorous tests.