Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SortedVector type and tests #290

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

dpsanders
Copy link
Contributor

@dpsanders dpsanders commented Jun 1, 2017

Since I needed a SortedVector type, here it is.

The hardest part was writing the binary search.
A simpler implementation would be to just to push the new element onto the end and then use sort!, but that seemed to be slightly slower in my not-extensive-nor-rigorous tests.

@StephenVavasis
Copy link
Contributor

This functionality is already available in the SortedSet container. Furthermore, SortedSet will in general be more efficient because insertion and deletion to SortedSet require O(log n) operations, whereas insertion and deletion in SortedVector require O(n) operations.

In addition, Base already contains the functions: issorted, searchsorted_first, and searchsorted_last to deal with sorted vectors.

If you believe that, despite what is already available. there is a use-case for the new SortedVector, then possibly you could document the use-case. Also, you might consider following the same interface provided by SortedSet and the other sorted containers. For SortedSet, the second type parameter is an order object rather than a function.

@dpsanders
Copy link
Contributor Author

dpsanders commented Jun 1, 2017

Thanks, I see your point.

One possible answer is the overhead for storing the set, e.g.

julia> s = SortedSet([7, 5, 1])
SortedSet([1, 5, 7],
Base.Order.ForwardOrdering())

julia> Base.summarysize(s)
424

julia> v = SortedVector([7, 5, 1])
SortedVector([1, 5, 7])

julia> Base.summarysize(v)
32

I understand that the rotations etc. needed to balance the set can be expensive, but it's not very clear to me how to benchmark things like that which modify the structure being tested.

@dpsanders
Copy link
Contributor Author

Also, in my application, I will need to have multiple values that compare as being equal (with a non-standard comparison that only compares the first element of a tuple). This doesn't seem to be possible using SortedSet.

@StephenVavasis
Copy link
Contributor

My previous statement that SortedSet implements the same functionality as SortedVector is incorrect-- SortedSet{K} does not allow multiple copies of the same data item whereas SortedVector{K} allows multiplicities. Rather, SortedVector{K} could be implemented as SortedDict{K,Int}, where the Int value field contains the multiplicity.

If you'd like a reasonable benchmark for a sorted set that includes insertion and/or deletion, you could code up the Sieve of Eratosthenes algorithm for finding all the prime numbers between 1 and, say, 10^7 to compare SortedSet to SortedVector.

@dpsanders
Copy link
Contributor Author

I can't see from the docs how to use a non-standard ordering, e.g. x->x[1] for a tuple.

@StephenVavasis
Copy link
Contributor

If you want a structure that will preserve multiple insertions of keys that are distinct under == but test as equals under the sort order, then your best bet would be SortedMultiDict{K,Void}. The SortedMultiDict container allows multiple keys that test as equal by the sort order.

You can implement a test based on the first entry either using the Lt type from Base
(use Lt((x,y) -> x[1] < y[1]) as the order object), or implement this as your own type -- something like the following (untested)

immutable OrderByFirstEntry <: Ordering
end

eq(::OrderByFirstEntry, a, b) = a[1] == b[1]
lt(::OrderByFirstEntry, a, b) = a[1] < b[1]

and then use s = SortedMultiDict(K,Void,OrderByFirstEntry)(OrderByFirstEntry()) to create the initially empty object.

@dpsanders
Copy link
Contributor Author

Thanks. One of the reasons I need a sorted vector or related is that I will frequently need to remove all of those entries that are larger than a certain value. Is that functionality available / easy to implement in SortedMultiDict? It feels like it should just corresponding to lopping off a branch from the tree and doing a rebalance.

@StephenVavasis
Copy link
Contributor

In principle, it is possible to chop off an entire subtree of the tree and then rebalance, but it would be a lot of code to write. (Have a look at the delete! function in balanced_tree,jl.) Depending how often you need to carry out this operation, a Vector solution might be better than SortedMultiDict.

@dpsanders
Copy link
Contributor Author

OK thanks. I guess that would be a high complexity operation? (O(number deleted items) or similar?)

@StephenVavasis
Copy link
Contributor

Currently, to delete k items from a sorted multidict would require O(k*log n) operations. If new special-purpose code were written for deleting a block of k consecutive items, then I think this would require O(k + log n) operations. To delete the last k items from a vector requires only O(1) time (since it corresponds to resize).

@kmsquire
Copy link
Member

A simpler implementation would be to just to push the new element onto the end and then use sort!, but that seemed to be slightly slower in my not-extensive-nor-rigorous tests.

Just for curiosity: if you run those tests, but use TimSort from SortingAlgorithms.jl, is that faster? That algorithm targets "mostly sorted" lists.

@dpsanders
Copy link
Contributor Author

I guess the slow part is moving the data around, which TimSort would also (presumably) end up doing.

SortedVectors seem to be working well in my application. It seems a shame to just keep this functionality inside my package, when DataStructures.jl is the obvious home for it, so any objection to merging this?

@codecov
Copy link

codecov bot commented Jun 14, 2017

Codecov Report

Merging #290 into master will decrease coverage by 0.53%.
The diff coverage is 33.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #290      +/-   ##
==========================================
- Coverage    95.9%   95.36%   -0.54%     
==========================================
  Files          31       32       +1     
  Lines        2244     2265      +21     
==========================================
+ Hits         2152     2160       +8     
- Misses         92      105      +13
Impacted Files Coverage Δ
src/DataStructures.jl 80% <ø> (-20%) ⬇️
src/sorted_vector.jl 33.33% <33.33%> (ø)
src/ordered_dict.jl 87.02% <0%> (+0.05%) ⬆️
src/priorityqueue.jl 88.76% <0%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 161792b...ecb7eb3. Read the comment docs.

Copy link
Member

@kmsquire kmsquire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a few comments about a couple of minor issues.

There has been some discussion about having the output of sort() be a SortedVector type, so this is a nice start to testing that idea!

Depending on how it's used, there are probably more efficient representations as the size grows large (e.g., something like the blocked implementation of Deques here).

Anyway, after fixing the issues I mentioned, I think this is nice to have. Improvements could come as someone needs them or has time to implement them.

by::F

function SortedVector(data::Vector{T}, by::F)
new(sort(data), by)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to pass by to sort() here.

Should also pass through rev and alg as parameters to sort().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And for that matter, rev could/should probably be a separate field here as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks.

rev seems to be redundant with by.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rev isn't redundant with by: sort takes both arguments. Would be better to make them keyword arguments, to match sort.

It would also be useful to provide a way to sort the data in-place, and to declare that the input is already sorted. Not sure what's the best way to do this; maybe just a sorted keyword argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the constructor shouldn't call sort. You end up needing SortedVector(v, dont_sort=true), instead of simply writing SortedVector(v), SortedVector(sort(v)), SortedVector(sort!(v)) as needed.

To support different orderings, SortedVector should have a Base.Order.Ordering type parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the constructor doesn't call sort, you can end up with a non-sorted SortedVector. Or am I missing something? Or just call issorted first to see if it needs sorting? (But is that much faster than just calling sort if it's already sorted?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I understand the point. But it should definitely not be possible to create the object without any sorting at all.



getindex(v::SortedVector, i::Int) = v.data[i]
length(v::SortedVector) = length(v.data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These and a few functions below can be simplified using @delegate

push!(v, (1, "x"))
@test v.data == [(1, "x"), (2, "b"), (3, "z"), (4, "y"), (5, "a"), (6, "z")]

end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a testset using

v = SortedVector([1,2,3], x->-x)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also test all methods defined for the type. Can you also test that an error is thrown when calling setindex!?

@kmsquire
Copy link
Member

Cc: @nalimilan

@nalimilan
Copy link

Thanks for Ccing me, @kmsquire. Indeed SortedVector would be very useful to allow getting rid of the searchsorted* functions in favor of dedicated methods for the generic find/search functions. There seems to be some support by core devs about this.

However, I'm not sure it would be OK to deprecate searchsorted* functions in Base and require people to use DataStructures. Since essential features like sorted dicts are also in the package, it may be fine. If not, SortedVector would have to be included in Base. That's not a lot of code, but the trend has generally been to shrink Base as much as possible.

Copy link

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be useful to define issorted, IndexStyle, and maybe other functions.

by::F

function SortedVector(data::Vector{T}, by::F)
new(sort(data), by)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rev isn't redundant with by: sort takes both arguments. Would be better to make them keyword arguments, to match sort.

It would also be useful to provide a way to sort the data in-place, and to declare that the input is already sorted. Not sure what's the best way to do this; maybe just a sorted keyword argument.

A `SortedVector` behaves like a standard Julia `Vector`, except that its elements are stored in sorted order, with an optional function `by` that determines the sorting order in the same way as `Base.searchsorted`.
"""
immutable SortedVector{T, F<:Function}
data::Vector{T}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use AbstractVector everywhere (here, via a type parameter).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use case for using an AbstractVector?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, one could use a DataArray, a CategoricalArray... Is there any reason not to support it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I was just worried that there are some subtypes of AbstractVector that wouldn't support the necessary operations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT you only use functions in the AbstractArray interface, right? Anyway, even if a method fails for some peculiar types, it cannot be worse than if you only accept Vector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, there will be a big performance hit for making the inner container an AbstractVector. The compiler won't be able to ascertain the actual type of the inner container at compile-time so there will be a lot more run-time dispatching. To avoid this problem, you would need to make both the data type and the inner container type parameters of SortedVector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the container type a parameter is what @nalimilan was suggesting.

@@ -0,0 +1,52 @@
"""
A `SortedVector` behaves like a standard Julia `Vector`, except that its elements are stored in sorted order, with an optional function `by` that determines the sorting order in the same way as `Base.searchsorted`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap at 92 chars. Also, better refer to sort than to searchsorted. The docstring should also mention explicitly what happens on insert.

length(v::SortedVector) = length(v.data)


function insert!{T}(v::SortedVector{T}, i::Int, x::T)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't use ::T, just leave push! do the conversion automatically if possible. Same below.

Though this method should probably throw an error, since it doesn't respect the contract of inserting x at position i.



function insert!{T}(v::SortedVector{T}, i::Int, x::T)
push!(v.data, i, x)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i is incorrect here, right?

@@ -102,6 +104,8 @@ module DataStructures

include("priorityqueue.jl")

include("sorted_vector.jl")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No underscore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a look at the other file names in the package.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny, my comment was based on priorityqueue.jl, which is the only example shown in the diff, and apparently also the only exception.

push!(v, (1, "x"))
@test v.data == [(1, "x"), (2, "b"), (3, "z"), (4, "y"), (5, "a"), (6, "z")]

end

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also test all methods defined for the type. Can you also test that an error is thrown when calling setindex!?



SortedVector{T,F}(data::Vector{T}, by::F) = SortedVector{T,F}(data, by)
SortedVector{T}(data::Vector{T}) = SortedVector{T,typeof(identity)}(data, identity)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get rid of this function by adding by=identity to the previous one.


function push!{T}(v::SortedVector{T}, x::T)
i = searchsortedfirst(v.data, x, by=v.by)
insert!(v.data, i, x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should just check that the pushed item is greater than the current last item in the vector. Otherwise it doesn't implement the contract that pop! returns the last thing you push!ed.

@dpsanders
Copy link
Contributor Author

The documentation for pop! says that it returns the last element in the structure, not the last element added to the structure.

To me push! just means "add this element to the data structure"; I don't see why it should be in the last place.

@dpsanders
Copy link
Contributor Author

Many thanks for the detailed review @nalimilan. I don't think I'll have time to do all that any time soon though.

@kmsquire
Copy link
Member

@nalimilan said

However, I'm not sure it would be OK to deprecate searchsorted* functions in Base and require people to use DataStructures. Since essential features like sorted dicts are also in the package, it may be fine. If not, SortedVector would have to be included in Base. That's not a lot of code, but the trend has generally been to shrink Base as much as possible.

I agree. I had a longer response with a more detailed proposal for what specifically to do with searchsorted* (in a type stable way) ready for JuliaLang/julia#22354 (comment), but I thought it was getting too far from the topic there. I'll open an issue in julialang.

@kmsquire
Copy link
Member

kmsquire commented Sep 2, 2017

bump

@nalimilan
Copy link

Actually, searchsorted functions are used in Base for sparse matrices, so the replacement will have to be defined in Base too. Though we could use a very simpler wrapper called e.g. Sorted which would work with any iterable, but be read-only. SortedVector could live in DataStructures and support more operations.

@dpsanders
Copy link
Contributor Author

Sorry, dropped the ball on this one.
If anybody would like to finish this off, feel free...

@tpapp
Copy link

tpapp commented Nov 19, 2018

I am interested in reviving this, since it would be nice to have a vector-like container that simply indicates that the contents are sorted.

However, from the long discussion above it seems there was some resistance to the idea. I wonder if the package maintainers would indicate under what conditions they would consider merging something like this, otherwise I can just start another package for it.

@tpapp
Copy link

tpapp commented Nov 20, 2018

Made a package for this, comments welcome:
https://github.com/tpapp/SortedVectors.jl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants