-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the inequalities to the leftjoin() and leftjoin!() functions. #85
base: master
Are you sure you want to change the base?
Conversation
Please add tests for you functions. |
…r) in LeftJoin, that the ranges and inbit do not match, so the result will have extra lines. Last version use the filter() function, which greatly affected the efficency, so i changed it.
I've added tests to |
This is a good PR, however, I have some concerns about the way you implemented it.
|
src/join/join.jl
Outdated
cnt = 1 | ||
cnt_r = 1 | ||
if length(r2) == 0 | ||
_res = allowmissing(_res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't be here. If _res
needs to support missing
, it must be set when it is initialised.
src/join/join.jl
Outdated
end | ||
revised_ends = _mark_lt_part!(inbits, _columns(dsl)[left_range_col], _columns(dsr)[right_range_cols[2]], _fl, _fr, ranges, idx, new_ends, total_length < typemax(Int32) ? Val(Int32) : Val(Int64); strict = strict_inequality[2], threads = threads) | ||
inbit_ranges = Vector{UnitRange{T}}(undef, nrow(dsl)) | ||
ra = Vector{UnitRange{T}}(undef, nrow(dsl)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ra
is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ra is the lo:hi computed from the original new_ends, which corresponds to the inbits. For example, inbits = [1,0,1,0,0], ra = [1:1, 2:2, 3:4, 5:5], that means the 4th rows need to be deleted. And i modify new_ends later to get the right number of rows.
So i can use new_ends to calculate the lo:hi in _fill_right_cols_table_left!(). I have to calculate it in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you calculate new_ends = map(x->max(length(x), 1), ranges)
in innerjoin
and when ranges
is 1:0
the corresponding row in the left dataset be set as missing in the output then it should be fine. Additionally, when for a row all inbits
are false
the output dataset should produce missing
. So, the point is with a little modification ofinnerjoin
we can achieve left range join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ranges
only check the left part of inequality(type = :left
). For example, 2 => (:low, :high)
, the ranges is the result of 2 => (:low, nothing)
. So, maybe some rows shouldn't be in the result, but its ranges
are not 1:0
. So, the new_ends
and the inbits
can not find the correct number of rows for res
. We need to calculate the lo2:hi2
to find the correct size of res
. And inbits
can just show how many rows can be matched, we don't know which rows don't match at all(these rows need to be add to result with missing
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may use new_ends = map(x->max(length(x), 1), ranges)
and exploits inbits
to figure out which rows do not have matching rows in the right data set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After we get the number of rows in result, the method in innerjoin
use the r2.start
(r2 is lo2:hi2) as the index of _res
. But lo2:hi2
do not have the missing rows.
The _res[2]
should be missing, but if we use the method in innerjoin
, the _res[2]
will be the 3rd rows.
So, in my code, i calculate the new new_ends
as the index of _res
. but the old new_ends
is also needed, so ra
is the old new_ends
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
However, I am thinking something like this:
Suppose we have (:lower, :upper) condition and ranges
are [2:3,1:0,4:5,1:0]
so we creates new_ends
as cumsum([2,1,2,1])
and inbits
is [0,0,0,0,0,0]
. _mark_lt_part!
will works on inbits
in locations [1:2, 4:5]
. Suppose that the first one, i.e. 1:2
does not have any matching rows but the second one has one. So _mark_lt_part!
will set the first element of ranges
to 1:0
and inbits
will be something like [0,0,0,0,1,0]
. At this moment we know that ragnes
with values 1:0
only exist in the left data set and inbits
will help us to figure out the matching rows in the right data set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it is hard to find the correct location of _res
by that way. Because in a single loop for one row, we only know the ranges
, lo:hi
and lo2:hi2
for that row. We don't know how many rows in previous rows because of multithreading. For your example, ranges
is [1:0, 1:0, 4:5, 1:0]
, for 3rd rows, we just know lo:hi
is 4:5
, lo2:hi2
is 1:1
. We can not know where the data in 3rd row goes in the _res
.
And this is why I calculated lo:hi
and lo2:hi2
in fornt of the loop of rows, and use map(x -> max(1, length(x)), inbit_ranges)
to get the starting position of each row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, however, new_ends
shows us exactly which rows of inbits
belongs to a specific row in the left table, e.g. in the example above new_ends = [2,3,5,6]
, thus the first row in the left table has two rows in inbits
, additionally we already know that row 1 is 1:0
and it only creates one row in the output data set, and so on.
src/join/join.jl
Outdated
revised_ends = _mark_lt_part!(inbits, _columns(dsl)[left_range_col], _columns(dsr)[right_range_cols[2]], _fl, _fr, ranges, idx, new_ends, total_length < typemax(Int32) ? Val(Int32) : Val(Int64); strict = strict_inequality[2], threads = threads) | ||
inbit_ranges = Vector{UnitRange{T}}(undef, nrow(dsl)) | ||
ra = Vector{UnitRange{T}}(undef, nrow(dsl)) | ||
@_threadsfor threads for i in 1:length(ranges) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the core computations inside a separate function - it helps reducing allocations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we need this at all.
…s the test of test/join.jl. leftjoin!() may be not necessary to implement the range join function, so i didn't change it. I write a function named _ranges_join(), which can implement the leftjoin and innerjoin. _ranges_join() use the parameter join_type to distinguished the leftjoin and innerjoin.
I made changes to the implementation of range join in leftjoin and it passed the test in test/join.jl. |
Sorry, I found some problems in my code. This code is not ready yet. |
I refer to the inequality that implementation on the on parameter in innerjoin() function, then implemented it on the leftjoin() and leftjoin!() function.
Now, we can use leftjoin like this: leftjoin(dsl,dsr,on = [1=>1, 2=>(:lower,nothing)]).
The methods used to find ranges and idx are the same as the methods in innerjoin, i just modified the way that uses ranges and idx to generate the result dataset in leftjoin().
I have done some testing, and this function works fine, and the run time and allocations are similar with innerjoin().