Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the inequalities to the leftjoin() and leftjoin!() functions. #85

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

kyo227
Copy link

@kyo227 kyo227 commented Sep 8, 2022

I refer to the inequality that implementation on the on parameter in innerjoin() function, then implemented it on the leftjoin() and leftjoin!() function.
Now, we can use leftjoin like this: leftjoin(dsl,dsr,on = [1=>1, 2=>(:lower,nothing)]).
The methods used to find ranges and idx are the same as the methods in innerjoin, i just modified the way that uses ranges and idx to generate the result dataset in leftjoin().
I have done some testing, and this function works fine, and the run time and allocations are similar with innerjoin().

@sl-solution
Copy link
Owner

Please add tests for you functions. test/join.jl has a test set for range join, you may put your tests there.

@kyo227
Copy link
Author

kyo227 commented Sep 12, 2022

I've added tests to test/join.jl and fixed some bugs.

@sl-solution
Copy link
Owner

This is a good PR, however, I have some concerns about the way you implemented it.
General feedback:

  • There are lots of code repeats
  • You should use innerjoin implementation. You may put all core computation of innerjoin in a new function which can handle both inner and left joins.
  • leftjoin and innerjoin are basically the same, except that innerjoin drops rows with no match (these rows have inbits = false)
  • There are no test for leftjoin! - the question is do we need range join for leftjoin!?
    I added some comments for some parts of the code.

src/join/join.jl Outdated
cnt = 1
cnt_r = 1
if length(r2) == 0
_res = allowmissing(_res)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be here. If _res needs to support missing, it must be set when it is initialised.

src/join/join.jl Outdated
end
revised_ends = _mark_lt_part!(inbits, _columns(dsl)[left_range_col], _columns(dsr)[right_range_cols[2]], _fl, _fr, ranges, idx, new_ends, total_length < typemax(Int32) ? Val(Int32) : Val(Int64); strict = strict_inequality[2], threads = threads)
inbit_ranges = Vector{UnitRange{T}}(undef, nrow(dsl))
ra = Vector{UnitRange{T}}(undef, nrow(dsl))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ra is needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ra is the lo:hi computed from the original new_ends, which corresponds to the inbits. For example, inbits = [1,0,1,0,0], ra = [1:1, 2:2, 3:4, 5:5], that means the 4th rows need to be deleted. And i modify new_ends later to get the right number of rows.
So i can use new_ends to calculate the lo:hi in _fill_right_cols_table_left!(). I have to calculate it in there.

Copy link
Owner

@sl-solution sl-solution Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you calculate new_ends = map(x->max(length(x), 1), ranges) in innerjoin and when ranges is 1:0 the corresponding row in the left dataset be set as missing in the output then it should be fine. Additionally, when for a row all inbits are false the output dataset should produce missing. So, the point is with a little modification ofinnerjoin we can achieve left range join.

Copy link
Author

@kyo227 kyo227 Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ranges only check the left part of inequality(type = :left). For example, 2 => (:low, :high), the ranges is the result of 2 => (:low, nothing). So, maybe some rows shouldn't be in the result, but its ranges are not 1:0. So, the new_ends and the inbits can not find the correct number of rows for res. We need to calculate the lo2:hi2 to find the correct size of res. And inbits can just show how many rows can be matched, we don't know which rows don't match at all(these rows need to be add to result with missing).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may use new_ends = map(x->max(length(x), 1), ranges) and exploits inbits to figure out which rows do not have matching rows in the right data set.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After we get the number of rows in result, the method in innerjoin use the r2.start(r2 is lo2:hi2) as the index of _res. But lo2:hi2 do not have the missing rows.
image
The _res[2] should be missing, but if we use the method in innerjoin, the _res[2] will be the 3rd rows.
image
So, in my code, i calculate the new new_ends as the index of _res. but the old new_ends is also needed, so ra is the old new_ends .

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
However, I am thinking something like this:
Suppose we have (:lower, :upper) condition and ranges are [2:3,1:0,4:5,1:0] so we creates new_ends as cumsum([2,1,2,1]) and inbits is [0,0,0,0,0,0]. _mark_lt_part! will works on inbits in locations [1:2, 4:5]. Suppose that the first one, i.e. 1:2 does not have any matching rows but the second one has one. So _mark_lt_part! will set the first element of ranges to 1:0 and inbits will be something like [0,0,0,0,1,0]. At this moment we know that ragnes with values 1:0 only exist in the left data set and inbits will help us to figure out the matching rows in the right data set.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it is hard to find the correct location of _res by that way. Because in a single loop for one row, we only know the ranges, lo:hi and lo2:hi2 for that row. We don't know how many rows in previous rows because of multithreading. For your example, ranges is [1:0, 1:0, 4:5, 1:0], for 3rd rows, we just know lo:hi is 4:5, lo2:hi2 is 1:1. We can not know where the data in 3rd row goes in the _res.
And this is why I calculated lo:hi and lo2:hi2 in fornt of the loop of rows, and use map(x -> max(1, length(x)), inbit_ranges) to get the starting position of each row.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, however, new_ends shows us exactly which rows of inbits belongs to a specific row in the left table, e.g. in the example above new_ends = [2,3,5,6], thus the first row in the left table has two rows in inbits, additionally we already know that row 1 is 1:0 and it only creates one row in the output data set, and so on.

src/join/join.jl Outdated
revised_ends = _mark_lt_part!(inbits, _columns(dsl)[left_range_col], _columns(dsr)[right_range_cols[2]], _fl, _fr, ranges, idx, new_ends, total_length < typemax(Int32) ? Val(Int32) : Val(Int64); strict = strict_inequality[2], threads = threads)
inbit_ranges = Vector{UnitRange{T}}(undef, nrow(dsl))
ra = Vector{UnitRange{T}}(undef, nrow(dsl))
@_threadsfor threads for i in 1:length(ranges)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the core computations inside a separate function - it helps reducing allocations

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need this at all.

@kyo227
Copy link
Author

kyo227 commented Sep 13, 2022

Thank you for your advice. It seems that my code still has some problems. In this part, I forgot to change this function. So ,the test will not pass.
image

…s the test of test/join.jl.

leftjoin!() may be not necessary to implement the range join function, so i didn't change it.
I write a function named _ranges_join(), which can implement the leftjoin and innerjoin.
_ranges_join() use the parameter join_type to distinguished the leftjoin and innerjoin.
@kyo227
Copy link
Author

kyo227 commented Sep 20, 2022

I made changes to the implementation of range join in leftjoin and it passed the test in test/join.jl.

@kyo227
Copy link
Author

kyo227 commented Sep 22, 2022

Sorry, I found some problems in my code. This code is not ready yet.

@sl-solution sl-solution mentioned this pull request May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants