Add the inequalities to the leftjoin() and leftjoin!() functions. #85

kyo227 · 2022-09-08T14:46:29Z

I refer to the inequality that implementation on the on parameter in innerjoin() function, then implemented it on the leftjoin() and leftjoin!() function.
Now, we can use leftjoin like this: leftjoin(dsl,dsr,on = [1=>1, 2=>(:lower,nothing)]).
The methods used to find ranges and idx are the same as the methods in innerjoin, i just modified the way that uses ranges and idx to generate the result dataset in leftjoin().
I have done some testing, and this function works fine, and the run time and allocations are similar with innerjoin().

sl-solution · 2022-09-09T06:43:59Z

Please add tests for you functions. test/join.jl has a test set for range join, you may put your tests there.

…r) in LeftJoin, that the ranges and inbit do not match, so the result will have extra lines. Last version use the filter() function, which greatly affected the efficency, so i changed it.

kyo227 · 2022-09-12T15:59:52Z

I've added tests to test/join.jl and fixed some bugs.

sl-solution · 2022-09-13T06:25:30Z

This is a good PR, however, I have some concerns about the way you implemented it.
General feedback:

There are lots of code repeats
You should use innerjoin implementation. You may put all core computation of innerjoin in a new function which can handle both inner and left joins.
leftjoin and innerjoin are basically the same, except that innerjoin drops rows with no match (these rows have inbits = false)
There are no test for leftjoin! - the question is do we need range join for leftjoin!?
I added some comments for some parts of the code.

sl-solution · 2022-09-13T06:26:50Z

src/join/join.jl

+    cnt = 1
+    cnt_r = 1
+    if length(r2) == 0
+        _res = allowmissing(_res)


this shouldn't be here. If _res needs to support missing, it must be set when it is initialised.

sl-solution · 2022-09-13T06:27:28Z

src/join/join.jl

+        end
+        revised_ends = _mark_lt_part!(inbits, _columns(dsl)[left_range_col], _columns(dsr)[right_range_cols[2]], _fl, _fr, ranges, idx, new_ends, total_length < typemax(Int32) ? Val(Int32) : Val(Int64); strict = strict_inequality[2], threads = threads)   
+        inbit_ranges = Vector{UnitRange{T}}(undef, nrow(dsl))
+        ra = Vector{UnitRange{T}}(undef, nrow(dsl))


Why ra is needed?

ra is the lo:hi computed from the original new_ends, which corresponds to the inbits. For example, inbits = [1,0,1,0,0], ra = [1:1, 2:2, 3:4, 5:5], that means the 4th rows need to be deleted. And i modify new_ends later to get the right number of rows.
So i can use new_ends to calculate the lo:hi in _fill_right_cols_table_left!(). I have to calculate it in there.

if you calculate new_ends = map(x->max(length(x), 1), ranges) in innerjoin and when ranges is 1:0 the corresponding row in the left dataset be set as missing in the output then it should be fine. Additionally, when for a row all inbits are false the output dataset should produce missing. So, the point is with a little modification ofinnerjoin we can achieve left range join.

The ranges only check the left part of inequality(type = :left). For example, 2 => (:low, :high), the ranges is the result of 2 => (:low, nothing). So, maybe some rows shouldn't be in the result, but its ranges are not 1:0. So, the new_ends and the inbits can not find the correct number of rows for res. We need to calculate the lo2:hi2 to find the correct size of res. And inbits can just show how many rows can be matched, we don't know which rows don't match at all(these rows need to be add to result with missing).

you may use new_ends = map(x->max(length(x), 1), ranges) and exploits inbits to figure out which rows do not have matching rows in the right data set.

After we get the number of rows in result, the method in innerjoin use the r2.start(r2 is lo2:hi2) as the index of _res. But lo2:hi2 do not have the missing rows.

The _res[2] should be missing, but if we use the method in innerjoin, the _res[2] will be the 3rd rows.

So, in my code, i calculate the new new_ends as the index of _res. but the old new_ends is also needed, so ra is the old new_ends .

I see.
However, I am thinking something like this:
Suppose we have (:lower, :upper) condition and ranges are [2:3,1:0,4:5,1:0] so we creates new_ends as cumsum([2,1,2,1]) and inbits is [0,0,0,0,0,0]. _mark_lt_part! will works on inbits in locations [1:2, 4:5]. Suppose that the first one, i.e. 1:2 does not have any matching rows but the second one has one. So _mark_lt_part! will set the first element of ranges to 1:0 and inbits will be something like [0,0,0,0,1,0]. At this moment we know that ragnes with values 1:0 only exist in the left data set and inbits will help us to figure out the matching rows in the right data set.

But it is hard to find the correct location of _res by that way. Because in a single loop for one row, we only know the ranges, lo:hi and lo2:hi2 for that row. We don't know how many rows in previous rows because of multithreading. For your example, ranges is [1:0, 1:0, 4:5, 1:0], for 3rd rows, we just know lo:hi is 4:5, lo2:hi2 is 1:1. We can not know where the data in 3rd row goes in the _res.
And this is why I calculated lo:hi and lo2:hi2 in fornt of the loop of rows, and use map(x -> max(1, length(x)), inbit_ranges) to get the starting position of each row.

I see your point, however, new_ends shows us exactly which rows of inbits belongs to a specific row in the left table, e.g. in the example above new_ends = [2,3,5,6], thus the first row in the left table has two rows in inbits, additionally we already know that row 1 is 1:0 and it only creates one row in the output data set, and so on.

sl-solution · 2022-09-13T06:27:56Z

src/join/join.jl

+        revised_ends = _mark_lt_part!(inbits, _columns(dsl)[left_range_col], _columns(dsr)[right_range_cols[2]], _fl, _fr, ranges, idx, new_ends, total_length < typemax(Int32) ? Val(Int32) : Val(Int64); strict = strict_inequality[2], threads = threads)   
+        inbit_ranges = Vector{UnitRange{T}}(undef, nrow(dsl))
+        ra = Vector{UnitRange{T}}(undef, nrow(dsl))
+        @_threadsfor threads for i in 1:length(ranges)


put the core computations inside a separate function - it helps reducing allocations

Not sure if we need this at all.

src/join/join.jl

kyo227 · 2022-09-13T07:35:56Z

Thank you for your advice. It seems that my code still has some problems. In this part, I forgot to change this function. So ,the test will not pass.

…s the test of test/join.jl. leftjoin!() may be not necessary to implement the range join function, so i didn't change it. I write a function named _ranges_join(), which can implement the leftjoin and innerjoin. _ranges_join() use the parameter join_type to distinguished the leftjoin and innerjoin.

kyo227 · 2022-09-20T14:31:08Z

I made changes to the implementation of range join in leftjoin and it passed the test in test/join.jl.

kyo227 · 2022-09-22T05:22:44Z

Sorry, I found some problems in my code. This code is not ready yet.

kyo227 added 2 commits September 8, 2022 21:51

The inequalities in innerjoin are added to the leftjoin

1f1f93b

fix some bug

10331b2

kyo227 added 5 commits September 10, 2022 13:49

Fix some bugs for 1 => (:lower, :higher) situation.

1a203a1

I used a different method to deal with the problem of (:lower, :highe…

baf921f

…r) in LeftJoin, that the ranges and inbit do not match, so the result will have extra lines. Last version use the filter() function, which greatly affected the efficency, so i changed it.

Merge branch 'sl-solution:master' into master

9d07cee

Fix some bugs, and finish the test of leftjoin in join.jl.

c257c26

Merge branch 'master' of github.com:kyo227/InMemoryDatasets.jl

6d220e3

sl-solution reviewed Sep 13, 2022

View reviewed changes

kyo227 added 2 commits September 13, 2022 16:32

Change the _join_inner_dict to _join_left_dict.

f61ae75

Fix some problems

d12033b

sl-solution mentioned this pull request May 16, 2023

LEFT ASOF JOIN #108

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add the inequalities to the leftjoin() and leftjoin!() functions. #85

Add the inequalities to the leftjoin() and leftjoin!() functions. #85

kyo227 commented Sep 8, 2022

sl-solution commented Sep 9, 2022

kyo227 commented Sep 12, 2022

sl-solution commented Sep 13, 2022

sl-solution Sep 13, 2022

sl-solution Sep 13, 2022

kyo227 Sep 13, 2022

sl-solution Sep 13, 2022 •

edited

Loading

kyo227 Sep 13, 2022 •

edited

Loading

sl-solution Sep 13, 2022

kyo227 Sep 13, 2022

sl-solution Sep 14, 2022

kyo227 Sep 14, 2022

sl-solution Sep 17, 2022

sl-solution Sep 13, 2022

sl-solution Sep 13, 2022

kyo227 commented Sep 13, 2022

kyo227 commented Sep 20, 2022

kyo227 commented Sep 22, 2022

Add the inequalities to the leftjoin() and leftjoin!() functions. #85

Are you sure you want to change the base?

Add the inequalities to the leftjoin() and leftjoin!() functions. #85

Conversation

kyo227 commented Sep 8, 2022

sl-solution commented Sep 9, 2022

kyo227 commented Sep 12, 2022

sl-solution commented Sep 13, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sl-solution Sep 13, 2022 • edited Loading

Choose a reason for hiding this comment

kyo227 Sep 13, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kyo227 commented Sep 13, 2022

kyo227 commented Sep 20, 2022

kyo227 commented Sep 22, 2022

sl-solution Sep 13, 2022 •

edited

Loading

kyo227 Sep 13, 2022 •

edited

Loading