
LEFT ASOF JOIN #108

Closed
era127 opened this issue May 16, 2023 · 3 comments

@era127

era127 commented May 16, 2023

I was wondering if you would consider a left asof join, which is similar to the current close join. I don't believe it's possible to do a left inexact join yet.

Also, perhaps we could benchmark it against the other major packages (pandas merge_asof, data.table, polars). https://bwlewis.github.io/duckdb_and_r/asof/asof.html
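For reference, the left asof semantics being requested here match what pandas' merge_asof already does: every left row is kept, matched to the most recent right-side key at or before it, with missing values where nothing precedes it. A minimal sketch of that behaviour, using pandas rather than IMD:

```python
import pandas as pd

# Left table: a regular 1-minute grid; right table: sparse observations.
calendar = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=5, freq="min")})
data = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01 00:01:30", "2020-01-01 00:03:00"]),
    "value": [1.0, 2.0],
})

# Left asof join: keep every calendar row, take the latest value at or
# before each timestamp; rows before the first observation stay NaN.
joined = pd.merge_asof(calendar, data, on="date", direction="backward")
print(joined)
```

The two calendar rows before the first observation come back as NaN, which is exactly the "left" part of a left asof join.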

@sl-solution
Owner

sl-solution commented May 16, 2023

#85 is addressing this issue.
Sorry, I mixed it up with inequality joins.

We currently support asof joins; please see closejoin! and closejoin, the former being the in-place version.

@sl-solution sl-solution added enhancement New feature or request and removed enhancement New feature or request labels May 16, 2023
@era127
Author

era127 commented May 17, 2023

I think it is faster than the R data.table implementation.

using Dates, InMemoryDatasets, BenchmarkTools

start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2020-06-20T01:00:00")
d = start_:Dates.Minute(1):end_

# Every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
N = Int(1e2)
data = InMemoryDatasets.Dataset(date=unique(start_ .+ Dates.Minute.(trunc.(range(10000, Dates.Minute(end_ - start_).value, N)))), value=range(0, 1, N))

@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)
BenchmarkTools.Trial: 240 samples with 1 evaluation.
 Range (min … max):  15.185 ms … 29.085 ms  ┊ GC (min … max): 0.00% … 11.99%
 Time  (median):     20.594 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.872 ms ±  2.743 ms  ┊ GC (mean ± σ):  1.86% ±  4.40%

In R:

start <- as.POSIXct("2020-01-01")
end <- as.POSIXct("2021-01-01")
# Every minute
calendar <- data.frame(date = seq(from = start, to = end, by = "+1 min"))
N <- 1e3
data <- data.frame(date = start + as.difftime(seq(from = 10000, to = NROW(calendar), length.out = N), units = "mins"), value = seq(0, 1, length.out = N))

# xts
calendar.xts <- xts::xts(, order.by = calendar[["date"]])
data.xts <- xts::xts(data[["value"]], order.by = data[["date"]])
# data.table
data.table::setDTthreads(1)
data.dt <- data.table::data.table(data , key = 'date')
calendar.dt = data.table::data.table(calendar, key = 'date')
# tidytable (data.table)
calendar.tt = tidytable::as_tidytable(calendar)
data.tt = tidytable::as_tidytable(data)

microbenchmark::microbenchmark(
  merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left"),
  data.dt[calendar.dt, on = "date", roll = TRUE],
  dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))),
  dplyr::left_join(calendar.tt, data.tt, by = dplyr::join_by(closest(date >= date))),
  times = 100
)

Unit: milliseconds
                                                                               expr      min       lq     mean   median       uq       max neval
   merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left") 11.76196 12.40797 16.29457 13.94297 15.52498  45.91311   100
                                     data.dt[calendar.dt, on = "date", roll = TRUE] 49.76514 52.01861 57.98249 54.69226 57.39686 105.48279   100
      dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))) 58.75115 62.68137 70.79821 64.63340 71.47450 156.71143   100
dplyr::left_join(calendar.tt, data.tt, by = dplyr::join_by(closest(date >= date))) 58.43832 62.22406 72.83867 65.62614 84.72843 114.87093   100
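For completeness, the same benchmark setup can be expressed with pandas' merge_asof (mentioned at the top of the thread); this is a sketch of the equivalent left asof join, with timings omitted since they depend on hardware:

```python
import numpy as np
import pandas as pd

# Same shape as the Julia setup above: a minute-level calendar and a
# sparse table of N observations spread over the same period.
start, end = pd.Timestamp("2020-01-01"), pd.Timestamp("2020-06-20 01:00")
calendar = pd.DataFrame({"date": pd.date_range(start, end, freq="min")})

N = 1000
offsets = np.unique(np.trunc(np.linspace(10000, (end - start) / pd.Timedelta("1min"), N)))
data = pd.DataFrame({
    "date": start + pd.to_timedelta(offsets, unit="min"),
    "value": np.linspace(0, 1, len(offsets)),
})

# Left asof join: every calendar row survives, matched to the most
# recent observation at or before it.
joined = pd.merge_asof(calendar, data, on="date", direction="backward")
print(len(joined) == len(calendar))
```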

@era127 era127 closed this as completed May 17, 2023
@sl-solution
Owner

Since some points are being missed here, I'd like to elaborate on a few issues with the benchmarks posted at https://bwlewis.github.io/duckdb_and_r/asof/asof.html.

The cited website is misleading about the performance of packages for close match joins. The misleading part is that packages like polars need the data sets to be sorted by keys, and this information is critical for a close match join, since the bottleneck in those tests is sorting the right data set (data).

IMD is not only fast but very efficient at performing close joins. In fact, if the right data set is already sorted, IMD is about 10 times faster (about 0.028 seconds) than polars in those benchmarks (run on my 2019 Mac with 16 GB RAM).

Regarding your reported benchmark, the same problem exists: providing extra information about the right data set via setKey or orderby amounts to handing that critical information to the package. To achieve the same in IMD you can use issorted!(data, :date).

To give some idea of the performance (although for 1e3 rows it doesn't make that much difference), I reran your code with this extra information: the mean for the fastest R option is about 0.030 seconds, versus about 0.003 seconds in IMD.

using Pkg
Pkg.update()
using Dates, InMemoryDatasets, BenchmarkTools

start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2020-06-20T01:00:00")
d = start_:Dates.Minute(1):end_

# Every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
N = Int(1e3)
data = InMemoryDatasets.Dataset(date=unique(start_ .+ Dates.Minute.(trunc.(range(10000, Dates.Minute(end_ - start_).value, N)))), value=range(0, 1, N))

issorted!(data, :date)

@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)
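The sortedness point above is easy to see outside IMD as well: pandas' merge_asof, for instance, refuses unsorted keys outright rather than sorting them for you, so any fair timing has to account for who pays the sort cost. A small illustration in pandas:

```python
import pandas as pd

left = pd.DataFrame({"t": [1, 2, 3], "a": [10, 20, 30]})
right_unsorted = pd.DataFrame({"t": [3, 1], "b": [300, 100]})

# merge_asof requires sorted keys; unsorted input raises instead of
# silently sorting, making the sort cost the caller's responsibility.
try:
    pd.merge_asof(left, right_unsorted, on="t")
    raised = False
except ValueError:
    raised = True

# Sorting the right table first (the cost those benchmarks hide) makes it work.
joined = pd.merge_asof(left, right_unsorted.sort_values("t"), on="t")
print(raised, joined["b"].tolist())
```

This is the same trade-off as issorted!(data, :date) in IMD or setKey in data.table: declaring (or establishing) sort order up front is what makes the asof join itself cheap.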
