
LEFT ASOF JOIN #108

Closed
era127 opened this issue May 16, 2023 · 3 comments

@era127

era127 commented May 16, 2023

I was wondering if you would consider a left asof join, which is similar to the current close join. I don't believe it's possible to do a left inexact join yet.

Also, perhaps we could benchmark it against the other major packages (pandas merge_asof, data.table, polars). https://bwlewis.github.io/duckdb_and_r/asof/asof.html
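For reference, the left asof semantics being requested here match what pandas' merge_asof already does: every left row is kept, matched to the most recent right-side key at or before it, with missing values where nothing precedes it. A minimal sketch of that behaviour, using pandas rather than IMD:

```python
import pandas as pd

# Left table: a regular 1-minute grid; right table: sparse observations.
calendar = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=5, freq="min")})
data = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01 00:01:30", "2020-01-01 00:03:00"]),
    "value": [1.0, 2.0],
})

# Left asof join: keep every calendar row, take the latest value at or
# before each timestamp; rows before the first observation stay NaN.
joined = pd.merge_asof(calendar, data, on="date", direction="backward")
print(joined)
```

The two calendar rows before the first observation come back as NaN, which is exactly the "left" part of a left asof join.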

@sl-solution
Owner

sl-solution commented May 16, 2023

#85 is addressing this issue.
Sorry, I mixed it up with inequality joins.

We currently support asof joins; please see closejoin! and closejoin, the former being the in-place version.

@sl-solution sl-solution added enhancement New feature or request and removed enhancement New feature or request labels May 16, 2023
@era127
Author

era127 commented May 17, 2023

I think it is faster than the R data.table implementation.

using Dates, InMemoryDatasets, BenchmarkTools

start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2020-06-20T01:00:00")
d = start_:Dates.Minute(1):end_

# Every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
N = Int(1e2)
data = InMemoryDatasets.Dataset(date=unique(start_ .+ Dates.Minute.(trunc.(range(10000, Dates.Minute(end_ - start_).value, N)))), value=range(0, 1, N))

@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)
BenchmarkTools.Trial: 240 samples with 1 evaluation.
 Range (min … max):  15.185 ms … 29.085 ms  ┊ GC (min … max): 0.00% … 11.99%
 Time  (median):     20.594 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.872 ms ±  2.743 ms  ┊ GC (mean ± σ):  1.86% ±  4.40%

In R:

start <- as.POSIXct("2020-01-01")
end <- as.POSIXct("2021-01-01")
# Every minute
calendar <- data.frame(date = seq(from = start, to = end, by = "+1 min"))
N <- 1e3
data <- data.frame(date = start + as.difftime(seq(from = 10000, to = NROW(calendar), length.out = N), units = "mins"), value = seq(0, 1, length.out = N))

# xts
calendar.xts <- xts::xts(, order.by = calendar[["date"]])
data.xts <- xts::xts(data[["value"]], order.by = data[["date"]])
# data.table
data.table::setDTthreads(1)
data.dt <- data.table::data.table(data , key = 'date')
calendar.dt = data.table::data.table(calendar, key = 'date')
# tidytable (data.table)
calendar.tt = tidytable::as_tidytable(calendar)
data.tt = tidytable::as_tidytable(data)

microbenchmark::microbenchmark(
  merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left"),
  data.dt[calendar.dt, on = "date", roll = TRUE],
  dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))),
  dplyr::left_join(calendar.tt, data.tt, by = dplyr::join_by(closest(date >= date))),
  times = 100
)

Unit: milliseconds
                                                                               expr      min       lq     mean   median       uq       max neval
   merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left") 11.76196 12.40797 16.29457 13.94297 15.52498  45.91311   100
                                     data.dt[calendar.dt, on = "date", roll = TRUE] 49.76514 52.01861 57.98249 54.69226 57.39686 105.48279   100
      dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))) 58.75115 62.68137 70.79821 64.63340 71.47450 156.71143   100
dplyr::left_join(calendar.tt, data.tt, by = dplyr::join_by(closest(date >= date))) 58.43832 62.22406 72.83867 65.62614 84.72843 114.87093   100
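For completeness, the same benchmark setup can be expressed with pandas' merge_asof (mentioned at the top of the thread); this is a sketch of the equivalent left asof join, with timings omitted since they depend on hardware:

```python
import numpy as np
import pandas as pd

# Same shape as the Julia setup above: a minute-level calendar and a
# sparse table of N observations spread over the same period.
start, end = pd.Timestamp("2020-01-01"), pd.Timestamp("2020-06-20 01:00")
calendar = pd.DataFrame({"date": pd.date_range(start, end, freq="min")})

N = 1000
offsets = np.unique(np.trunc(np.linspace(10000, (end - start) / pd.Timedelta("1min"), N)))
data = pd.DataFrame({
    "date": start + pd.to_timedelta(offsets, unit="min"),
    "value": np.linspace(0, 1, len(offsets)),
})

# Left asof join: every calendar row survives, matched to the most
# recent observation at or before it.
joined = pd.merge_asof(calendar, data, on="date", direction="backward")
print(len(joined) == len(calendar))
```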

@era127 era127 closed this as completed May 17, 2023
@sl-solution
Owner

Since some points are being missed here, I'd like to elaborate on a few issues with the benchmarks posted at https://bwlewis.github.io/duckdb_and_r/asof/asof.html.

The cited website is misleading about the performance of packages for close match joins. The misleading part is that packages like polars need the data sets to be sorted by keys, and this information is critical for a close match join, since the bottleneck in those tests is sorting the right data set (data).

IMD is not only fast but very efficient at performing close joins. In fact, if the right data set is already sorted, IMD is about 10 times faster (about 0.028 seconds) than polars in those benchmarks (run on my 2019 Mac with 16 GB RAM).

Regarding your reported benchmark, the same problem exists: providing extra information about the right data set via setKey or orderby amounts to handing that critical information to the package. To achieve the same in IMD you can use issorted!(data, :date).

To give some idea of the performance (although for 1e3 rows it doesn't make that much difference), I reran your code with this extra information: the mean for the fastest R option is about 0.030 seconds, versus about 0.003 seconds in IMD.

using Pkg
Pkg.update()
using Dates, InMemoryDatasets, BenchmarkTools

start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2020-06-20T01:00:00")
d = start_:Dates.Minute(1):end_

# Every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
N = Int(1e3)
data = InMemoryDatasets.Dataset(date=unique(start_ .+ Dates.Minute.(trunc.(range(10000, Dates.Minute(end_ - start_).value, N)))), value=range(0, 1, N))

issorted!(data, :date)

@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)
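The sortedness point above is easy to see outside IMD as well: pandas' merge_asof, for instance, refuses unsorted keys outright rather than sorting them for you, so any fair timing has to account for who pays the sort cost. A small illustration in pandas:

```python
import pandas as pd

left = pd.DataFrame({"t": [1, 2, 3], "a": [10, 20, 30]})
right_unsorted = pd.DataFrame({"t": [3, 1], "b": [300, 100]})

# merge_asof requires sorted keys; unsorted input raises instead of
# silently sorting, making the sort cost the caller's responsibility.
try:
    pd.merge_asof(left, right_unsorted, on="t")
    raised = False
except ValueError:
    raised = True

# Sorting the right table first (the cost those benchmarks hide) makes it work.
joined = pd.merge_asof(left, right_unsorted.sort_values("t"), on="t")
print(raised, joined["b"].tolist())
```

This is the same trade-off as issorted!(data, :date) in IMD or setKey in data.table: declaring (or establishing) sort order up front is what makes the asof join itself cheap.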
