LEFT ASOF JOIN #108
We currently support asof join via closejoin; please see the example below. I think it is faster than the R data.table implementation.
using Dates, InMemoryDatasets, BenchmarkTools
start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2020-06-20T01:00:00")
d = start_:Dates.Minute(1):end_
# Every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
N = Int(1e2)
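# right table: N irregularly spaced timestamps (whole-minute offsets, starting 10000 minutes after start_) with values from 0 to 1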
data = InMemoryDatasets.Dataset(date=unique(start_ .+ Dates.Minute.(trunc.(range(10000 , Dates.Minute(end_ - start_).value, N)))), value=range(0 , 1, N))
@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)
BenchmarkTools.Trial: 240 samples with 1 evaluation.
Range (min … max): 15.185 ms … 29.085 ms ┊ GC (min … max): 0.00% … 11.99%
Time (median): 20.594 ms ┊ GC (median): 0.00%
Time (mean ± σ): 20.872 ms ± 2.743 ms ┊ GC (mean ± σ): 1.86% ± 4.40%
In R:
start <- as.POSIXct("2020-01-01")
end <- as.POSIXct("2021-01-01")
# Every minute
calendar <- data.frame(date = seq(from = start, to = end, by = "+1 min"))
N <- 1e3
data <- data.frame(date = start + as.difftime(seq(from=10000, to=NROW(calendar), length.out=N) , units = "mins"), value = seq(0,1,length.out=N))
# xts
calendar.xts <- xts::xts(, order.by = calendar[["date"]])
data.xts <- xts::xts(data[["value"]], order.by = data[["date"]])
# data.table
data.table::setDTthreads(1)
data.dt <- data.table::data.table(data, key = 'date')
calendar.dt <- data.table::data.table(calendar, key = 'date')
# tidytable (data.table)
calendar.tt <- tidytable::as_tidytable(calendar)
data.tt <- tidytable::as_tidytable(data)
microbenchmark::microbenchmark(
merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left"),
data.dt[calendar.dt, on = "date", roll = TRUE],
dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))),
dplyr::left_join(calendar.tt, data.tt, by = dplyr::join_by(closest(date >= date))),
times = 100
)
Unit: milliseconds
expr min lq mean median uq max neval
merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left") 11.76196 12.40797 16.29457 13.94297 15.52498 45.91311 100
data.dt[calendar.dt, on = "date", roll = TRUE] 49.76514 52.01861 57.98249 54.69226 57.39686 105.48279 100
dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))) 58.75115 62.68137 70.79821 64.63340 71.47450 156.71143 100
dplyr::left_join(calendar.tt, data.tt, by = dplyr::join_by(closest(date >= date))) 58.43832 62.22406 72.83867 65.62614 84.72843 114.87093 100
Since you are missing some points, I would like to elaborate on some issues with the benchmarks posted at https://bwlewis.github.io/duckdb_and_r/asof/asof.html. The cited web site is misleading about the performance of packages for doing close match joins. The misleading part is that packages like data.table perform very differently depending on whether they are given extra information about the data (for example, that it is already sorted or keyed), and the comparison does not account for this.
Regarding your reported benchmark, the same problem exists, because extra information about the right data set (its sort order, via the data.table key) is supplied on the R side but not on the Julia side. To give some idea about the performance (although for 1e3 rows this doesn't make that much difference), I reran your code with this extra information supplied and compared the result with the mean value for the fastest R option:
using Pkg
Pkg.update()
using Dates, InMemoryDatasets, BenchmarkTools
start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2020-06-20T01:00:00")
d = start_:Dates.Minute(1):end_
# Every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
N = Int(1e3)
data = InMemoryDatasets.Dataset(date=unique(start_ .+ Dates.Minute.(trunc.(range(10000 , Dates.Minute(end_ - start_).value, N)))), value=range(0 , 1, N))
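# the "extra information" referred to above: tell closejoin that data is already sorted by :date (roughly comparable to keying the data.table in the R code)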
issorted!(data, :date)
@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)
I was wondering if you would consider a left asof join, which is similar to the current close join (see the sketch below). I don't believe it's possible to do a left inexact join yet.
Also, possibly we could benchmark it against the other major packages (pandas merge_asof, data.table, polars), as in https://bwlewis.github.io/duckdb_and_r/asof/asof.html.
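To make the requested behavior concrete, here is a minimal hand-rolled sketch in plain Julia; it is not an InMemoryDatasets API, and the name left_asof_join and its signature are purely illustrative. Every left row is kept, each picks up the value of the last right row whose key is not greater than its own, and left rows with no such right row get missing.

using Dates

# Illustrative helper, not an existing InMemoryDatasets function:
# for each left key, take the value of the last right row whose key is <= the
# left key (a backward asof join); left rows with no such match get `missing`.
# Assumes right_keys is sorted ascending.
function left_asof_join(left_keys::AbstractVector, right_keys::AbstractVector,
                        right_vals::AbstractVector)
    @assert issorted(right_keys)
    out = Vector{Union{Missing, eltype(right_vals)}}(undef, length(left_keys))
    for (i, k) in enumerate(left_keys)
        j = searchsortedlast(right_keys, k)   # index of the last right key <= k, or 0
        out[i] = j == 0 ? missing : right_vals[j]
    end
    return out
end

# tiny usage example in the spirit of the calendar/data tables above
cal = collect(DateTime(2020, 1, 1):Minute(1):DateTime(2020, 1, 1, 0, 5))
rk  = [DateTime(2020, 1, 1, 0, 2), DateTime(2020, 1, 1, 0, 4)]
rv  = [0.0, 1.0]
left_asof_join(cal, rk, rv)   # missing, missing, 0.0, 0.0, 1.0, 1.0

This is only meant to pin down the requested semantics; an implementation inside the package would presumably reuse the machinery behind closejoin.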