DTable groupby #275
Conversation
One initial comment: I prefer […]
Codecov Report

@@           Coverage Diff           @@
##           master     #275   +/-   ##
=======================================
  Coverage    0.00%    0.00%
=======================================
  Files          35       37     +2
  Lines        2939     3077   +138
=======================================
- Misses       2939     3077   +138

Continue to review full report at Codecov.
groupby reduce example:
julia> using Dagger, DataFrames, Arrow, OnlineStats
julia> d = DTable(Arrow.Table, "data/".*readdir("data"))
DTable with 100 partitions
Tabletype: unknown (use `tabletype!(::DTable)`)
julia> tabletype!(d)
NamedTuple
julia> g = Dagger.groupby(d, x->round(x.a, digits=1), chunksize=100_000);
julia> g
Dagger.GDTable(DTable with 1005 partitions
Tabletype: NamedTuple, nothing, Dict(0.3 => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 91, 92, 93, 94, 95, 96, 97, 98, 99, 100], 0.5 => [101, 102, 103, 104, 105, 106, 107, 108, 109, 110 … 191, 192, 193, 194, 195, 196, 197, 198, 199, 200], 0.1 => [201, 202, 203, 204, 205, 206, 207, 208, 209, 210 … 291, 292, 293, 294, 295, 296, 297, 298, 299, 300], 1.0 => [301, 302, 303, 304, 305, 306, 307, 308, 309, 310 … 342, 343, 344, 345, 346, 347, 348, 349, 350, 351], 0.7 => [352, 353, 354, 355, 356, 357, 358, 359, 360, 361 … 442, 443, 444, 445, 446, 447, 448, 449, 450, 451], 0.4 => [452, 453, 454, 455, 456, 457, 458, 459, 460, 461 … 542, 543, 544, 545, 546, 547, 548, 549, 550, 551], 0.0 => [552, 553, 554, 555, 556, 557, 558, 559, 560, 561 … 596, 597, 598, 599, 600, 601, 602, 603, 604, 605], 0.2 => [606, 607, 608, 609, 610, 611, 612, 613, 614, 615 … 696, 697, 698, 699, 700, 701, 702, 703, 704, 705], 0.9 => [706, 707, 708, 709, 710, 711, 712, 713, 714, 715 … 796, 797, 798, 799, 800, 801, 802, 803, 804, 805], 0.8 => [806, 807, 808, 809, 810, 811, 812, 813, 814, 815 … 896, 897, 898, 899, 900, 901, 902, 903, 904, 905]…))
julia> fetch(d, DataFrame)
100000000×4 DataFrame
Row │ a b c d
│ Float64 Float64 Float64 Float64
───────────┼─────────────────────────────────────────────
1 │ 0.75307 0.970277 0.659855 0.750826
2 │ 0.0998466 0.577389 0.8332 0.550904
3 │ 0.777338 0.886394 0.71068 0.348982
4 │ 0.272381 0.541937 0.696544 0.42361
5 │ 0.348195 0.840752 0.373067 0.290357
⋮ │ ⋮ ⋮ ⋮ ⋮
99999997 │ 0.0303199 0.307095 0.502027 0.0692123
99999998 │ 0.0931763 0.3375 0.741798 0.4469
99999999 │ 0.239024 0.303297 0.252106 0.758419
100000000 │ 0.218885 0.00718995 0.423428 0.445263
99999991 rows omitted
julia> DataFrame(fetch(reduce(fit!, g, init=Mean())))
11×5 DataFrame
Row │ keys result_a result_b result_c result_d
│ Float64 Mean… Mean… Mean… Mean…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.3 Mean: n=9999361 | value=0.299995 Mean: n=9999361 | value=0.500007 Mean: n=9999361 | value=0.499862 Mean: n=9999361 | value=0.500033
2 │ 0.5 Mean: n=10000103 | value=0.499998 Mean: n=10000103 | value=0.500115 Mean: n=10000103 | value=0.499992 Mean: n=10000103 | value=0.499952
3 │ 0.1 Mean: n=9998744 | value=0.100012 Mean: n=9998744 | value=0.499907 Mean: n=9998744 | value=0.499998 Mean: n=9998744 | value=0.500019
4 │ 1.0 Mean: n=4998931 | value=0.975002 Mean: n=4998931 | value=0.500034 Mean: n=4998931 | value=0.500215 Mean: n=4998931 | value=0.500095
5 │ 0.7 Mean: n=10001835 | value=0.700007 Mean: n=10001835 | value=0.499992 Mean: n=10001835 | value=0.500057 Mean: n=10001835 | value=0.499973
6 │ 0.4 Mean: n=9993281 | value=0.399998 Mean: n=9993281 | value=0.499982 Mean: n=9993281 | value=0.500018 Mean: n=9993281 | value=0.500071
7 │ 0.0 Mean: n=5000022 | value=0.0250048 Mean: n=5000022 | value=0.500231 Mean: n=5000022 | value=0.499983 Mean: n=5000022 | value=0.499828
8 │ 0.2 Mean: n=9999200 | value=0.199993 Mean: n=9999200 | value=0.500191 Mean: n=9999200 | value=0.499986 Mean: n=9999200 | value=0.500006
9 │ 0.9 Mean: n=10003295 | value=0.900009 Mean: n=10003295 | value=0.500001 Mean: n=10003295 | value=0.499859 Mean: n=10003295 | value=0.500005
10 │ 0.8 Mean: n=10000043 | value=0.799999 Mean: n=10000043 | value=0.500003 Mean: n=10000043 | value=0.499936 Mean: n=10000043 | value=0.500154
11 │ 0.6 Mean: n=10005185 | value=0.59999 Mean: n=10005185 | value=0.499964 Mean: n=10005185 | value=0.499976 Mean: n=10005185 | value=0.500099
Awesome work!
Thanks a bunch, this is great work!
This is for discrete values for now; for an example, see the groupby reduce session above.
By default all the slices of each group are merged into a single chunk, but you can choose not to merge, or you can set a maximum chunk size for merging so that you don't end up with huge chunks.
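A hedged sketch of those options, reusing the `DTable` `d` from above. The `chunksize` keyword appears in the session above; the `merge` keyword is an assumption based on the description ("you can choose not to merge"):

```julia
# Default: all slices of a group are merged into a single chunk.
g1 = Dagger.groupby(d, x -> round(x.a, digits=1))

# Assumed keyword: keep the per-partition group slices unmerged.
g2 = Dagger.groupby(d, x -> round(x.a, digits=1), merge=false)

# Cap the size of merged chunks, as in the session above.
g3 = Dagger.groupby(d, x -> round(x.a, digits=1), chunksize=100_000)
```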
There's also an index created:
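The `GDTable` printout above ends with a `Dict` mapping each group key to the indices of the chunks holding that group's rows. One hedged way to poke at it on the `g` from above (the `index` field name is an assumption based on that printout):

```julia
idx = g.index        # assumed field: the key => chunk-indices Dict shown above
keys(idx)            # the group keys, e.g. 0.0, 0.1, ..., 1.0
idx[0.3]             # chunk indices that hold rows with key 0.3
```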
todo:
- finish the continuous values grouping