DTable groupby #275

Merged 41 commits on Oct 15, 2021

@krynju krynju commented Sep 1, 2021

This currently supports grouping on discrete values only.
By default, all slices of each group are merged into a single chunk. You can disable merging, or set a maximum chunksize for the merged chunks so that no huge chunks are created.

Example:

julia> d = DTable((a=repeat(['a','b','c','d'], 6),), 4) # 6 chunks containing ['a','b','c','d'] each
DTable with 6 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(:a, d) # merge=true, chunksize=0
DTable with 4 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(:a, d, chunksize=1)
DTable with 24 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(:a, d, merge=false)
DTable with 24 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(:a, d, chunksize=3)
DTable with 8 partitions
Tabletype: NamedTuple

julia> d = DTable((a=repeat(collect(10:29), 6),), 4)
DTable with 30 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(d, x -> x.a % 10, chunksize=6)
DTable with 20 partitions
Tabletype: NamedTuple
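
The partition counts in the examples above can be reproduced with a simplified pure-Julia model. This is only a sketch under stated assumptions, not the code in this PR: it assumes each input chunk is sliced by key, that slices of one key are merged greedily in input order, and that a slice is never split. The names `model_groupby` and `npartitions` are hypothetical and are not part of Dagger.

```julia
# Hypothetical model of chunked grouping on a plain vector (not Dagger's code).
function model_groupby(data::AbstractVector, input_chunksize::Integer;
                       by=identity, merge::Bool=true, chunksize::Integer=0)
    T = eltype(data)
    # Split the input into consecutive chunks, mimicking DTable partitions.
    chunks = [data[i:min(i + input_chunksize - 1, length(data))]
              for i in 1:input_chunksize:length(data)]

    # Slice every chunk by key: one slice per distinct key per chunk.
    slices = Dict{Any, Vector{Vector{T}}}()
    for chunk in chunks
        per_key = Dict{Any, Vector{T}}()
        for v in chunk
            push!(get!(per_key, by(v), T[]), v)
        end
        for (k, rows) in per_key
            push!(get!(slices, k, Vector{T}[]), rows)
        end
    end
    merge || return slices  # merge=false: every slice stays its own partition

    # Assumed rule: append slices to the current chunk until the next slice
    # would push it past `chunksize`; chunksize == 0 means no limit.
    merged = Dict{Any, Vector{Vector{T}}}()
    for (k, parts) in slices
        out, current = Vector{T}[], T[]
        for p in parts
            if chunksize > 0 && !isempty(current) &&
               length(current) + length(p) > chunksize
                push!(out, current)
                current = T[]
            end
            append!(current, p)
        end
        isempty(current) || push!(out, current)
        merged[k] = out
    end
    return merged
end

# Total number of output partitions across all groups.
npartitions(g) = sum(length, values(g))

letters = repeat(['a', 'b', 'c', 'd'], 6)
npartitions(model_groupby(letters, 4))                 # 4
npartitions(model_groupby(letters, 4, chunksize=1))    # 24
npartitions(model_groupby(letters, 4, merge=false))    # 24
npartitions(model_groupby(letters, 4, chunksize=3))    # 8

numbers = repeat(collect(10:29), 6)
npartitions(model_groupby(numbers, 4, by=x -> x % 10, chunksize=6))  # 20
```

Because slices are never split in this model, a merged chunk can hold fewer rows than `chunksize`, so the partition count is not always `ceil(rows / chunksize)` per group.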

The grouping also creates an index that maps each key to the partitions holding its rows:

julia> Dagger.groupby(:a, d, chunksize=2)
DTable with 12 partitions
Tabletype: NamedTuple

julia> ans.groupby_index
Dict{Symbol, Dict{Char, Vector{Int64}}} with 1 entry:
  :a => Dict('a'=>[1, 5, 9], 'c'=>[3, 7, 11], 'd'=>[4, 8, 12], 'b'=>[2, 6, 10])
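
A minimal sketch of how such an index could be built and used to pick out the partitions of one group. It assumes the 12 output partitions are ordered a, b, c, d, a, b, ... (which is consistent with the index printed above) and uses plain vectors in place of real chunks; `partition_keys` and `partitions_for` are hypothetical names, not Dagger API.

```julia
# Key held by each of the 12 output partitions, in partition order (assumed).
partition_keys = repeat(['a', 'b', 'c', 'd'], 3)

# Build key => partition indices, the same shape as the inner Dict above.
index = Dict{Char, Vector{Int}}()
for (i, k) in enumerate(partition_keys)
    push!(get!(index, k, Int[]), i)
end

# Look up the partitions that belong to one group.
partitions_for(index, key) = get(index, key, Int[])

partitions_for(index, 'a')  # [1, 5, 9]
partitions_for(index, 'z')  # Int64[] (unknown key)
```

With a lookup like this, a per-group function only needs to visit the partitions listed for its key instead of scanning the whole table.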

todo:

  • clean up the code (50%)
  • finish the continuous values grouping
  • groupby with function input (f returns a group key)
  • figure out what the interface should look like
  • multiple columns groupby? (can be adjusted)
  • figure out how to use the index for applying functions per group
  • docs

@jpsamaroo

One initial comment: I prefer groupby(d, :a) instead of groupby(:a, d).

codecov-commenter commented Sep 5, 2021

Codecov Report

Merging #275 (62f31a6) into master (b3ec61e) will not change coverage.
The diff coverage is 0.00%.

@@           Coverage Diff           @@
##           master    #275    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files          35      37     +2     
  Lines        2939    3077   +138     
=======================================
- Misses       2939    3077   +138     

Impacted Files            Coverage Δ
src/Dagger.jl             0.00% <ø> (ø)
src/table/gdtable.jl      0.00% <0.00%> (ø)
src/table/groupby.jl      0.00% <0.00%> (ø)
src/table/operations.jl   0.00% <0.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

krynju commented Sep 16, 2021

groupby reduce example:

julia> using Dagger, DataFrames, Arrow, OnlineStats

julia> d = DTable(Arrow.Table, "data/".*readdir("data"))
DTable with 100 partitions
Tabletype: unknown (use `tabletype!(::DTable)`)

julia> tabletype!(d)
NamedTuple

julia> g = Dagger.groupby(d, x->round(x.a, digits=1), chunksize=1_000_00);

julia> g
Dagger.GDTable(DTable with 1005 partitions
Tabletype: NamedTuple, nothing, Dict(0.3 => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  91, 92, 93, 94, 95, 96, 97, 98, 99, 100], 0.5 => [101, 102, 103, 104, 105, 106, 107, 108, 109, 110  …  191, 192, 193, 194, 195, 196, 197, 198, 199, 200], 0.1 => [201, 202, 203, 204, 205, 206, 207, 208, 209, 210  …  291, 292, 293, 294, 295, 296, 297, 298, 299, 300], 1.0 => [301, 302, 303, 304, 305, 306, 307, 308, 309, 310  …  342, 343, 344, 345, 346, 347, 348, 349, 350, 351], 0.7 => [352, 353, 354, 355, 356, 357, 358, 359, 360, 361  …  442, 443, 444, 445, 446, 447, 448, 449, 450, 451], 0.4 => [452, 453, 454, 455, 456, 457, 458, 459, 460, 461  …  542, 543, 544, 545, 546, 547, 548, 549, 550, 551], 0.0 => [552, 553, 554, 555, 556, 557, 558, 559, 560, 561  …  596, 597, 598, 599, 600, 601, 602, 603, 604, 605], 0.2 => [606, 607, 608, 609, 610, 611, 612, 613, 614, 615  …  696, 697, 698, 699, 700, 701, 702, 703, 704, 705], 0.9 => [706, 707, 708, 709, 710, 711, 712, 713, 714, 715  …  796, 797, 798, 799, 800, 801, 802, 803, 804, 805], 0.8 => [806, 807, 808, 809, 810, 811, 812, 813, 814, 815  …  896, 897, 898, 899, 900, 901, 902, 903, 904, 905]…))


julia> fetch(d, DataFrame)
100000000×4 DataFrame
       Row │ a          b           c         d
           │ Float64    Float64     Float64   Float64
───────────┼─────────────────────────────────────────────
         1 │ 0.75307    0.970277    0.659855  0.750826
         2 │ 0.0998466  0.577389    0.8332    0.550904
         3 │ 0.777338   0.886394    0.71068   0.348982
         4 │ 0.272381   0.541937    0.696544  0.42361
         5 │ 0.348195   0.840752    0.373067  0.290357
     ⋮     │     ⋮          ⋮          ⋮         ⋮
  99999997 │ 0.0303199  0.307095    0.502027  0.0692123
  99999998 │ 0.0931763  0.3375      0.741798  0.4469
  99999999 │ 0.239024   0.303297    0.252106  0.758419
 100000000 │ 0.218885   0.00718995  0.423428  0.445263
                                    99999991 rows omitted


julia> DataFrame(fetch(reduce(fit!, g, init=Mean())))
11×5 DataFrame
 Row │ keys     result_a                           result_b                           result_c                           result_d
     │ Float64  Mean                               Mean                               Mean                               Mean
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.3      Mean: n=9999361 | value=0.299995   Mean: n=9999361 | value=0.500007   Mean: n=9999361 | value=0.499862   Mean: n=9999361 | value=0.500033
   2 │ 0.5      Mean: n=10000103 | value=0.499998  Mean: n=10000103 | value=0.500115  Mean: n=10000103 | value=0.499992  Mean: n=10000103 | value=0.499952
   3 │ 0.1      Mean: n=9998744 | value=0.100012   Mean: n=9998744 | value=0.499907   Mean: n=9998744 | value=0.499998   Mean: n=9998744 | value=0.500019
   4 │ 1.0      Mean: n=4998931 | value=0.975002   Mean: n=4998931 | value=0.500034   Mean: n=4998931 | value=0.500215   Mean: n=4998931 | value=0.500095
   5 │ 0.7      Mean: n=10001835 | value=0.700007  Mean: n=10001835 | value=0.499992  Mean: n=10001835 | value=0.500057  Mean: n=10001835 | value=0.499973
   6 │ 0.4      Mean: n=9993281 | value=0.399998   Mean: n=9993281 | value=0.499982   Mean: n=9993281 | value=0.500018   Mean: n=9993281 | value=0.500071
   7 │ 0.0      Mean: n=5000022 | value=0.0250048  Mean: n=5000022 | value=0.500231   Mean: n=5000022 | value=0.499983   Mean: n=5000022 | value=0.499828
   8 │ 0.2      Mean: n=9999200 | value=0.199993   Mean: n=9999200 | value=0.500191   Mean: n=9999200 | value=0.499986   Mean: n=9999200 | value=0.500006
   9 │ 0.9      Mean: n=10003295 | value=0.900009  Mean: n=10003295 | value=0.500001  Mean: n=10003295 | value=0.499859  Mean: n=10003295 | value=0.500005
  10 │ 0.8      Mean: n=10000043 | value=0.799999  Mean: n=10000043 | value=0.500003  Mean: n=10000043 | value=0.499936  Mean: n=10000043 | value=0.500154
  11 │ 0.6      Mean: n=10005185 | value=0.59999   Mean: n=10005185 | value=0.499964  Mean: n=10005185 | value=0.499976  Mean: n=10005185 | value=0.500099
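
A per-group reduce like the one above only needs an accumulator that can be updated row by row within a chunk and then combined across chunks of the same group. The following is a rough pure-Julia sketch of that idea, not the implementation in this PR and not OnlineStats itself: `RunningMean`, `update`, and `combine` are hypothetical names, and the nested vectors stand in for the chunks of each group.

```julia
# Hypothetical mergeable mean accumulator (a stand-in for OnlineStats' Mean).
struct RunningMean
    n::Int
    value::Float64
end
RunningMean() = RunningMean(0, 0.0)

# Fold one observation into the accumulator.
update(m::RunningMean, x::Real) =
    RunningMean(m.n + 1, m.value + (x - m.value) / (m.n + 1))

# Combine two partial results, weighting each mean by its count.
function combine(a::RunningMean, b::RunningMean)
    n = a.n + b.n
    n == 0 && return RunningMean()
    return RunningMean(n, (a.n * a.value + b.n * b.value) / n)
end

# Each group key maps to its chunks; each chunk is a vector of values.
grouped = Dict(
    0.3 => [[0.31, 0.29], [0.33]],
    0.5 => [[0.52], [0.48, 0.5]],
)

# Reduce every chunk on its own, then combine the partial results per group.
result = Dict(
    k => reduce(combine, [foldl(update, c; init=RunningMean()) for c in chunks])
    for (k, chunks) in grouped
)
```

Since the per-chunk folds are independent of each other, they can run in parallel, and only the small partial results need to be combined at the end.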

@krynju krynju force-pushed the kr/dtable-groupby branch 2 times, most recently from 64ae3c2 to c380d4b Compare September 25, 2021 17:14
@krynju krynju marked this pull request as ready for review September 28, 2021 06:33
@krynju krynju force-pushed the kr/dtable-groupby branch from 62f31a6 to 8066f34 Compare October 8, 2021 20:27

@jpsamaroo jpsamaroo left a comment


Awesome work!

krynju and others added 4 commits October 14, 2021 19:21

Co-authored-by: Julian Samaroo <[email protected]>
@jpsamaroo jpsamaroo merged commit 76eaefb into JuliaParallel:master Oct 15, 2021
@jpsamaroo

Thanks a bunch, this is great work!
