DTable groupby #275

krynju · 2021-09-01T17:37:44Z

This currently supports grouping on discrete values only.
By default, all slices of a group are merged into a single chunk. You can disable merging (`merge=false`) or cap the size of merged chunks (`chunksize`) so that merging does not create huge chunks.

Example:

```julia
julia> d = DTable((a=repeat(['a','b','c','d'], 6),), 4) # 6 chunks containing ['a','b','c','d'] each
DTable with 6 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(:a, d) # merge=true, chunksize=0
DTable with 4 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(:a, d, chunksize=1)
DTable with 24 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(:a, d, merge=false)
DTable with 24 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(:a, d, chunksize=3)
DTable with 8 partitions
Tabletype: NamedTuple

julia> d = DTable((a=repeat(collect(10:29), 6),), 4)
DTable with 30 partitions
Tabletype: NamedTuple

julia> Dagger.groupby(d, x -> x.a % 10, chunksize=6)
DTable with 20 partitions
Tabletype: NamedTuple
```
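
The partition counts above are consistent with a greedy merge of each group's slices. The following is a minimal sketch in plain Julia, not the code in this PR: `merge_slices` is a hypothetical helper, and it assumes that `chunksize=0` means "no limit" and that slices are packed in order until the next one would exceed the limit.

```julia
# Hypothetical helper (not part of Dagger): pack one group's slices into
# chunks of at most `chunksize` rows. `chunksize == 0` is assumed to mean
# "merge everything into a single chunk".
function merge_slices(slices::Vector{Vector{T}}, chunksize::Integer=0) where {T}
    isempty(slices) && return Vector{Vector{T}}()
    chunksize == 0 && return [reduce(vcat, slices)]
    merged = Vector{Vector{T}}()
    current = T[]
    for s in slices
        # Close the current chunk if adding this slice would exceed the limit.
        if !isempty(current) && length(current) + length(s) > chunksize
            push!(merged, current)
            current = T[]
        end
        append!(current, s)
    end
    isempty(current) || push!(merged, current)
    return merged
end

# One group from the first example: 6 slices holding a single 'a' each.
slices_a = fill(['a'], 6)
n_default = length(merge_slices(slices_a))     # one chunk per group
n_three   = length(merge_slices(slices_a, 3))  # two chunks per group
n_one     = length(merge_slices(slices_a, 1))  # six chunks per group
```

With four groups, these per-group counts multiply out to the 4, 8 and 24 partitions reported in the session above.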

The result also carries an index that maps each group key to the partitions holding that group:

```julia
julia> Dagger.groupby(:a, d, chunksize=2)
DTable with 12 partitions
Tabletype: NamedTuple

julia> ans.groupby_index
Dict{Symbol, Dict{Char, Vector{Int64}}} with 1 entry:
  :a => Dict('a'=>[1, 5, 9], 'c'=>[3, 7, 11], 'd'=>[4, 8, 12], 'b'=>[2, 6, 10])
```
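
As an illustration of the index shape shown above, here is a small sketch in plain Julia. `build_index` is a hypothetical name, and the sketch assumes every partition holds rows for exactly one key. It is not the implementation in this PR.

```julia
# Hypothetical helper (not part of Dagger): given the group key held by each
# partition, build a key => partition-indices mapping.
function build_index(partition_keys::Vector{K}) where {K}
    index = Dict{K, Vector{Int}}()
    for (i, key) in enumerate(partition_keys)
        # Create the entry on first sight of a key, then record the partition.
        push!(get!(() -> Int[], index, key), i)
    end
    return index
end

# 12 partitions laid out as a, b, c, d, a, b, c, d, ...
idx = build_index(repeat(['a', 'b', 'c', 'd'], 3))
```

Looking up `idx['a']` then gives the partitions to visit when applying a function to that group only.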

Todo:

- clean up the code (50%)
- ~~finish the continuous values grouping~~
- groupby with function input (f returns a group key)
- figure out what the interface should look like
- multiple columns groupby? (can be adjusted)
- figure out how to use the index for applying functions per group
- docs

jpsamaroo · 2021-09-01T19:57:31Z

One initial comment: I prefer groupby(d, :a) instead of groupby(:a, d).

codecov-commenter · 2021-09-05T14:42:55Z

Codecov Report

Merging #275 (62f31a6) into master (b3ec61e) will not change coverage.
The diff coverage is 0.00%.

```diff
@@           Coverage Diff           @@
##           master    #275    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files          35      37     +2     
  Lines        2939    3077   +138     
=======================================
- Misses       2939    3077   +138
```

| Impacted Files | Coverage Δ |
|---|---|
| src/Dagger.jl | `0.00% <ø> (ø)` |
| src/table/gdtable.jl | `0.00% <0.00%> (ø)` |
| src/table/groupby.jl | `0.00% <0.00%> (ø)` |
| src/table/operations.jl | `0.00% <0.00%> (ø)` |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

krynju · 2021-09-16T18:50:01Z

groupby reduce example:

```julia
julia> using Dagger, DataFrames, Arrow, OnlineStats

julia> d = DTable(Arrow.Table, "data/".*readdir("data"))
DTable with 100 partitions
Tabletype: unknown (use `tabletype!(::DTable)`)

julia> tabletype!(d)
NamedTuple

julia> g = Dagger.groupby(d, x->round(x.a, digits=1), chunksize=1_000_00);

julia> g
Dagger.GDTable(DTable with 1005 partitions
Tabletype: NamedTuple, nothing, Dict(0.3 => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  91, 92, 93, 94, 95, 96, 97, 98, 99, 100], 0.5 => [101, 102, 103, 104, 105, 106, 107, 108, 109, 110  …  191, 192, 193, 194, 195, 196, 197, 198, 199, 200], 0.1 => [201, 202, 203, 204, 205, 206, 207, 208, 209, 210  …  291, 292, 293, 294, 295, 296, 297, 298, 299, 300], 1.0 => [301, 302, 303, 304, 305, 306, 307, 308, 309, 310  …  342, 343, 344, 345, 346, 347, 348, 349, 350, 351], 0.7 => [352, 353, 354, 355, 356, 357, 358, 359, 360, 361  …  442, 443, 444, 445, 446, 447, 448, 449, 450, 451], 0.4 => [452, 453, 454, 455, 456, 457, 458, 459, 460, 461  …  542, 543, 544, 545, 546, 547, 548, 549, 550, 551], 0.0 => [552, 553, 554, 555, 556, 557, 558, 559, 560, 561  …  596, 597, 598, 599, 600, 601, 602, 603, 604, 605], 0.2 => [606, 607, 608, 609, 610, 611, 612, 613, 614, 615  …  696, 697, 698, 699, 700, 701, 702, 703, 704, 705], 0.9 => [706, 707, 708, 709, 710, 711, 712, 713, 714, 715  …  796, 797, 798, 799, 800, 801, 802, 803, 804, 805], 0.8 => [806, 807, 808, 809, 810, 811, 812, 813, 814, 815  …  896, 897, 898, 899, 900, 901, 902, 903, 904, 905]…))

julia> fetch(d, DataFrame)
100000000×4 DataFrame
       Row │ a           b           c         d
           │ Float64     Float64     Float64   Float64   
───────────┼─────────────────────────────────────────────
         1 │ 0.75307     0.970277    0.659855  0.750826
         2 │ 0.0998466   0.577389    0.8332    0.550904
         3 │ 0.777338    0.886394    0.71068   0.348982
         4 │ 0.272381    0.541937    0.696544  0.42361
         5 │ 0.348195    0.840752    0.373067  0.290357
     ⋮     │     ⋮           ⋮          ⋮          ⋮
  99999997 │ 0.0303199   0.307095    0.502027  0.0692123
  99999998 │ 0.0931763   0.3375      0.741798  0.4469
  99999999 │ 0.239024    0.303297    0.252106  0.758419
 100000000 │ 0.218885    0.00718995  0.423428  0.445263
                                    99999991 rows omitted

julia> DataFrame(fetch(reduce(fit!, g, init=Mean())))
11×5 DataFrame
 Row │ keys     result_a                           result_b                           result_c                           result_d
     │ Float64  Mean…                              Mean…                              Mean…                              Mean…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     0.3  Mean: n=9999361 | value=0.299995   Mean: n=9999361 | value=0.500007   Mean: n=9999361 | value=0.499862   Mean: n=9999361 | value=0.500033
   2 │     0.5  Mean: n=10000103 | value=0.499998  Mean: n=10000103 | value=0.500115  Mean: n=10000103 | value=0.499992  Mean: n=10000103 | value=0.499952
   3 │     0.1  Mean: n=9998744 | value=0.100012   Mean: n=9998744 | value=0.499907   Mean: n=9998744 | value=0.499998   Mean: n=9998744 | value=0.500019
   4 │     1.0  Mean: n=4998931 | value=0.975002   Mean: n=4998931 | value=0.500034   Mean: n=4998931 | value=0.500215   Mean: n=4998931 | value=0.500095
   5 │     0.7  Mean: n=10001835 | value=0.700007  Mean: n=10001835 | value=0.499992  Mean: n=10001835 | value=0.500057  Mean: n=10001835 | value=0.499973
   6 │     0.4  Mean: n=9993281 | value=0.399998   Mean: n=9993281 | value=0.499982   Mean: n=9993281 | value=0.500018   Mean: n=9993281 | value=0.500071
   7 │     0.0  Mean: n=5000022 | value=0.0250048  Mean: n=5000022 | value=0.500231   Mean: n=5000022 | value=0.499983   Mean: n=5000022 | value=0.499828
   8 │     0.2  Mean: n=9999200 | value=0.199993   Mean: n=9999200 | value=0.500191   Mean: n=9999200 | value=0.499986   Mean: n=9999200 | value=0.500006
   9 │     0.9  Mean: n=10003295 | value=0.900009  Mean: n=10003295 | value=0.500001  Mean: n=10003295 | value=0.499859  Mean: n=10003295 | value=0.500005
  10 │     0.8  Mean: n=10000043 | value=0.799999  Mean: n=10000043 | value=0.500003  Mean: n=10000043 | value=0.499936  Mean: n=10000043 | value=0.500154
  11 │     0.6  Mean: n=10005185 | value=0.59999   Mean: n=10005185 | value=0.499964  Mean: n=10005185 | value=0.499976  Mean: n=10005185 | value=0.500099
```
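
To show the idea behind a per-group reduce without the Dagger or OnlineStats dependencies, here is a rough sketch in plain Julia. Everything in it is a stand-in: `reduce_per_group` and `update_mean` are hypothetical names, `partitions` plays the role of the grouped chunks, and `index` plays the role of the key-to-partition mapping printed for the `GDTable` above. The actual implementation runs these folds as distributed tasks, which this sketch does not attempt.

```julia
# Hypothetical stand-in for a grouped reduce: fold `op` over every value in
# the partitions that the index lists for each key.
function reduce_per_group(op, partitions::Vector{<:AbstractVector}, index::Dict; init)
    result = Dict{keytype(index), typeof(init)}()
    for (key, part_ids) in index
        acc = init
        for id in part_ids, x in partitions[id]
            acc = op(acc, x)
        end
        result[key] = acc
    end
    return result
end

# Running mean kept as a (count, mean) pair so it can be updated one value
# at a time, similar in spirit to fitting an online statistic.
update_mean((n, m), x) = (n + 1, m + (x - m) / (n + 1))

partitions = [[0.1, 0.3], [1.0, 3.0], [0.5], [5.0]]
index = Dict(:small => [1, 3], :large => [2, 4])
means = reduce_per_group(update_mean, partitions, index; init=(0, 0.0))
```

Each entry of `means` holds the number of values seen for that key together with their mean, which mirrors the `n` and `value` fields in the table above.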

jpsamaroo

Awesome work!

docs/src/dtable.md

src/table/gdtable.jl

src/table/operations.jl

jpsamaroo · 2021-10-15T02:36:53Z

Thanks a bunch, this is great work!

krynju mentioned this pull request Jun 19, 2022

DTable TODO/Ideas JuliaParallel/DTables.jl#5

jpsamaroo added enhancement table interface labels Sep 1, 2021

krynju force-pushed the kr/dtable-groupby branch 2 times, most recently from 64ae3c2 to c380d4b Compare September 25, 2021 17:14

krynju marked this pull request as ready for review September 28, 2021 06:33

krynju added 21 commits October 8, 2021 09:26

add proto1

dafc9cc

add merging

5ab554b

groupby discrete working

f562e43

separate file

f8e3cbb

single merge

0b51a8a

fix the merging algo

b3fab5d

add groupby with function input

aeb00e2

revert runtest edit

52ddb43

fix merging algo, test adjustments and cleanup

5ffc4e2

cleanup

2ef2a8d

add groupby on multiple cols

1d2b87f

rm temp fun

4717c4f

add GDTable prototype

0e0111e

add nicer reduce for grouped dtable

95439db

add map and filter for gdtable

fdf57f7

add test wip

3a57d44

add proto1

9da514a

add merging

b794710

fix the merging algo

23a8c1e

add groupby with function input

69207b6

revert runtest edit

f7c0afd

krynju added 14 commits October 8, 2021 09:26

fix merging algo, test adjustments and cleanup

e24efa5

cleanup

86e7611

add groupby on multiple cols

1c564ea

add GDTable prototype

97aaed2

add adjustments

565282d

add big groupby cleanup

452cd6f

fix branch

eb3d84e

add docs & adjustments

59684eb

add examples

d4e46fc

add docs

d64ede6

add docs and getindex

8dffe30

fix docstrings

dbd2ca4

add missing docs

cefac65

fix docs

8066f34

krynju force-pushed the kr/dtable-groupby branch from 62f31a6 to 8066f34 Compare October 8, 2021 20:27

krynju added 2 commits October 11, 2021 17:40

switch to tochunk from spawn(identity

bf543d6

fix ci?

7461bc5

jpsamaroo approved these changes Oct 12, 2021

krynju and others added 4 commits October 14, 2021 19:21

Apply suggestions from code review

b5b10b7

Co-authored-by: Julian Samaroo <[email protected]>

add review adjustments part 1

eed646e

add the rest of adjustments

9b05bc7

adjust the custom function print

680e700

jpsamaroo merged commit 76eaefb into JuliaParallel:master Oct 15, 2021
