Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support function array_distinct #5221

Open
viadea opened this issue Apr 12, 2022 · 4 comments · May be fixed by #12306
Open

[FEA] Support function array_distinct #5221

viadea opened this issue Apr 12, 2022 · 4 comments · May be fixed by #12306
Labels
feature request New feature or request good first issue Good for newcomers

Comments

@viadea
Copy link
Collaborator

viadea commented Apr 12, 2022

I wish we can support function array_distinct.

Eg:

from pyspark.sql.functions import *
df = spark.createDataFrame([(["a", "b", "a"], ["b", "c"]), (["a","a"], ["b", "c"]), (["aa"], ["b", "c"])    ], ['x', 'y'])
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
df = spark.read.parquet("/tmp/testparquet")
df.select(array_distinct(df.x).alias("distinct")).collect()

Not-supported-messages:

    ! <ArrayDistinct> array_distinct(x#58) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.ArrayDistinct
@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 12, 2022
@revans2
Copy link
Collaborator

revans2 commented Apr 14, 2022

I think CUDF already supports this through dropListDuplicates

https://github.com/rapidsai/cudf/blob/ac27757092e9ba2bc0656b6a7dfbc79ce8b5e76a/java/src/main/java/ai/rapids/cudf/ColumnView.java#L2375-L2386

We should be able to implement this without any issues, so long at dropListDuplicates supports the types.

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 19, 2022
@ttnghia ttnghia added the good first issue Good for newcomers label Nov 30, 2022
@phish3y
Copy link

phish3y commented Apr 10, 2024

I am interested in taking this. Could anyone point me in the right direction for which file (collectionOperations?) this would live in and maybe a comparable Gpu* case class (GpuArrayRemove?)?

Edit: Okay I see ArrayDistinct in the CPU version of collectionOperations so I think I'm on the right path

@revans2
Copy link
Collaborator

revans2 commented Apr 10, 2024

@phish3y happy to have you start to work on this.

https://github.com/apache/spark/blob/0d7c07047a628bd42eb53eb49935f5e3f81ea1a1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L4036

is the CPU implementation that we want to try and target. It looks like they have special case equality for NaNs and Nulls, but I am not sure if it is going to work with -0.0 vs 0.0 properly. We probably need to do some explicit testing on different versions of Spark.

The other thing to be careful of is that it appears that Spark is purposely keeping the order of the values in the array the same and only removing duplicates that come later. I am not sure if we need to replicate this functionality or not. It would be ideal if we could, but I don't think this is critical because it started to happen after a bug fix. apache/spark#33993

As for how you might be able to implement this I would suggest that you start with

https://github.com/rapidsai/cudf/blob/e727814c00ce0ae13febfeb44ca3d2db66f7f2e9/cpp/include/cudf/lists/stream_compaction.hpp#L87

using the java API for it
https://github.com/rapidsai/cudf/blob/e727814c00ce0ae13febfeb44ca3d2db66f7f2e9/java/src/main/java/ai/rapids/cudf/ColumnView.java#L2513

Then we can see what data types work well out of the box and if we have to add in some special case processing to make it work.

@warrickhe
Copy link
Contributor

Picking up this task. It would appear that we don't need to worry about order, as cudf contains the stable_distinct function which can be utilized to maintain the order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants