[FEA] Support function array_distinct #5221

viadea · 2022-04-12T20:51:37Z

I wish we could support the function array_distinct.

E.g.:

from pyspark.sql.functions import array_distinct
df = spark.createDataFrame([(["a", "b", "a"], ["b", "c"]), (["a", "a"], ["b", "c"]), (["aa"], ["b", "c"])], ['x', 'y'])
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
df = spark.read.parquet("/tmp/testparquet")
df.select(array_distinct(df.x).alias("distinct")).collect()

Not-supported-messages:

    ! <ArrayDistinct> array_distinct(x#58) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.ArrayDistinct

revans2 · 2022-04-14T13:53:06Z

I think cuDF already supports this through dropListDuplicates.

https://github.com/rapidsai/cudf/blob/ac27757092e9ba2bc0656b6a7dfbc79ce8b5e76a/java/src/main/java/ai/rapids/cudf/ColumnView.java#L2375-L2386

We should be able to implement this without any issues, so long as dropListDuplicates supports the types.

phish3y · 2024-04-10T01:58:33Z

I am interested in taking this. Could anyone point me in the right direction for which file (collectionOperations?) this would live in and maybe a comparable Gpu* case class (GpuArrayRemove?)?

Edit: Okay I see ArrayDistinct in the CPU version of collectionOperations so I think I'm on the right path

revans2 · 2024-04-10T14:56:52Z

@phish3y happy to have you start to work on this.

https://github.com/apache/spark/blob/0d7c07047a628bd42eb53eb49935f5e3f81ea1a1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L4036

is the CPU implementation that we want to try and target. It looks like they have special case equality for NaNs and Nulls, but I am not sure if it is going to work with -0.0 vs 0.0 properly. We probably need to do some explicit testing on different versions of Spark.
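To make the floating-point concern concrete, here is a minimal sketch in plain Python (not Spark or cuDF code). It only shows the standard IEEE 754 comparison behavior that makes NaN and signed zero worth testing explicitly. The `edge_case_rows` list is a hypothetical set of inputs one might feed to array_distinct on both CPU and GPU; it does not assert what either implementation returns.

```python
import math

nan = float("nan")

# IEEE 754 comparison: NaN is not equal to itself ...
nan_equals_itself = (nan == nan)
# ... while negative and positive zero compare equal,
zeros_compare_equal = (-0.0 == 0.0)
# even though their sign bits differ.
neg_zero_sign = math.copysign(1.0, -0.0)
pos_zero_sign = math.copysign(1.0, 0.0)

# Hypothetical edge-case rows for an array<double> column.
# Each inner list is one array value; None is a null.
edge_case_rows = [
    [nan, 1.0, nan],    # duplicate NaNs
    [0.0, -0.0, 0.0],   # signed zeros
    [None, 2.0, None],  # duplicate nulls inside the array
    [],                 # empty array
    None,               # null array
]

print(nan_equals_itself, zeros_compare_equal, neg_zero_sign, pos_zero_sign)
```

Running the same rows through the CPU and GPU paths and comparing the results would show whether any special-case handling is needed.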

The other thing to be careful of is that it appears that Spark is purposely keeping the order of the values in the array the same and only removing duplicates that come later. I am not sure if we need to replicate this functionality or not. It would be ideal if we could, but I don't think this is critical because it started to happen after a bug fix. apache/spark#33993
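The ordering behavior described above can be sketched as a small reference model in plain Python. This is an illustration under stated assumptions, not the Spark or cuDF implementation: the function name `array_distinct_model` is hypothetical, NaNs are treated as equal to each other and nulls as equal to each other (as the comment above suggests Spark does), and -0.0 and 0.0 collapse into one value only because Python's default equality treats them as equal. Whether Spark does the same is exactly what needs testing.

```python
import math


def array_distinct_model(values):
    """Hypothetical reference model: keep the first occurrence of each
    element and preserve the original order of the survivors."""
    if values is None:
        return None  # a null array stays null
    seen = set()
    seen_nan = False
    seen_null = False
    out = []
    for v in values:
        if v is None:
            # Assumption: all nulls count as one value.
            if not seen_null:
                seen_null = True
                out.append(v)
        elif isinstance(v, float) and math.isnan(v):
            # Assumption: all NaNs count as one value.
            if not seen_nan:
                seen_nan = True
                out.append(v)
        elif v not in seen:
            # Python equality: -0.0 == 0.0, so signed zeros collapse here.
            seen.add(v)
            out.append(v)
    return out


print(array_distinct_model(["a", "b", "a"]))
print(array_distinct_model([None, 2.0, None]))
```

A model like this could serve as an oracle in a test that checks both the set of surviving values and their order.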

As for how you might be able to implement this I would suggest that you start with

https://github.com/rapidsai/cudf/blob/e727814c00ce0ae13febfeb44ca3d2db66f7f2e9/cpp/include/cudf/lists/stream_compaction.hpp#L87

using the java API for it
https://github.com/rapidsai/cudf/blob/e727814c00ce0ae13febfeb44ca3d2db66f7f2e9/java/src/main/java/ai/rapids/cudf/ColumnView.java#L2513

Then we can see what data types work well out of the box and if we have to add in some special case processing to make it work.

warrickhe · 2025-02-27T22:15:56Z

Picking up this task. It would appear that we don't need to worry about order, as cudf contains the stable_distinct function which can be utilized to maintain the order.

viadea added the "feature request" and "? - Needs Triage" labels on Apr 12, 2022

sameerz removed the "? - Needs Triage" label on Apr 19, 2022

ttnghia added the "good first issue" label on Nov 30, 2022
