[FEA] Support function array_distinct #5221
Comments
I think CUDF already supports this through dropListDuplicates. We should be able to implement this without any issues, as long as dropListDuplicates supports the types.
I am interested in taking this. Could anyone point me in the right direction for which file ( Edit: Okay I see
@phish3y happy to have you start work on this. Spark's CPU implementation is what we want to try to target. It looks like it special-cases equality for NaNs and nulls, but I am not sure whether it handles -0.0 vs. 0.0 properly. We probably need to do some explicit testing on different versions of Spark. The other thing to be careful of is that Spark appears to purposely keep the order of the values in the array the same, removing only duplicates that come later. I am not sure if we need to replicate this functionality or not. It would be ideal if we could, but I don't think it is critical, because it started to happen after a bug fix: apache/spark#33993. As for how you might implement this, I would suggest that you start with the Java API for it. Then we can see what data types work well out of the box and whether we have to add some special-case processing to make it work.
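To make the semantics discussed in the comment above concrete, here is a minimal pure-Python sketch of an order-preserving distinct that treats all NaNs as equal and all nulls as equal. It is only a reference model for writing tests, not Spark's or cuDF's code: the name `array_distinct_model` and the `distinguish_signed_zero` flag are hypothetical, and the flag exists precisely because the thread leaves open how -0.0 and 0.0 should compare.

```python
import math


def array_distinct_model(values, distinguish_signed_zero=False):
    """Keep the first occurrence of each element, preserving input order.

    Hypothetical reference model: NaNs compare equal to each other and
    None (null) compares equal to None. Whether -0.0 and 0.0 count as the
    same value is controlled by the flag, since that is an open question.
    """

    def key(v):
        # Build a hashable comparison key that encodes the special cases.
        if v is None:
            return ("null",)
        if isinstance(v, float):
            if math.isnan(v):
                return ("nan",)
            if v == 0.0 and distinguish_signed_zero:
                # copysign exposes the sign bit that == ignores.
                return ("zero", math.copysign(1.0, v))
        return ("value", v)

    seen = set()
    out = []
    for v in values:
        k = key(v)
        if k not in seen:
            seen.add(k)
            out.append(v)  # only later duplicates are dropped
    return out
```

Running the same inputs through this model with the flag on and off gives two candidate expected results, which can then be compared against what each Spark version actually returns on the CPU.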
Picking up this task. It would appear that we don't need to worry about order, as cudf contains the stable_distinct function, which can be used to maintain the order.
I wish we could support the function array_distinct.
Eg:
Not-supported-messages: