[FEA] API for Creating List Columns from CuPy Arrays #18214

Open
VibhuJawa opened this issue Mar 10, 2025 · 0 comments
Labels
feature request New feature or request

Is your feature request related to a problem? Please describe.

Currently, there isn't a straightforward, stable API within cuDF for converting CuPy arrays (both 1D and 2D) into list columns. This functionality is repeatedly implemented externally (such as in NeMo Curator/Crossfit), causing recurrent breakages across different cuDF releases. For instance, previously addressed fixes include:

The current external implementation (shown below) has again broken in the latest cuDF version:

https://github.com/rapidsai/crossfit/blob/745208dc50d717dba5c35f6b75cc41a4678576bb/crossfit/backend/cudf/series.py#L57-L103

def _construct_list_column(
    size: int,
    dtype: ListDtype,
    mask: Optional["Buffer"] = None,
    offset: int = 0,
    null_count: Optional[int] = None,
    children: tuple["NumericalColumn", "ColumnBase"] = (),  # type: ignore[assignment]
) -> cudf.core.column.ListColumn:
    kwargs = dict(
        size=size,
        dtype=dtype,
        mask=mask,
        offset=offset,
        null_count=null_count,
        children=children,
    )

    if not _is_cudf_gte_24_10():
        return cudf.core.column.ListColumn(**kwargs)
    else:
        # in 24.10 ListColumn added `data` kwarg see https://github.com/rapidsai/crossfit/issues/84
        return cudf.core.column.ListColumn(data=None, **kwargs)


def create_list_series_from_1d_or_2d_ar(ar, index):
    """
    Create a cuDF list series from 1D or 2D arrays.
    """
    if len(ar.shape) == 1:
        n_rows, *_ = ar.shape
        n_cols = 1
    elif len(ar.shape) == 2:
        n_rows, n_cols = ar.shape
    else:
        raise RuntimeError(f"Unexpected input shape: {ar.shape}")
    data = as_column(ar.flatten())
    offset_col = as_column(cp.arange(start=0, stop=len(data) + 1, step=n_cols), dtype="int32")
    mask = cudf.Series(cp.full(shape=n_rows, fill_value=cp.bool_(True)))._column.as_mask()

    lc = _construct_list_column(
        size=n_rows,
        dtype=cudf.ListDtype(data.dtype),
        mask=mask,
        offset=0,
        null_count=0,
        children=(offset_col, data),
    )
    return _construct_series_from_list_column(lc=lc, index=index)

This results in the following error:

  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/crossfit/backend/cudf/series.py", line 93, in create_list_series_from_1d_or_2d_ar
    offset_col = as_column(cp.arange(start=0, stop=len(data) + 1, step=n_cols), dtype="int32")
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/column/column.py", line 2483, in as_column
    col = col.astype(dtype)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/column/column.py", line 1624, in astype
    elif dtype.kind == "M":
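
For readers unfamiliar with the layout that `create_list_series_from_1d_or_2d_ar` assembles by hand: a list column stores all values in one flattened child plus an offsets array with one more entry than there are rows, and for a rectangular input the offsets advance by `n_cols`. The following is a minimal plain-Python sketch of that bookkeeping, with no GPU or cuDF involved. The function name `flatten_to_list_layout` is hypothetical and is not part of cuDF or Crossfit.

```python
def flatten_to_list_layout(rows):
    """Return (offsets, values) for a 1D or rectangular 2D nested list.

    Illustration only: mirrors the offsets arithmetic of the helper above
    using Python lists instead of CuPy arrays and cuDF columns.
    """
    if rows and not isinstance(rows[0], (list, tuple)):
        # 1D input: each element becomes a one-element list (n_cols = 1).
        rows = [[x] for x in rows]
    n_cols = len(rows[0]) if rows else 0
    if any(len(r) != n_cols for r in rows):
        raise ValueError("expected a rectangular 2D input")
    # Flattened child values, row after row.
    values = [x for r in rows for x in r]
    # Row i spans values[offsets[i]:offsets[i + 1]]; n_rows + 1 offsets in total.
    offsets = [i * n_cols for i in range(len(rows) + 1)]
    return offsets, values


offsets, values = flatten_to_list_layout([[1, 2, 3], [4, 5, 6]])
print(offsets)  # [0, 3, 6]
print(values)   # [1, 2, 3, 4, 5, 6]
```

The line that fails in the traceback is the one that builds the equivalent of `offsets` on the GPU, which is why a change in how `as_column` handles its `dtype` argument breaks the whole helper.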

Proposed Solution

Introduce a robust, well-maintained cuDF API to convert between CuPy arrays and cuDF list columns in both directions (CuPy array ↔ cuDF list). Providing this API will:

  • Reduce maintenance overhead from recurring breakages.

  • Support applications that rely on this feature, such as NeMo Curator, GNNs, and potentially RAG applications.
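
One design point for the list → array direction is that list columns may be ragged, while a 2D array needs every row to have the same length, so a conversion API would presumably have to validate this. Below is a hedged plain-Python sketch of that check, operating on an offsets list and a flattened values list. The name `list_layout_to_2d` is hypothetical; this is not an existing or proposed cuDF signature.

```python
def list_layout_to_2d(offsets, values):
    """Rebuild a rectangular 2D nested list from list offsets and flat values.

    Illustration only: row i is values[offsets[i]:offsets[i + 1]].
    Raises ValueError when the rows do not all have the same length.
    """
    spans = list(zip(offsets, offsets[1:]))
    lengths = {end - start for start, end in spans}
    if len(lengths) > 1:
        raise ValueError(
            f"ragged lists cannot form a 2D array: row lengths {sorted(lengths)}"
        )
    return [values[start:end] for start, end in spans]


print(list_layout_to_2d([0, 3, 6], [1, 2, 3, 4, 5, 6]))  # [[1, 2, 3], [4, 5, 6]]
```

Whether a real API should raise on ragged input, pad, or offer both is left open here; the sketch only shows the stricter option.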

Additional Context

Direct CuPy array ↔ cuDF list interoperability is crucial for workflows that combine RAPIDS with array-based or deep-learning libraries.

CC: @praateekmahajan, @sarahyurick
