[FEA] Request all data in a single decompress_page_data call per read_parquet #18205

Open

GregoryKimball opened this issue Mar 10, 2025 · 0 comments
Labels: cuIO, feature request, libcudf

Is your feature request related to a problem? Please describe.
As of pyarrow==18.1.0, the pyarrow parquet writer starts every strings column with dictionary encoding (i.e. use_dictionary=True, the default) and then falls back to plain encoding under some circumstances. I'm not sure whether the fallback threshold is based on cardinality or on the size of the growing dictionary page. Either way, for high-cardinality strings columns, the pyarrow parquet writer (used by default in pandas) often produces strings columns with both dictionary-encoded and plain-encoded pages in the same column (possibly even in the same column chunk).
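
To confirm the fallback, and whether the dictionary- and plain-encoded pages land in the same column chunk, one can inspect the column-chunk metadata with pyarrow. A minimal sketch, assuming `path` points at the file written in the repro below; for what it's worth, pyarrow also exposes a `dictionary_pagesize_limit` writer option (default 1 MiB), which suggests the fallback is driven by dictionary page size rather than cardinality:

    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        col = meta.row_group(rg).column(0)  # column 'a' from the repro
        # A chunk that fell back typically lists both a dictionary encoding
        # (PLAIN_DICTIONARY or RLE_DICTIONARY) and PLAIN. Caveat: some writer
        # versions also record PLAIN for the dictionary page itself, so it is
        # safest to compare against a use_dictionary=False write of the same data.
        print(rg, col.path_in_schema, col.encodings)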

When cudf reads a parquet file with a mixture of dictionary- and plain-encoded pages in the same column, the read issues two separate calls to decompress_page_data and two launches of nvCOMP's unsnap kernel. Presumably the first decompress_page_data call covers only the initial dictionary pages. The performance of this first unsnap kernel is almost always very poor.

Here is a repro to observe the issue (imports added for completeness):

    import cudf
    import cupy
    import nvtx
    import rmm

    rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
    path = '/raid/gkimball/tmp.pq'

    # make a high-cardinality strings column
    row_count = 60_000_000
    df = cudf.DataFrame({
        'a': cupy.random.randint(0, 2, row_count, dtype='int64'),
    })
    df['a'] = df['a'].cumsum()
    df['a'] = 'id_' + df['a'].astype(str)

    # write with pandas -> pyarrow with defaults
    df = df.to_pandas()
    df.to_parquet(path)

    # read with cudf (file written with default encoding settings)
    with nvtx.annotate("write default", 'yellow'):
        for _ in range(4):
            _ = cudf.read_parquet(path)

    # write with pandas -> pyarrow with dictionary encoding disabled
    df.to_parquet(path, use_dictionary=False)

    # read with cudf (file written without dictionary encoding)
    with nvtx.annotate("write no dictionary", 'yellow'):
        for _ in range(4):
            _ = cudf.read_parquet(path)

default (cudf reads in 122 ms):

[profiler timeline screenshot]

use_dictionary=False (cudf reads in 66 ms):

[profiler timeline screenshot]

Describe the solution you'd like
When a mixed-encoding column occurs, we could:

  • combine the decompress_page_data calls for the dictionary- and plain-encoded pages into a single call, and continue decoding on the device
  • decompress the dictionary pages on the host, and continue decoding on the device
  • decompress and decode the dictionary pages on the host, decompress and decode the plain pages on the device, and concatenate the results
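
Until one of those options lands, a possible user-level stopgap is to re-encode an affected file so that every page in a column chunk uses the same encoding. A minimal sketch with pyarrow; `plain_path` is a hypothetical output location, and this trades file size for read speed:

    import pyarrow.parquet as pq

    # Rewrite with dictionary encoding disabled, so no column chunk mixes
    # dictionary- and plain-encoded pages.
    table = pq.read_table(path)
    pq.write_table(table, plain_path, use_dictionary=False)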

Additional context

If you look at NDS-H SF10, all of the pyarrow-written tables show two calls to decompress_page_data, but for cudf-written files only the 4 biggest tables do. I don't know why cudf would produce any of these mixed-encoding columns; the metadata scan sketched below could help narrow that down.

SF10 tables with two decompress_page_data calls:

  • lineitem: cudf and pyarrow
  • supplier: pyarrow only
  • partsupp: pyarrow and cudf
  • part: pyarrow and cudf
  • orders: pyarrow and cudf
  • customer: pyarrow only
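
To check which writer/table combinations actually contain mixed-encoding column chunks, the parquet metadata can be scanned directly. A minimal sketch; the file path is hypothetical, and the flagging rule (a dictionary encoding plus PLAIN in one chunk) is a heuristic for what presumably drives the second decompress_page_data call, so it can over-report on files where the dictionary page itself is recorded as PLAIN:

    import pyarrow.parquet as pq

    def mixed_encoding_columns(path):
        """Yield (row_group, column, encodings) for column chunks that
        appear to mix dictionary- and plain-encoded data pages."""
        meta = pq.ParquetFile(path).metadata
        for rg in range(meta.num_row_groups):
            row_group = meta.row_group(rg)
            for c in range(row_group.num_columns):
                col = row_group.column(c)
                encodings = set(col.encodings)
                has_dict = bool(encodings & {'PLAIN_DICTIONARY', 'RLE_DICTIONARY'})
                if has_dict and 'PLAIN' in encodings:
                    yield rg, col.path_in_schema, col.encodings

    # hypothetical path to one of the NDS-H table files
    for hit in mixed_encoding_columns('/raid/gkimball/lineitem.pq'):
        print(hit)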