[FEA] Request all data in a single decompress_page_data call per read_parquet #18205

Open

GregoryKimball opened this issue Mar 10, 2025 · 0 comments
Labels: cuIO, feature request, libcudf

Is your feature request related to a problem? Please describe.
As of pyarrow==18.1.0, the pyarrow parquet writer starts every strings column with dictionary encoding (i.e. use_dictionary=True, the default) and then falls back to plain encoding under some circumstances. I'm not sure whether the fallback threshold is based on cardinality or on the size of the growing dictionary page. Either way, for high-cardinality strings columns, the pyarrow parquet writer (used by default in pandas) often produces strings columns with both dictionary-encoded and plain-encoded pages in the same column (possibly even in the same column chunk).
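
To confirm the fallback, and whether the dictionary- and plain-encoded pages land in the same column chunk, one can inspect the column-chunk metadata with pyarrow. A minimal sketch, assuming `path` points at the file written in the repro below; for what it's worth, pyarrow also exposes a `dictionary_pagesize_limit` writer option (default 1 MiB), which suggests the fallback is driven by dictionary page size rather than cardinality:

    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        col = meta.row_group(rg).column(0)  # column 'a' from the repro
        # A chunk that fell back typically lists both a dictionary encoding
        # (PLAIN_DICTIONARY or RLE_DICTIONARY) and PLAIN. Caveat: some writer
        # versions also record PLAIN for the dictionary page itself, so it is
        # safest to compare against a use_dictionary=False write of the same data.
        print(rg, col.path_in_schema, col.encodings)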

When cudf reads a parquet file with a mixture of dictionary- and plain-encoded pages in the same column, the read issues two separate calls to decompress_page_data and two launches of nvCOMP's unsnap kernel. Presumably the first decompress_page_data call covers only the initial dictionary pages. The performance of this first unsnap kernel is almost always very poor.

Here is a repro to observe the issue (imports added for completeness):

    import cudf
    import cupy
    import nvtx
    import rmm

    rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
    path = '/raid/gkimball/tmp.pq'

    # make a high-cardinality strings column
    row_count = 60_000_000
    df = cudf.DataFrame({
        'a': cupy.random.randint(0, 2, row_count, dtype='int64'),
    })
    df['a'] = df['a'].cumsum()
    df['a'] = 'id_' + df['a'].astype(str)

    # write with pandas -> pyarrow with defaults
    df = df.to_pandas()
    df.to_parquet(path)

    # read with cudf (file written with default encoding settings)
    with nvtx.annotate("write default", 'yellow'):
        for _ in range(4):
            _ = cudf.read_parquet(path)

    # write with pandas -> pyarrow with dictionary encoding disabled
    df.to_parquet(path, use_dictionary=False)

    # read with cudf (file written without dictionary encoding)
    with nvtx.annotate("write no dictionary", 'yellow'):
        for _ in range(4):
            _ = cudf.read_parquet(path)

default (cudf reads in 122 ms):

[profiler timeline screenshot]

use_dictionary=False (cudf reads in 66 ms):

[profiler timeline screenshot]

Describe the solution you'd like
When a mixed-encoding column occurs, we could:

  • combine the decompress_page_data calls for the dictionary- and plain-encoded pages into a single call, and continue decoding on the device
  • decompress the dictionary pages on the host, and continue decoding on the device
  • decompress and decode the dictionary pages on the host, decompress and decode the plain pages on the device, and concatenate the results
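
Until one of those options lands, a possible user-level stopgap is to re-encode an affected file so that every page in a column chunk uses the same encoding. A minimal sketch with pyarrow; `plain_path` is a hypothetical output location, and this trades file size for read speed:

    import pyarrow.parquet as pq

    # Rewrite with dictionary encoding disabled, so no column chunk mixes
    # dictionary- and plain-encoded pages.
    table = pq.read_table(path)
    pq.write_table(table, plain_path, use_dictionary=False)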

Additional context

If you look at NDS-H SF10, all of the pyarrow-written tables show two calls to decompress_page_data, but for cudf-written files only the 4 biggest tables do. I don't know why cudf would produce any of these mixed-encoding columns; the metadata scan sketched below could help narrow that down.

SF10 tables with two decompress_page_data calls:

  • lineitem: cudf and pyarrow
  • supplier: pyarrow only
  • partsupp: pyarrow and cudf
  • part: pyarrow and cudf
  • orders: pyarrow and cudf
  • customer: pyarrow only
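
To check which writer/table combinations actually contain mixed-encoding column chunks, the parquet metadata can be scanned directly. A minimal sketch; the file path is hypothetical, and the flagging rule (a dictionary encoding plus PLAIN in one chunk) is a heuristic for what presumably drives the second decompress_page_data call, so it can over-report on files where the dictionary page itself is recorded as PLAIN:

    import pyarrow.parquet as pq

    def mixed_encoding_columns(path):
        """Yield (row_group, column, encodings) for column chunks that
        appear to mix dictionary- and plain-encoded data pages."""
        meta = pq.ParquetFile(path).metadata
        for rg in range(meta.num_row_groups):
            row_group = meta.row_group(rg)
            for c in range(row_group.num_columns):
                col = row_group.column(c)
                encodings = set(col.encodings)
                has_dict = bool(encodings & {'PLAIN_DICTIONARY', 'RLE_DICTIONARY'})
                if has_dict and 'PLAIN' in encodings:
                    yield rg, col.path_in_schema, col.encodings

    # hypothetical path to one of the NDS-H table files
    for hit in mixed_encoding_columns('/raid/gkimball/lineitem.pq'):
        print(hit)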