[FEA] Request all data in a single decompress_page_data call per read_parquet #18205
Is your feature request related to a problem? Please describe.
As of `pyarrow==18.1.0`, the pyarrow parquet writer begins writing any strings column with dictionary encoding (e.g. `use_dictionary=True`) and then falls back to plain encoding under some circumstances. I'm not sure if the fallback threshold is based on cardinality or on the size of the growing dictionary page. Either way, for high-cardinality strings columns, the pyarrow parquet writer (on by default in pandas) often produces strings columns with both dictionary-encoded and plain-encoded pages in the same column (in the same column chunk?).
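The fallback is visible in the file metadata. A minimal sketch of one way to check (column names and sizes are illustrative; writing with `version="1.0"` so that, as far as I know, dictionary pages report `PLAIN_DICTIONARY` and only fallback data pages report `PLAIN`):

```python
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
tbl = pa.table({
    "low_card": ["abc"] * n,                      # tiny dictionary, no fallback
    "high_card": [f"val-{i}" for i in range(n)],  # dictionary outgrows the page limit
})
# With format version "1.0", dictionary pages and dictionary-encoded data
# pages are both recorded as PLAIN_DICTIONARY, so a PLAIN entry in the
# encodings list should only come from fallback (plain-encoded) data pages.
pq.write_table(tbl, "fallback.parquet", version="1.0")

md = pq.ParquetFile("fallback.parquet").metadata
for i in range(md.num_columns):
    col = md.row_group(0).column(i)
    print(col.path_in_schema, col.encodings)
# Expected output, roughly:
#   low_card ('PLAIN_DICTIONARY', 'RLE')
#   high_card ('PLAIN', 'PLAIN_DICTIONARY', 'RLE')  <- mixed pages in one chunk
```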
When cudf reads a parquet file that has a mixture of dictionary and plain encoding in the same column, this results in two separate calls to `decompress_page_data` and two launches of the `unsnap` kernel in nvCOMP. Presumably the first call to `decompress_page_data` is just for the initial dictionary fragments. The performance of this first `unsnap` kernel is almost always very poor.

Here is a repro to observe the issue:
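Something along these lines (a sketch; file names, row count, and the timing harness are illustrative):

```python
import time
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import cudf

# One high-cardinality strings column; size is illustrative.
n = 10_000_000
tbl = pa.table({"s": np.arange(n).astype(str)})

pq.write_table(tbl, "dict_default.parquet")                    # dictionary encoding on (pyarrow default)
pq.write_table(tbl, "dict_off.parquet", use_dictionary=False)  # plain encoding only

for path in ("dict_default.parquet", "dict_off.parquet"):
    cudf.read_parquet(path)  # warm-up read
    t0 = time.perf_counter()
    cudf.read_parquet(path)
    print(f"{path}: {(time.perf_counter() - t0) * 1e3:.0f} ms")
```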
- default: cudf reads in 122 ms
- `use_dictionary=False`: cudf reads in 66 ms

Describe the solution you'd like
When the mixed-encoding column occurs, we could:

- group the `decompress_page_data` calls for the dictionary- and plain-encoded pages together into a single call, then continue decoding on the device

Additional context
If you look at NDS-H SF10, all the pyarrow tables have two calls to `decompress_page_data`, but for cudf only the 4 biggest tables have two calls. I don't know why cudf would produce any of these mixed-encoding columns.
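A sketch for locating the mixed-encoding columns by scanning file metadata (the directory path is a placeholder; for format 2.x files the dictionary page itself is recorded as `PLAIN`, so a `PLAIN` entry there is only suggestive of fallback):

```python
import glob
import pyarrow.parquet as pq

# Flag column chunks that report both a dictionary encoding and PLAIN.
for path in sorted(glob.glob("nds-h-sf10/*.parquet")):
    md = pq.ParquetFile(path).metadata
    for rg in range(md.num_row_groups):
        for c in range(md.num_columns):
            col = md.row_group(rg).column(c)
            encs = set(col.encodings)
            if "PLAIN" in encs and encs & {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}:
                print(path, col.path_in_schema, sorted(encs))
```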