
[FEA] Add pipelining to the NDS-H-cpp benchmarks #18206

Open
GregoryKimball opened this issue Mar 10, 2025 · 0 comments
Labels
feature request New feature or request good first issue Good for newcomers libcudf Affects libcudf (C++/CUDA) code.

Comments

@GregoryKimball
Contributor

Is your feature request related to a problem? Please describe.
In the libcudf microbenchmarks, the NDS-H-cpp benchmarks are a useful tool for studying GPU query performance.

They could also be used to study pipelining. An application can "pipeline" work on the GPU by using two or more host threads to sequence calls to the libcudf public API. Pipelining is useful in IO-heavy workloads: one thread can copy data to the GPU while another runs kernels over previously copied data, ensuring that GPU compute is not left idle during copy steps.

Describe the solution you'd like
Claude and I wrote a simple concurrent benchmark for query 5 using PTDS (per-thread default streams). We could take this idea and update it to use a CUDA stream pool. We would also want to consider how pipelining could be applied to other queries without modifying each query file.

void ndsh_q5_concurrent(nvbench::state& state)
{
  // Read the benchmark axes.
  double const scale_factor  = state.get_float64("scale_factor");
  auto const num_threads     = static_cast<int>(state.get_int64("num_threads"));
  auto const runs_per_thread = static_cast<int>(state.get_int64("runs_per_thread"));

  // Generate the required parquet files in device buffers.
  std::unordered_map<std::string, cuio_source_sink_pair> sources;
  generate_parquet_data_sources(
    scale_factor, {"customer", "orders", "lineitem", "supplier", "nation", "region"}, sources);

  BS::thread_pool threads(num_threads);

  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
    nvtxRangePushA(("ndsh_q5_concurrent " + std::to_string(num_threads) + " threads, " +
                    std::to_string(runs_per_thread) + " runs/thread, scale_factor=" +
                    std::to_string(scale_factor)).c_str());
    // Each task runs one full instance of query 5. With PTDS, each host
    // thread gets its own default stream, so runs can overlap on the GPU.
    auto query_func = [&]([[maybe_unused]] int index) {
      nvtxRangePushA("ndsh_q5");
      run_ndsh_q5(state, sources);
      nvtxRangePop();
    };

    // Pause so all tasks are queued before any thread starts, then release
    // them together and wait for completion.
    threads.pause();
    threads.detach_sequence(0, num_threads * runs_per_thread, query_func);
    threads.unpause();
    threads.wait();
    nvtxRangePop();
  });
}

NVBENCH_BENCH(ndsh_q5_concurrent)
  .set_name("ndsh_q5_concurrent")
  .add_float64_axis("scale_factor", {0.01, 0.1, 1})
  .add_int64_axis("num_threads", {2, 4})
  .add_int64_axis("runs_per_thread", {1, 4});

The profiles show that query 5 is IO-bound, yet there are still bubbles where compute is running but IO is idle. We should investigate why IO blocks kernel work in some cases.
[Image: profiler timeline of the concurrent query 5 runs]

@GregoryKimball GregoryKimball added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. good first issue Good for newcomers labels Mar 10, 2025