
[FEA] Add pipelining to the NDS-H-cpp benchmarks #18206

Open
GregoryKimball opened this issue Mar 10, 2025 · 0 comments
Labels
feature request New feature or request good first issue Good for newcomers libcudf Affects libcudf (C++/CUDA) code.

Comments

@GregoryKimball
Contributor

Is your feature request related to a problem? Please describe.
In the libcudf microbenchmarks, the NDS-H-cpp benchmarks are a useful tool for studying GPU query performance.

They could also be used to study pipelining. An application can "pipeline" work on the GPU by using two or more host threads to sequence calls to the libcudf public API. Pipelining is useful in IO-heavy workloads: one thread can copy data to the GPU while another runs kernels over previously copied data, ensuring that GPU compute is not left idle during copy steps.

Describe the solution you'd like
Claude and I wrote a simple concurrent benchmark for query 5 using PTDS (per-thread default streams). We could take this idea and update it to use a CUDA stream pool. We would also want to consider how pipelining could be applied to other queries without modifying each query file.

void ndsh_q5_concurrent(nvbench::state& state)
{
  // Read the benchmark axes.
  double const scale_factor  = state.get_float64("scale_factor");
  auto const num_threads     = static_cast<int>(state.get_int64("num_threads"));
  auto const runs_per_thread = static_cast<int>(state.get_int64("runs_per_thread"));

  // Generate the required parquet files in device buffers.
  std::unordered_map<std::string, cuio_source_sink_pair> sources;
  generate_parquet_data_sources(
    scale_factor, {"customer", "orders", "lineitem", "supplier", "nation", "region"}, sources);

  BS::thread_pool threads(num_threads);

  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
    nvtxRangePushA(("ndsh_q5_concurrent " + std::to_string(num_threads) + " threads, " +
                    std::to_string(runs_per_thread) + " runs/thread, scale_factor=" +
                    std::to_string(scale_factor)).c_str());
    // Each task runs one full instance of query 5. With PTDS, each host
    // thread gets its own default stream, so runs can overlap on the GPU.
    auto query_func = [&]([[maybe_unused]] int index) {
      nvtxRangePushA("ndsh_q5");
      run_ndsh_q5(state, sources);
      nvtxRangePop();
    };

    // Pause so all tasks are queued before any thread starts, then release
    // them together and wait for completion.
    threads.pause();
    threads.detach_sequence(0, num_threads * runs_per_thread, query_func);
    threads.unpause();
    threads.wait();
    nvtxRangePop();
  });
}

NVBENCH_BENCH(ndsh_q5_concurrent)
  .set_name("ndsh_q5_concurrent")
  .add_float64_axis("scale_factor", {0.01, 0.1, 1})
  .add_int64_axis("num_threads", {2, 4})
  .add_int64_axis("runs_per_thread", {1, 4});

The profiles show that query 5 is IO-bound, yet there are still bubbles where compute is running but IO is idle. We should investigate why IO blocks kernel work in some cases.
[Image: profiler timeline of the concurrent query 5 runs]

@GregoryKimball GregoryKimball added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. good first issue Good for newcomers labels Mar 10, 2025