Benchmark work estimate #802
base: develop
Conversation
SonarCloud Quality Gate failed.
Codecov Report
@@             Coverage Diff             @@
##           develop     #802      +/-   ##
===========================================
+ Coverage    90.50%   94.19%    +3.68%
===========================================
  Files          505      401      -104
  Lines        43856    32156    -11700
===========================================
- Hits         39693    30289     -9404
+ Misses        4163     1867     -2296
I think one limitation of this technique, if I understand it properly, is that it requires two kinds of consistency:
I'm not sure about 1. for now. I think most of our algorithms do roughly the same order of magnitude of work between executors, but it means a design like the CSR SpMV (classical, imbalance, ...) with strategies would be a no-go; instead we would need an operation for each strategy and switch strategies at the core/algorithm level (which is maybe the best thing to do anyway). I don't think it's a downside, I just thought I should mention it.
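A minimal sketch of that alternative, with all names hypothetical (not Ginkgo's actual API): each strategy becomes its own operation, so each one carries a single well-defined cost, and the strategy switch moves up to the core/algorithm level.

```cpp
// Placeholder matrix type, standing in for the real CSR matrix class.
struct csr {};

// One operation per strategy, each with its own unambiguous work estimate.
void classical_spmv(const csr&) { /* classical kernel */ }
void load_balanced_spmv(const csr&) { /* load-balanced kernel */ }

// Strategy selection happens at the core/algorithm level, so every
// dispatched operation has exactly one cost model attached to it.
void spmv(const csr& mat, bool imbalanced)
{
    if (imbalanced) {
        load_balanced_spmv(mat);
    } else {
        classical_spmv(mat);
    }
}
```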
Force-pushed from e4b0524 to dea9c36 (Compare)
Note: This PR changes the Ginkgo ABI.
For details, check the full ABI diff under Artifacts here.
Force-pushed from dea9c36 to 83a73da (Compare)
Force-pushed from 83a73da to 97aa307 (Compare)
I think instead of artificially classifying operations into likely compute-bound and likely memory-bound (which is very much hardware-dependent), IMO a better approach would be to just calculate the work and memory complexities of the operations and register them for each operation. We could then have a roofline estimator (which could take in the hardware properties) to estimate whether an operation is memory-bound or compute-bound.
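A minimal sketch of that registry-plus-roofline idea; all types and names are hypothetical, and the hardware numbers are made up.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical cost model for one operation: work in flops and memory
// traffic in bytes, each as a function of the problem size n.
struct cost_model {
    double (*flops)(std::int64_t n);
    double (*bytes)(std::int64_t n);
};

// Hypothetical hardware description used by the roofline estimator.
struct machine {
    double peak_flops;      // flop/s
    double peak_bandwidth;  // byte/s
};

// Registry mapping operation names to their cost models.
std::unordered_map<std::string, cost_model> registry;

// Roofline classification: an operation is memory-bound if its arithmetic
// intensity (flops per byte) lies below the machine balance point.
bool is_memory_bound(const std::string& op, std::int64_t n, const machine& m)
{
    const auto& c = registry.at(op);
    const double intensity = c.flops(n) / c.bytes(n);
    return intensity < m.peak_flops / m.peak_bandwidth;
}

// Roofline runtime estimate: the slower of compute time and memory time.
double estimated_seconds(const std::string& op, std::int64_t n, const machine& m)
{
    const auto& c = registry.at(op);
    return std::max(c.flops(n) / m.peak_flops, c.bytes(n) / m.peak_bandwidth);
}

int main()
{
    // dot product of two length-n double vectors: 2n flops, 16n bytes read
    registry["dot"] = {[](std::int64_t n) { return 2.0 * n; },
                       [](std::int64_t n) { return 16.0 * n; }};
    machine m{1e12, 1e11};  // made-up machine: 1 Tflop/s, 100 GB/s
    // intensity 0.125 flop/byte < balance point 10 flop/byte -> memory-bound
    return is_memory_bound("dot", 1 << 20, m) ? 0 : 1;
}
```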
I don't think the classification is particularly artificial, but let me formulate it differently: there are many kernels where it doesn't make sense to talk of FLOPs, or that don't allow for a nice closed-form expression of their memory footprint or compute complexity. That is why I want to leave the option open to either not annotate kernels at all, or to annotate them with custom metrics.
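For illustration, a sketch of what such an optional annotation could look like; the names and types are hypothetical, not the interface proposed in this PR.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <variant>

// A conventional estimate for kernels where flops and bytes are well-defined.
struct flops_and_bytes {
    std::int64_t flops;
    std::int64_t bytes;
};

// A free-form metric (e.g. "nonzeros processed") for kernels without a
// meaningful flop count or closed-form memory footprint.
using custom_metrics = std::map<std::string, double>;

// A kernel annotation is conventional, custom, or absent entirely.
using work_annotation =
    std::optional<std::variant<flops_and_bytes, custom_metrics>>;

// axpy on length-n double vectors: 2n flops, x and y read, y written.
work_annotation axpy_estimate(std::int64_t n)
{
    return flops_and_bytes{
        2 * n, 3 * n * static_cast<std::int64_t>(sizeof(double))};
}

// A kernel where flops are not meaningful reports a custom metric instead.
work_annotation symbolic_factorization_estimate(std::int64_t nnz)
{
    return custom_metrics{{"nonzeros processed", static_cast<double>(nnz)}};
}

// An unannotated kernel simply provides no estimate.
work_annotation unannotated_estimate() { return std::nullopt; }
```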
Force-pushed from 97aa307 to c984d91 (Compare)
General question: how does this differ from the profiler results?
Yes, there are more precise models and exact performance counters. This is mostly aiming to provide a roofline-like approximation of the performance, to quickly highlight kernels that run significantly below expected performance and thus point users at possible optimization opportunities. We can use this framework to capture such information at the application level without running under a profiler, which requires additional tooling for analysis. The BLAS 1/2 and solver kernels should be pretty accurate; only the SpMVs undercount accesses to the input vector, which should ideally be served from cache anyway. IIRC, the footprints are equivalent to what we used in our ACM TOMS paper to report achieved bandwidths for different SpMVs and solvers.
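To make the counting concrete, here is an illustrative footprint for a CSR SpMV under the optimistic assumption mentioned above, where the input vector is counted only once because repeated accesses are served from cache. The function names are hypothetical.

```cpp
#include <cstdint>

// Illustrative memory footprint of a CSR SpMV y = A * x: each stored value
// and column index is read once, the row pointers once, x is counted only
// once per column (the optimistic, cache-served assumption), and y is
// written once.
double csr_spmv_bytes(std::int64_t num_rows, std::int64_t num_cols,
                      std::int64_t num_nonzeros, int value_size,
                      int index_size)
{
    return static_cast<double>(num_nonzeros) * (value_size + index_size)
           + static_cast<double>(num_rows + 1) * index_size  // row pointers
           + static_cast<double>(num_cols) * value_size      // input vector x
           + static_cast<double>(num_rows) * value_size;     // output vector y
}

// Achieved bandwidth from a measured runtime; comparing this against the
// machine's peak bandwidth is what flags underperforming kernels.
double achieved_bandwidth(double bytes, double seconds)
{
    return bytes / seconds;
}
```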
This PR adds work estimates to the executor Operations, implements them for a few Dense kernels, and outputs them in the benchmark loggers.

Related to #1784
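As an illustration of the kind of estimate and logger output described here, a sketch under assumed names (not the PR's actual interface): a dot product on length-n double vectors performs roughly 2n flops and moves about (2n + 1) * sizeof(double) bytes.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical work estimate attached to an operation.
struct work_estimate {
    std::int64_t flops;
    std::int64_t bytes;
};

// Dot product of two length-n double vectors: n multiplications and n - 1
// additions (counted as 2n flops); 2n values read, one result written.
work_estimate dot_estimate(std::int64_t n)
{
    return {2 * n, (2 * n + 1) * static_cast<std::int64_t>(sizeof(double))};
}

// A benchmark logger can then report achieved rates next to the runtime.
void log_operation(const char* name, work_estimate est, double seconds)
{
    std::printf("%s: %6.2f GFLOP/s, %6.2f GB/s\n", name,
                est.flops / seconds * 1e-9, est.bytes / seconds * 1e-9);
}

int main()
{
    // made-up measurement: a 1M-element dot taking 50 microseconds
    log_operation("dense::dot", dot_estimate(1 << 20), 50e-6);
}
```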