Use Estrin's Scheme for polynomial evaluation #924
Comments
There is no compelling reason we use Horner's rule over any other algorithm, to my knowledge. If you provide a pull request I will certainly review it. For performance comparisons we typically use google benchmark; you'll find numerous examples in the reporting/performance folder.
@thomasahle : This is a good idea, and would be a welcome addition by me! The only thing I worry about is that it's sometimes very hard to get the AVX instructions required by the scheme to generate; probably gotta spend some time with godbolt. In any case, maybe I should try Estrin's scheme to help with the struggles I'm having in my other MR!
I remember reading somewhere that, depending on the ratio of two values (maybe the variable and something else), Horner's method was more accurate when the ratio was above some threshold x, but another method, perhaps the naive one, was more accurate below x. Now I'm going to have to go back through the books to find it!
@NAThompson I can try coding the AVX manually, but in my experience Estrin can often give a factor-2 speedup over Horner even without AVX. I honestly don't get how it happens simply from rearranging the operations, but it works for me at least 😅. Hopefully the Google benchmarks will agree.
Generally speaking, for simple code like this, trying to second-guess the optimizer results in worse code in most cases. We also have a few different methods for polynomial evaluation already, see https://www.boost.org/doc/libs/1_72_0/libs/math/doc/html/math_toolkit/tuning.html
The second-order Horner, which is the default, is already 1.5-2x faster than naive Horner.
BTW most of the code is machine generated via: https://github.com/boostorg/math/blob/develop/tools/generate_rational_code.cpp
That said, this is all old code now, so it may well be worth revisiting and testing on modern processors. There's also some performance testing in https://github.com/boostorg/math/blob/develop/reporting/performance/test_poly_method.cpp
@thomasahle : We generally use my garbage code ulps_plot.hpp. All the plots I made here are made with that tool. For an introduction to ulps plots, see here.
@jzmaddock There's a cpp program for generating the cpp programs? I guess you don't fully trust the compiler after all :D Where can I see the generated code measured for the tables?
For instance: include/boost/math/tools/detail/polynomial_horner3_9.hpp
@thomasahle : Just a quick google/benchmark I hacked up for comparison:

```cpp
// (C) Copyright Nick Thompson 2023.
// Use, modification and distribution are subject to the
// Boost Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#include <algorithm>
#include <array>
#include <cmath>
#include <complex>
#include <iostream>
#include <limits>
#include <random>
#include <type_traits>
#include <vector>
#include <benchmark/benchmark.h>

// Classic Horner: a single serial chain of fused multiply-adds.
template<typename Real, typename RealOrComplex>
inline auto horners_method(const Real* coeffs, long n, RealOrComplex z) {
    RealOrComplex result = coeffs[n - 1];
    for (long i = n - 2; i >= 0; --i) {
        if constexpr (std::is_same_v<Real, RealOrComplex>) {
            result = std::fma(result, z, coeffs[i]);
        } else {
            result = result * z + coeffs[i];
        }
    }
    return result;
}

template<class Real>
void HornersMethodRealCoeffsRealArg(benchmark::State& state)
{
    long n = state.range(0);
    std::random_device rd;
    auto seed = rd();
    std::mt19937_64 mt(seed);
    std::uniform_real_distribution<Real> unif(-10, 10);
    std::vector<Real> c(n);
    // Distinct random coefficients; std::fill would copy a single draw.
    std::generate(c.begin(), c.end(), [&]() { return unif(mt); });
    Real x = unif(mt);
    for (auto _ : state)
    {
        Real output = horners_method(c.data(), n, x);
        benchmark::DoNotOptimize(output);
        x += std::sqrt(std::numeric_limits<Real>::epsilon());
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK_TEMPLATE(HornersMethodRealCoeffsRealArg, float)->Range(1, 1 << 15)->Complexity();
BENCHMARK_TEMPLATE(HornersMethodRealCoeffsRealArg, double)->Range(1, 1 << 15)->Complexity();

template<class Real>
void HornersMethodRealCoeffsComplexArg(benchmark::State& state)
{
    long n = state.range(0);
    std::random_device rd;
    auto seed = rd();
    std::mt19937_64 mt(seed);
    std::uniform_real_distribution<Real> unif(-10, 10);
    std::vector<Real> c(n);
    std::generate(c.begin(), c.end(), [&]() { return unif(mt); });
    std::complex<Real> x{unif(mt), unif(mt)};
    for (auto _ : state)
    {
        std::complex<Real> output = horners_method(c.data(), n, x);
        benchmark::DoNotOptimize(output);
        x += std::sqrt(std::numeric_limits<Real>::epsilon());
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK_TEMPLATE(HornersMethodRealCoeffsComplexArg, float)->Range(1, 1 << 15)->Complexity();
BENCHMARK_TEMPLATE(HornersMethodRealCoeffsComplexArg, double)->Range(1, 1 << 15)->Complexity();

// Second-order Horner: two interleaved chains in z^2; the parity handling
// below keeps coefficient reads in bounds for any n >= 1.
template<typename Real, typename RealOrComplex>
inline auto second_order_horner(const Real* coeffs, long n, RealOrComplex z) {
    if (n == 1) { return RealOrComplex(coeffs[0]); }
    RealOrComplex p1 = coeffs[n - 1];
    RealOrComplex p2 = coeffs[n - 2];
    RealOrComplex zsq = z * z;
    long i = n - 3;
    for (; i >= 1; i -= 2) {
        p1 = p1 * zsq + coeffs[i];
        p2 = p2 * zsq + coeffs[i - 1];
    }
    if (i == 0) { // odd n: the leftover constant term joins the p1 chain
        p1 = p1 * zsq + coeffs[0];
        return p1 + z * p2;
    }
    return p2 + z * p1; // even n: p2 holds the even-indexed part
}

template<class Real>
void SecondOrderHornersMethodRealCoeffsRealArg(benchmark::State& state)
{
    long n = state.range(0);
    std::random_device rd;
    auto seed = rd();
    std::mt19937_64 mt(seed);
    std::uniform_real_distribution<Real> unif(-10, 10);
    std::vector<Real> c(n);
    std::generate(c.begin(), c.end(), [&]() { return unif(mt); });
    Real x = unif(mt);
    for (auto _ : state)
    {
        Real output = second_order_horner(c.data(), n, x);
        benchmark::DoNotOptimize(output);
        x += std::sqrt(std::numeric_limits<Real>::epsilon());
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK_TEMPLATE(SecondOrderHornersMethodRealCoeffsRealArg, float)->Range(1, 1 << 15)->Complexity();
BENCHMARK_TEMPLATE(SecondOrderHornersMethodRealCoeffsRealArg, double)->Range(1, 1 << 15)->Complexity();

BENCHMARK_MAIN();
```

If you'd like to add your Estrin to this it'd be pretty useful I think . . .
Ok, I ran the benchmarks now. I got pretty good results, unless I did something wrong, which is very possible. It looks like a factor of 2-3 improvement over Horner and second-order Horner for larger n.
Where I used this version of Estrin:
Note this only supports n being a power of two. I normally have n known at compile time.
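A minimal sketch of a power-of-two Estrin evaluator along these lines (illustrative only, not the exact code from this comment; the names are placeholders):

```cpp
#include <vector>

// Estrin's scheme for n a power of two: each pass pairs adjacent entries as
// lo + z^(2^k) * hi, halving the number of partial results until one remains.
template<typename Real, typename RealOrComplex>
inline RealOrComplex estrin_pow2(const Real* coeffs, long n, RealOrComplex z) {
    std::vector<RealOrComplex> scratch(coeffs, coeffs + n);
    RealOrComplex power = z;
    while (n > 1) {
        for (long i = 0; i < n / 2; ++i) {
            scratch[i] = scratch[2 * i] + power * scratch[2 * i + 1];
        }
        power *= power; // z, z^2, z^4, ...
        n /= 2;
    }
    return scratch[0];
}
```

The n/2 updates within each pass are independent of one another, so they can vectorize and pipeline; Horner's recurrence, by contrast, is one serial dependency chain, which is the usual explanation for the speedup.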
@thomasahle : Yeah, I just hacked this up really quick as an example; I should've actually made a bit of effort . . .
Here is a version that doesn't assume n is a power of two. It runs just as fast.
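A sketch of how the general-n variant can carry an odd leftover between passes (again illustrative):

```cpp
#include <vector>

// General-n Estrin: pair what we can on each pass; an odd leftover partial
// result is carried up unchanged and folded in on a later pass.
template<typename Real, typename RealOrComplex>
inline RealOrComplex estrin(const Real* coeffs, long n, RealOrComplex z) {
    std::vector<RealOrComplex> scratch(coeffs, coeffs + n);
    RealOrComplex power = z;
    while (n > 1) {
        long half = n / 2;
        for (long i = 0; i < half; ++i) {
            scratch[i] = scratch[2 * i] + power * scratch[2 * i + 1];
        }
        if (n & 1) { scratch[half] = scratch[n - 1]; }
        power *= power;
        n = half + (n & 1);
    }
    return scratch[0];
}
```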
I ran the benchmarks on an "Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz" with -O3 and -march=native.
At least in my wavelet MR, I do know the polynomial coefficients at compile time, and ostensibly that's where this method would really shine. It would just be a bit awkward in the google/benchmark file, but who cares.
If I run this on godbolt with
I wonder if the lack of compile-time unrolling is why the n=8 case above suffers...
There are also some methods that do preprocessing on the coefficients and are then able to use just n/2 multiplications per evaluation. In my experiments these are even faster than Estrin, but of course it requires a different API.
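For illustration, the classic preconditioned scheme for a degree-4 polynomial (Knuth, TAOCP 4.6.4): the constants a[0]..a[4] are solved for once from the coefficients, after which every evaluation costs only 3 multiplications (the function name is a placeholder):

```cpp
// Degree-4 evaluation with 3 multiplications after preprocessing.
// a[0]..a[4] are precomputed once from the polynomial's coefficients.
inline double eval_preconditioned4(double x, const double a[5]) {
    double y = (x + a[0]) * x + a[1]; // multiplication 1
    double w = (y + x + a[2]) * y;    // multiplication 2
    return (w + a[3]) * a[4];         // multiplication 3
}
```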
@thomasahle : I've applied your Estrin method to my wavelet MR. The runtime goes from:
to:
Mind if I add your code to that MR and add you to the copyright?
I've considered that . . . in the wavelet MR I know the polynomials ahead of time and could precondition them. However, I need to run it under
Actually the 10 vanishing moment evaluation is way more dramatic: Before:
After:
Definitely feel free to! Glad it works! I still wonder why in the Google benchmarks it was slower for small n < 10...
Isn't exp also evaluated by a polynomial?
I can find some papers on it. We can also try to test it with your ulps_plot code.
It wasn't slower; it was actually faster, just not as noticeable.
Well, the issue is relative expense. If I do the 3 vanishing moment wavelet, that corresponds to first computing z = exp(iω), and then evaluating a degree-3 polynomial in z. . . .
Good idea. Is there a well-known algorithm/implementation, say in Mathematica? I guess we don't even have to code it up if we can just copy/paste the coeffs. (Just added your code and added you to the copyright of this MR.)
You mean specifically a "precomputed coefficients" computation for exp? I haven't seen one, actually. I looked in https://core.ac.uk/download/pdf/52321127.pdf Chapter 5, but I don't see any code specifically for exp. I don't mind doing the precomputation though.
Nice!
Nah, I want to find the minimal-operation polynomial for each of the coefficient sets here. I hope
I wrote python code for the Knuth-Eve algorithm:
Here's the output:
It seems reasonably numerically stable, but you should play around with some polynomials for yourself.
@thomasahle : This is awesome. At this point I believe the first course of action is to create an MR with

@jzmaddock, @mborland: Sound sensible?
I think so. If we are really trying to avoid allocation, the user could pass a pointer to pre-allocated space so it's independent of
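One way that could look, as a sketch (estrin_noalloc and the (n + 1) / 2 scratch requirement are illustrative, following the pairing scheme sketched above):

```cpp
// Sketch: the caller supplies scratch space of at least (n + 1) / 2 elements,
// so the evaluator itself never allocates.
template<typename Real, typename RealOrComplex>
inline RealOrComplex estrin_noalloc(const Real* coeffs, long n, RealOrComplex z,
                                    RealOrComplex* scratch) {
    if (n == 1) { return RealOrComplex(coeffs[0]); }
    // The first pass reads directly from coeffs, writing partials to scratch.
    long half = n / 2;
    for (long i = 0; i < half; ++i) {
        scratch[i] = coeffs[2 * i] + z * coeffs[2 * i + 1];
    }
    if (n & 1) { scratch[half] = coeffs[n - 1]; }
    long m = half + (n & 1);
    RealOrComplex power = z * z;
    // Subsequent passes combine in place.
    while (m > 1) {
        half = m / 2;
        for (long i = 0; i < half; ++i) {
            scratch[i] = scratch[2 * i] + power * scratch[2 * i + 1];
        }
        if (m & 1) { scratch[half] = scratch[m - 1]; }
        power *= power;
        m = half + (m & 1);
    }
    return scratch[0];
}
```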
If we are using Estrin for very large, possibly unbounded, polynomials, it's probably best to just use it block-wise and combine the blocks with ordinary Horner (using a suitably large power of x). A sketch of that follows.
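For instance (a sketch reusing the estrin_noalloc helper above; the block size of 16 is an arbitrary placeholder that would need tuning):

```cpp
// Sketch: evaluate a long polynomial block-wise.  Each block of up to B
// coefficients is evaluated with Estrin; the blocks are then combined
// Horner-style in x^B, from the highest block down.
template<typename Real>
Real blockwise_estrin(const Real* coeffs, long n, Real x) {
    constexpr long B = 16;        // block size; tune per target
    Real scratch[(B + 1) / 2];
    // x^16 by repeated squaring.
    Real x2 = x * x, x4 = x2 * x2, x8 = x4 * x4, xB = x8 * x8;
    long top = ((n - 1) / B) * B; // start of the highest (possibly partial) block
    Real result = estrin_noalloc(coeffs + top, n - top, x, scratch);
    for (long start = top - B; start >= 0; start -= B) {
        result = result * xB + estrin_noalloc(coeffs + start, B, x, scratch);
    }
    return result;
}
```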
@thomasahle : I was kinda in a hurry to get this in, so I just created a pull request for this here. Obviously feel free to make PRs to the branch or otherwise add commentary!
Some random comments:

@NAThompson : the line

@thomasahle : The variable-length array in your Estrin code is a GCC-ism only; it's not part of C++ at all (though I wish it was!). I was also getting buffer-overrun errors from using

In any case, I've been playing around with some test code comparing to our existing code; here's what I'm using:
Results with msvc were basically meaningless - same time for every test - I suspect over-optimisation? gcc-cygwin did much better:
This is actually quite interesting: Estrin is quite a bit slower for small fixed-size arrays (the sort of thing we use for special function evaluation, for example), but then starts to speed ahead for larger polynomials, with a big difference at 32K size. I know I'm pretty unlikely to ever use polynomials that big, but I can't speak for @NAThompson ;)

BTW, I should add that our existing fixed-size polynomial code may well have an unfair advantage: it's loop-unrolled and carefully coded to allow for parallel execution via a second-order Horner scheme. It might be interesting to do something similar with small-order Estrin, or else have Estrin delegate to Horner for smaller sizes; a sketch of that dispatch follows.
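That delegation could be as simple as the following (the crossover point of 12 is a made-up placeholder and would have to be measured; horners_method and estrin are the routines sketched earlier):

```cpp
// Hybrid dispatch: serial Horner wins for short polynomials, Estrin's
// shallower dependency tree wins once n is large enough to pay for itself.
template<typename Real, typename RealOrComplex>
inline RealOrComplex evaluate_polynomial(const Real* coeffs, long n, RealOrComplex z) {
    constexpr long crossover = 12; // placeholder; measure on the target CPU
    return n < crossover ? RealOrComplex(horners_method(coeffs, n, z))
                         : estrin(coeffs, n, z);
}
```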
@jzmaddock : Fixing the issues you found in the pull request; should we begin migrating the discussion over there?
@NAThompson the buffer length issue may affect your PR as well - or it may be a false alarm from msvc's static analysis - but basically it actually refused to compile the (N+1)/2 sized buffer at all on the grounds that it would overflow - the first time I've seen that kind of compiler error!
@jzmaddock : Yeah, I just changed that line to be more "boosty": basically it gets allocated in a

As to the buffer overflow: I ran it under address sanitizer and it was fine . . .
According to https://www.boost.org/doc/libs/1_81_0/libs/math/doc/html/math_toolkit/rational.html Boost currently evaluates polynomials (and rational functions) using Horner's rule.

However, a simple change of algorithm, known as Estrin's scheme, is known to both improve performance and reduce numerical instability.
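For a degree-7 polynomial, for example, Estrin groups the terms pairwise so that the partial sums are independent:

p(z) = (c0 + c1·z) + (c2 + c3·z)·z² + [(c4 + c5·z) + (c6 + c7·z)·z²]·z⁴

Horner's rule evaluates this as a chain of 7 dependent multiply-adds, while Estrin's tree has depth log2(8) = 3 levels of mutually independent operations, which pipelined and SIMD hardware can exploit.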
Maybe there's a reason this method is not currently used?
Otherwise I can provide a pull request.
There are also quite a few other algorithms for polynomial evaluation (https://en.wikipedia.org/wiki/Polynomial_evaluation), though most of them require some preprocessing of the polynomial, so they are only useful if you are going to evaluate it many times. I don't know if something like that would be useful for Boost?