Disable forced inlining in arithmetic #58
Conversation
Unfortunately the inlining instructions are necessary to get decent performance on many benchmarks. How are you using ForwardDiff? Chunk mode or vector mode? |
To elaborate, we had some benchmarks in which normal […] This incident prompted me to believe "if you expect the compiler to inline this, you should just manually inline it yourself, as the inlining heuristics in Base can't always be relied upon." I then applied the […] |
@timholy Would it be possible for you to link/post the code where ForwardDiff is being applied? |
The number of allocations went down with a factor of 10 000 by turning off inlining? Is it because there is some type instability and the function call acts like a barrier? |
@KristofferC, that was my first guess---I sometimes deliberately put […] |
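As an aside for readers, the "function call as a barrier" trick being alluded to can be sketched like this (a hypothetical toy example, not code from the gist; the names are made up):

```julia
# Hypothetical sketch: `unstable` is type-unstable (returns Int or
# Float64 depending on runtime data), so with everything inlined the
# loop must dispatch on an abstract Union type every iteration.
unstable(flag) = flag ? 1 : 2.0

function sum_no_barrier(flag, n)
    x = unstable(flag)        # x is inferred as Union{Int,Float64}
    s = 0.0
    for _ in 1:n
        s += x * x            # abstract dispatch each iteration
    end
    s
end

# Routing the value through a non-inlined function call lets Julia
# compile a specialized method of `kernel` for the concrete runtime
# type of `x`, so the loop body is fully type-stable.
@noinline function kernel(x, n)
    s = 0.0
    for _ in 1:n
        s += x * x
    end
    s
end

sum_with_barrier(flag, n) = kernel(unstable(flag), n)
```

Both versions compute the same result; the barrier version just pays one dynamic dispatch at the call to `kernel` instead of one per loop iteration.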
I've posted a reduced test case in this gist. The code in […] If I run with ForwardDiff master:

```julia
julia> include("perfFD.jl")
2-element Array{Float64,1}:
  123.305
 -952.366

julia> @time 1
  0.000006 seconds (148 allocations: 10.151 KB)
1

julia> @time sumA(itp.coefs, dz, irng, jrng)
  0.180838 seconds (5 allocations: 176 bytes)
2.4379931132108243e6

julia> @time gx, gy = gradsum(itp, dz, irng, jrng)
  0.223820 seconds (10 allocations: 368 bytes)
(123.30547906761386,-952.3663443995564)

julia> @time gg(dz)
  2.090166 seconds (87.67 M allocations: 2.613 GB, 21.22% gc time)
2-element Array{Float64,1}:
  123.305
 -952.366
```

If instead I use this PR:

```julia
julia> @time gg(dz)
  1.289657 seconds (9.74 M allocations: 297.263 MB, 3.88% gc time)
2-element Array{Float64,1}:
 -1072.05
   -46.4687
```
 |
An even more simplified test is the […] |
@timholy, could you try passing the […] |
I'm not exactly sure what […]

```julia
julia> @time gg(dz)
  2.042326 seconds (87.67 M allocations: 2.613 GB, 22.40% gc time)
2-element Array{Float64,1}:
  899.534
 -347.626

julia> gg = ForwardDiff.gradient(dz->sumA(itp.coefs, dz, irng, jrng), chunk_size=2);

julia> @time gg(dz)
  2.053191 seconds (87.67 M allocations: 2.613 GB, 22.30% gc time)
2-element Array{Float64,1}:
  899.534
 -347.626
```

No difference. |
Oh, and I didn't explain the obvious (to me): […] |
Also, the dependency on Interpolations is only for comparison's sake. That gist reduces the performance problem to 6 […] |
I won't have time to really dig into this code until later (I need to pull METADATA to grab Interpolations.jl), so apologies if the answer here is obvious, but for this function:

```julia
function sumA(itp, dz, irng, jrng)
    s = 0.0
    dx, dy = dz[1], dz[2]
    for j in jrng
        for i in irng
            # s += itp[i+dx,j+dy]
            s += mygetindex(itp, i+dx, j+dy)
        end
    end
    s
end
```

Is the final type of […]?

P.S. There's documentation on chunk mode vs. vector mode here. One of the practical differences between the two is the use of tuples vs. vectors for partials storage. |
P.P.S. Using chunk mode here shouldn't change anything if your input dimension is only 2; ForwardDiff.jl defaults to tuple storage for low input dimensions anyway, so manually triggering chunk mode won't behave any differently. |
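For readers unfamiliar with why the tuple-vs-vector distinction matters, here is a sketch with made-up type names (these are not ForwardDiff's actual internals):

```julia
# Sketch: two ways to carry partial derivatives alongside a value.
# Type names are invented for illustration only.
struct TuplePartialsNum{N}
    value::Float64
    partials::NTuple{N,Float64}   # immutable, isbits: stack/register friendly
end

struct VectorPartialsNum
    value::Float64
    partials::Vector{Float64}     # mutable, heap-allocated
end
```

Only the tuple-backed variant is an `isbits` type, which is what allows the compiler to keep such numbers off the heap entirely in tight loops.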
Hmm, one would have thought that should have been obvious... Helps, but still pretty slow.

Master:

```julia
1.698332 seconds (77.93 M allocations: 2.322 GB, 21.04% gc time)
```

This PR:

```julia
1.025838 seconds (130 allocations: 5.953 KB)
```

The number of allocations is way down, however! |
Would be interesting to see the performance of dual numbers here. |
That's just for one component, so I suppose you'd have to double it. But it's still faster than ForwardDiff. |
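For reference, the dual-number idea being compared is small enough to sketch here (a toy single-component implementation, not DualNumbers.jl's actual code):

```julia
# Minimal single-component dual number: carries a value and one
# derivative component, with the chain rule baked into arithmetic.
struct Dual
    val::Float64   # primal value
    der::Float64   # derivative component
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:*(a::Real, b::Dual) = Dual(a * b.val, a * b.der)

# d/dx (3x^2 + 2x) at x = 2 is 6x + 2 = 14:
x = Dual(2.0, 1.0)          # seed derivative of x w.r.t. itself
y = 3 * (x * x) + 2 * x     # y.val == 16.0, y.der == 14.0
```

As noted above, one such pass yields one gradient component, so a 2-parameter gradient needs two passes (or a two-component partials tuple, which is what ForwardDiff does).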
If I take out my […] |
Isn't […]? Ex:

```julia
julia> nt
ForwardDiff.GradientNumber{2,Float64,Tuple{Float64,Float64}}

julia> zero(eltype(nt))
0.0

julia> zero(nt)
ForwardDiff.GradientNumber{2,Float64,Tuple{Float64,Float64}}(0.0,ForwardDiff.Partials{Float64,Tuple{Float64,Float64}}((0.0,0.0)))
```
 |
Yeah, that's right. Sorry. |
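A sketch of the accumulator fix being discussed (plain `Array` indexing stands in for the gist's `mygetindex`, and an integer `dz` keeps this toy runnable; the real code passes GradientNumbers through `dz`):

```julia
# Seeding `s` from the element types keeps the loop type-stable when
# `dz` carries GradientNumbers rather than Float64, instead of starting
# from a hard-coded Float64 zero that widens on the first `+=`.
function sumA_generic(itp, dz, irng, jrng)
    s = zero(promote_type(eltype(itp), eltype(dz)))  # instead of s = 0.0
    dx, dy = dz[1], dz[2]
    for j in jrng, i in irng
        s += itp[i + dx, j + dy]   # plain getindex stands in for mygetindex
    end
    s
end
```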
Comparing this PR with master via our feeble suite of benchmarks:

Master:

```julia
julia> ack_bench = get_benchmark(ackley)
3x4 DataFrames.DataFrame
| Row | time        | func | xlen | chunk_size |
|-----|-------------|------|------|------------|
| 1   | 0.000118116 | 'f'  | 5000 | -1         |
| 2   | 0.867284    | 'g'  | 5000 | 1          |
| 3   | 0.208063    | 'g'  | 5000 | 5          |

julia> ros_bench = get_benchmark(rosenbrock)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 1.244e-5  | 'f'  | 5000 | -1         |
| 2   | 0.0974855 | 'g'  | 5000 | 1          |
| 3   | 0.0636509 | 'g'  | 5000 | 5          |

julia> logit_bench = get_benchmark(self_weighted_logit)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 6.689e-6  | 'f'  | 5000 | -1         |
| 2   | 0.131128  | 'g'  | 5000 | 1          |
| 3   | 0.0615632 | 'g'  | 5000 | 5          |
```

This PR:

```julia
julia> ack_bench = get_benchmark(ackley)
3x4 DataFrames.DataFrame
| Row | time        | func | xlen | chunk_size |
|-----|-------------|------|------|------------|
| 1   | 0.000119679 | 'f'  | 5000 | -1         |
| 2   | 1.15358     | 'g'  | 5000 | 1          |
| 3   | 0.297316    | 'g'  | 5000 | 5          |

julia> ros_bench = get_benchmark(rosenbrock)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 1.2382e-5 | 'f'  | 5000 | -1         |
| 2   | 0.547275  | 'g'  | 5000 | 1          |
| 3   | 0.214978  | 'g'  | 5000 | 5          |

julia> logit_bench = get_benchmark(self_weighted_logit)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 6.736e-6  | 'f'  | 5000 | -1         |
| 2   | 0.221926  | 'g'  | 5000 | 1          |
| 3   | 0.0869565 | 'g'  | 5000 | 5          |
```

Under the […]

I swear, the next big addition to ForwardDiff will be a real […] |
Putting the […]

```julia
ret = cm_1 * (cm_2 * itp[ixm_1,ixm_2] + c_2 * itp[ixm_1,ix_2] + cp_2 * itp[ixm_1,ixp_2])
ret += c_1 * (cm_2 * itp[ix_1,ixm_2] + c_2 * itp[ix_1,ix_2] + cp_2 * itp[ix_1,ixp_2])
ret += cp_1 * (cm_2 * itp[ixp_1,ixm_2] + c_2 * itp[ixp_1,ix_2] + cp_2 * itp[ixp_1,ixp_2])
```
 |
With the above change I get 58.44e6 allocations. Doing the math:

```julia
julia> 58.44*10^6 / (length(irng) * length(jrng))
11.999159812423812
```

means we are allocating 12 times per call. Looking at […]:

```llvm
%fx_1 = alloca %GradientNumber, align 8
%fx_2 = alloca %GradientNumber, align 8
%4 = alloca %GradientNumber, align 8
%5 = alloca %GradientNumber, align 8
%6 = alloca %GradientNumber, align 8
%7 = alloca %GradientNumber, align 8
%8 = alloca %GradientNumber, align 8
%9 = alloca %GradientNumber, align 8
%10 = alloca %GradientNumber, align 8
%11 = alloca %GradientNumber, align 8
%12 = alloca %GradientNumber, align 8
%13 = alloca %GradientNumber, align 8
```

which are the 12 allocations. |
@KristofferC That's an interesting find; I'm seeing that as well on master. In this PR, I get a bit more:

```llvm
%fx_1 = alloca %GradientNumber, align 8
%fx_2 = alloca %GradientNumber, align 8
%cm_1 = alloca %GradientNumber, align 8
%c_1 = alloca %GradientNumber, align 8
%cp_1 = alloca %GradientNumber, align 8
%cm_2 = alloca %GradientNumber, align 8
%c_2 = alloca %GradientNumber, align 8
%cp_2 = alloca %GradientNumber, align 8
%ret = alloca %GradientNumber, align 8
%4 = alloca %GradientNumber, align 8
%5 = alloca %GradientNumber, align 8
%6 = alloca %GradientNumber, align 8
%7 = alloca %GradientNumber, align 8
%8 = alloca %GradientNumber, align 8
%9 = alloca %GradientNumber, align 8
%10 = alloca %GradientNumber, align 8
%11 = alloca %GradientNumber, align 8
.
.
.
%60 = alloca %GradientNumber, align 8
%61 = alloca %GradientNumber, align 8
%62 = alloca %GradientNumber, align 8
```

Note that both of my checks for these had the […]

Master:

```julia
julia> @time mygetindex(itpcoefs, gi, gj)
  0.000003 seconds (17 allocations: 576 bytes)
ForwardDiff.GradientNumber{1,Float64,Tuple{Float64}}(0.7376966647789868,ForwardDiff.Partials{Float64,Tuple{Float64}}((0.0,)))
```

This PR:

```julia
julia> @time mygetindex(itpcoefs, gi, gj)
  0.000005 seconds (5 allocations: 192 bytes)
ForwardDiff.GradientNumber{1,Float64,Tuple{Float64}}(0.2262124251183511,ForwardDiff.Partials{Float64,Tuple{Float64}}((0.0,)))
```

With DualNumbers:

```julia
julia> @time mygetindex(itpcoefs, di, dj)
  0.000004 seconds (5 allocations: 192 bytes)
0.7376966647789868 + 0.0du
```
 |
I guess I was wrong that the 12 allocations were from […] |
I got so busy with all the work that's going into JuliaLang/METADATA.jl#3544 that I had to leave this for a while, but I just put together another test script; see the "perfFD3.jl" file in that same gist. Here are my results:

```julia
julia> include("perfFD3.jl")
Warm up @time:
  0.000004 seconds (148 allocations: 10.151 KB)
1D linear, Interpolations
  0.016062 seconds (15 allocations: 624 bytes)
1d linear, hand implementation
  0.027657 seconds (15 allocations: 624 bytes)
1d quadratic, Interpolations
  0.103341 seconds (4.00 M allocations: 122.071 MB, 16.80% gc time)
2D linear
  0.029129 seconds (15 allocations: 624 bytes)
2D quadratic
  0.353004 seconds (16.00 M allocations: 488.282 MB, 20.86% gc time)
3D linear
  0.058516 seconds (15 allocations: 688 bytes)
3D quadratic
  1.159985 seconds (52.00 M allocations: 2.325 GB, 17.65% gc time)
```

In 1d, the difference between linear and quadratic interpolation is just

```julia
c*v + cp*vp
```

vs

```julia
cm*vm + c*v + cp*vp
```

so it really looks like there's some threshold in how the compiler elides allocations. |
Woot! Adding parentheses in various places eliminates the allocation. See the latest file added to the gist; for quadratic interpolation I now get

```julia
julia> include("perfFD4.jl")
  0.082390 seconds (15 allocations: 624 bytes)
```
 |
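For anyone skimming, the regrouping is of this flavor (a hypothetical illustration; the actual change is in perfFD4.jl in the gist):

```julia
# The two forms are mathematically identical, but the explicit
# parentheses keep each intermediate a two-operand expression, which is
# the kind of shape the compiler was able to keep off the heap for
# GradientNumber-like structs. With plain numbers both behave the same.
ret_flat(cm, c, cp, vm, v, vp)    = cm*vm + c*v + cp*vp
ret_grouped(cm, c, cp, vm, v, vp) = (cm*vm) + ((c*v) + (cp*vp))
```

With integer arguments the two functions agree exactly, which makes the equivalence easy to spot-check.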
I'm seeing the same on my machine. That is to say:

linear perfFD3.jl calculations ---> master beats this PR

Thanks for all the investigative work here, @timholy & @KristofferC. |
Cross-ref JuliaLang/julia#13350 |
Nice, it's now within ~2x of hand-written gradients (which is perhaps not surprising since it's got 2 parameters). I'll look forward to seeing how this scales with a little more testing. |
I have a workload where I discovered that I got significantly better performance by deleting the `@inline` calls in the arithmetic functions. The workload is a little bit complicated, so I'll just show timing info.

Using master: […]

Using this PR: […]

It seems there are 121 remaining uses of `@inline` in the code. I bet there are many other places where it might be a good idea to remove it. (I generally try not to use it unless I've checked that it actually improves performance; I don't know whether that was done here or not.)