
Disable forced inlining in arithmetic #58

Closed
timholy wants to merge 1 commit

Conversation

@timholy
Contributor

timholy commented Sep 26, 2015

I have a workload where I discovered that I got significantly better performance by deleting the @inline annotations on the arithmetic functions. The workload is a bit complicated, so I'll just show the timing info.

Using master:

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

  7.253977 seconds (138.38 M allocations: 7.238 GB, 20.80% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  5.894770 seconds (137.43 M allocations: 7.197 GB, 24.73% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  5.984184 seconds (137.43 M allocations: 7.197 GB, 24.45% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  5.896002 seconds (137.43 M allocations: 7.197 GB, 24.55% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

Using this PR:

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

  5.137296 seconds (969.36 k allocations: 1.095 GB, 2.14% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  4.066025 seconds (15.38 k allocations: 1.054 GB, 4.84% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  3.983676 seconds (14.87 k allocations: 1.054 GB, 1.79% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  4.142620 seconds (14.87 k allocations: 1.054 GB, 2.05% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

It seems there are 121 remaining uses of @inline in the code. I bet there are many other places where it would be a good idea to remove it. (I generally try not to use it unless I've checked that it actually improves performance; I don't know whether that was done here or not.)
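For the record, a rough sketch of how one might reproduce that count (the "src" path is an assumption about the package layout):

function count_inline(dir)
    n = 0
    for (root, _, files) in walkdir(dir), f in files
        endswith(f, ".jl") || continue
        # count every textual occurrence of @inline in this source file
        n += length(collect(eachmatch(r"@inline", read(joinpath(root, f), String))))
    end
    return n
end

count_inline("src")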

@mlubin
Contributor

mlubin commented Sep 26, 2015

Unfortunately the inlining instructions are necessary to get decent performance on many benchmarks. How are you using ForwardDiff? Chunk mode or vector mode?

@jrevels
Member

jrevels commented Sep 26, 2015

To elaborate, we had some benchmarks in which normal Dual numbers were beating out single-epsilon GradientNumbers for no discernible reason (they are the same theoretical construction); it turns out that inlining the arithmetic operations fixed it.

This incident prompted me to believe "if you expect the compiler to inline this, you should just manually inline it yourself as the inlining heuristics in Base can't always be relied upon." I then applied the @inline macro to a lot of places where the function body was "small." I'm now not necessarily certain this was the right strategy, and I was probably more heavy-handed in my approach than I should've been...
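The pattern in question looks roughly like this (GradNum here is a stand-in, not the package's actual GradientNumber type):

import Base: +

struct GradNum{T<:Real}
    value::T    # real part
    partial::T  # single epsilon component
end

# Force inlining of the small arithmetic method rather than trusting
# the inlining heuristics in Base.
@inline +(a::GradNum, b::GradNum) = GradNum(a.value + b.value, a.partial + b.partial)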

@timholy Would it be possible for you to link/post the code where ForwardDiff is being applied?

@KristofferC
Collaborator

The number of allocations went down by a factor of 10,000 from turning off inlining? Is it because there is some type instability and the function call acts as a barrier?

@timholy
Contributor Author

timholy commented Sep 26, 2015

@KristofferC, that was my first guess---I sometimes deliberately put @noinline on functions that act as a barrier---but I haven't yet had time to track it down. I'll see if I can post something.
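For reference, a hypothetical sketch (not code from this PR) of the @noinline barrier pattern I mean:

# The @noinline boundary keeps a type-unstable caller from dragging
# boxed values into the hot inner kernel.
@noinline kernel(v::Vector{Float64}) = sum(v)

function driver(containers::Vector{Any})
    total = 0.0
    for c in containers
        total += kernel(c)  # dynamic dispatch resolves here, once per element
    end
    return total
end

driver(Any[rand(10), rand(5)])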

@timholy
Contributor Author

timholy commented Sep 27, 2015

I've posted a reduced test case in this gist. The code in mygetindex comes from macroexpanding the @generated functions in Interpolations (I tested with the 0.3.0 release of Interpolations).

If I run with ForwardDiff master:

julia> include("perfFD.jl")
2-element Array{Float64,1}:
  123.305
 -952.366

julia> @time 1
  0.000006 seconds (148 allocations: 10.151 KB)
1

julia> @time sumA(itp.coefs, dz, irng, jrng)
  0.180838 seconds (5 allocations: 176 bytes)
2.4379931132108243e6

julia> @time gx, gy = gradsum(itp, dz, irng, jrng)
  0.223820 seconds (10 allocations: 368 bytes)
(123.30547906761386,-952.3663443995564)

julia> @time gg(dz)
  2.090166 seconds (87.67 M allocations: 2.613 GB, 21.22% gc time)
2-element Array{Float64,1}:
  123.305
 -952.366

If instead I use this PR:

julia> @time gg(dz)
  1.289657 seconds (9.74 M allocations: 297.263 MB, 3.88% gc time)
2-element Array{Float64,1}:
 -1072.05  
   -46.4687

@timholy
Contributor Author

timholy commented Sep 27, 2015

An even more simplified test is the perfFD2.jl script in that same gist. It reduces the trouble to a single (complicated) line.

@mlubin
Contributor

mlubin commented Sep 27, 2015

@timholy, could you try passing the chunk_size argument to ForwardDiff.gradient? That will likely improve performance significantly.

@timholy
Contributor Author

timholy commented Sep 27, 2015

I'm not exactly sure what chunk_size should be or what it does, but since my input is 2-dimensional (i.e., length(dz) = 2) I tried chunk_size=2.

julia> @time gg(dz)
  2.042326 seconds (87.67 M allocations: 2.613 GB, 22.40% gc time)
2-element Array{Float64,1}:
  899.534
 -347.626

julia> gg = ForwardDiff.gradient(dz->sumA(itp.coefs, dz, irng, jrng), chunk_size=2);

julia> @time gg(dz)
  2.053191 seconds (87.67 M allocations: 2.613 GB, 22.30% gc time)
2-element Array{Float64,1}:
  899.534
 -347.626

No difference.

@timholy
Contributor Author

timholy commented Sep 27, 2015

Oh, and I didn't explain the obvious (to me): gradsum corresponds to a hand-written gradient computation, and it's about 10x faster than ForwardDiff. However, with all the memory allocation, I think we're not yet at the point of finding the performance limits of ForwardDiff.

@timholy
Contributor Author

timholy commented Sep 27, 2015

Also, the dependency on Interpolations is only for comparison's sake. That gist reduces the performance problem to 6 GradientNumbers, 6 integers, 9 Float64s, 12 multiplies, and 8 adds.

@jrevels
Member

jrevels commented Sep 27, 2015

I won't have time to really dig into this code until later (I need to pull METADATA to grab Interpolations.jl), so apologies if the answer here is obvious, but for this function:

function sumA(itp, dz, irng, jrng)
    s = 0.0
    dx, dy = dz[1], dz[2]
    for j in jrng
        for i in irng
            # s += itp[i+dx,j+dy]
            s += mygetindex(itp, i+dx, j+dy)
        end
    end
    s
end

Is the final type of s going to be a GradientNumber if dz holds GradientNumbers? If so, you're going to run into type instability/inference problems there due to initializing s as a Float64 instead of writing something like s = zero(eltype(dz)).
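A minimal sketch of that fix, reusing mygetindex from the gist:

function sumA(itp, dz, irng, jrng)
    s = zero(eltype(dz))  # accumulator starts as a GradientNumber, not a Float64
    dx, dy = dz[1], dz[2]
    for j in jrng, i in irng
        s += mygetindex(itp, i + dx, j + dy)
    end
    return s
end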

P.S. There's documentation on chunk mode vs. vector mode here. One of the practical differences of the two is the use of tuples vs. vectors for partials storage.

@jrevels
Member

jrevels commented Sep 27, 2015

P.P.S. Using chunk mode here shouldn't change anything if your input dimension is only 2; ForwardDiff.jl defaults to tuple storage for low input dimensions anyway, so manually triggering chunk mode won't have any additional effect.

@timholy
Contributor Author

timholy commented Sep 27, 2015

Is the final type of s going to be a GradientNumber if dz holds GradientNumbers? If so, you're going to run into type instability/inferencing problems there due to initializing s as a float instead of saying something like s=zero(eltype(dz)).

Hmm, one would have thought that should have been obvious...

Helps, but still pretty slow. Master:

  1.698332 seconds (77.93 M allocations: 2.322 GB, 21.04% gc time)

This PR:

  1.025838 seconds (130 allocations: 5.953 KB)

Number of allocations is way down, however!

@mlubin
Contributor

mlubin commented Sep 27, 2015

Would be interesting to see the performance of dual numbers here.

@timholy
Contributor Author

timholy commented Sep 27, 2015

julia> @time sumA(itp.coefs, dzd, irng, jrng)
dz = DualNumbers.Dual{Float64}[0.302642145419342 + 1.0du,0.9309871095886213 + 0.0du]
  0.231094 seconds (54 allocations: 2.344 KB)
2.4333389979457306e6 + 298.16242299983907du

That's just for one component, so I suppose you'd have to double it. But it's still faster than ForwardDiff.

@timholy
Contributor Author

timholy commented Sep 27, 2015

If I take out my @show dz addition, then DualNumbers is non-allocating for this problem. So it's not a type-stability problem with my code.

@KristofferC
Collaborator

Isn't zero(eltype(dz)) wrong? The eltype of a GradientNumber is just the type of its real part, no?

Ex:

julia> nt
ForwardDiff.GradientNumber{2,Float64,Tuple{Float64,Float64}}

julia> zero(eltype(nt))
0.0

julia> zero(nt)
ForwardDiff.GradientNumber{2,Float64,Tuple{Float64,Float64}}(0.0,ForwardDiff.Partials{Float64,Tuple{Float64,Float64}}((0.0,0.0)))

@jrevels
Member

jrevels commented Sep 27, 2015

Isn't zero(eltype(dz)) wrong? eltype of a GradientNumber is just the type of it's real part, no?

dz is a Vector{G<:GradientNumber}, unless I'm mistaken.

@KristofferC
Collaborator

Yeah, that's right. Sorry.

@jrevels
Member

jrevels commented Sep 28, 2015

Comparing this PR with master via our feeble suite of benchmarks:

Master:

julia> ack_bench = get_benchmark(ackley)
3x4 DataFrames.DataFrame
| Row | time        | func | xlen | chunk_size |
|-----|-------------|------|------|------------|
| 1   | 0.000118116 | 'f'  | 5000 | -1         |
| 2   | 0.867284    | 'g'  | 5000 | 1          |
| 3   | 0.208063    | 'g'  | 5000 | 5          |

julia> ros_bench = get_benchmark(rosenbrock)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 1.244e-5  | 'f'  | 5000 | -1         |
| 2   | 0.0974855 | 'g'  | 5000 | 1          |
| 3   | 0.0636509 | 'g'  | 5000 | 5          |

julia> logit_bench = get_benchmark(self_weighted_logit)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 6.689e-6  | 'f'  | 5000 | -1         |
| 2   | 0.131128  | 'g'  | 5000 | 1          |
| 3   | 0.0615632 | 'g'  | 5000 | 5          |

This PR:

julia> ack_bench = get_benchmark(ackley)
3x4 DataFrames.DataFrame
| Row | time        | func | xlen | chunk_size |
|-----|-------------|------|------|------------|
| 1   | 0.000119679 | 'f'  | 5000 | -1         |
| 2   | 1.15358     | 'g'  | 5000 | 1          |
| 3   | 0.297316    | 'g'  | 5000 | 5          |

julia> ros_bench = get_benchmark(rosenbrock)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 1.2382e-5 | 'f'  | 5000 | -1         |
| 2   | 0.547275  | 'g'  | 5000 | 1          |
| 3   | 0.214978  | 'g'  | 5000 | 5          |

julia> logit_bench = get_benchmark(self_weighted_logit)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 6.736e-6  | 'f'  | 5000 | -1         |
| 2   | 0.221926  | 'g'  | 5000 | 1          |
| 3   | 0.0869565 | 'g'  | 5000 | 5          |

Under the func column, 'f' denotes the original function evaluation time, while 'g' denotes the gradient evaluation time.

I swear, the next big addition to ForwardDiff will be a real ForwardDiffBenchmarks module...with docs.

@KristofferC
Collaborator

Splitting the ret line across three rows reduces the number of allocations by about 20 million, like so:

ret = cm_1 * (cm_2 * itp[ixm_1,ixm_2] + c_2 * itp[ixm_1,ix_2] + cp_2 * itp[ixm_1,ixp_2])
ret += c_1 * (cm_2 * itp[ix_1,ixm_2] + c_2 * itp[ix_1,ix_2] + cp_2 * itp[ix_1,ixp_2])
ret += cp_1 * (cm_2 * itp[ixp_1,ixm_2] + c_2 * itp[ixp_1,ix_2] + cp_2 * itp[ixp_1,ixp_2])

@KristofferC
Collaborator

With the above change I get 58.44e6 allocations. Doing the math:

julia> 58.44*10^6 / (length(irng) * length(jrng))
11.999159812423812

means we are allocating 12 times per call. Looking at @code_llvm on mygetindex shows

  %fx_1 = alloca %GradientNumber, align 8
  %fx_2 = alloca %GradientNumber, align 8
  %4 = alloca %GradientNumber, align 8
  %5 = alloca %GradientNumber, align 8
  %6 = alloca %GradientNumber, align 8
  %7 = alloca %GradientNumber, align 8
  %8 = alloca %GradientNumber, align 8
  %9 = alloca %GradientNumber, align 8
  %10 = alloca %GradientNumber, align 8
  %11 = alloca %GradientNumber, align 8
  %12 = alloca %GradientNumber, align 8
  %13 = alloca %GradientNumber, align 8

which are the 12 allocations.
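For reference, the inspection call is simply the following (itpcoefs, gi, and gj being the coefficient array and GradientNumber-valued indices from the gist setup; names here follow the later comments):

@code_llvm mygetindex(itpcoefs, gi, gj)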

@jrevels
Member

jrevels commented Sep 28, 2015

@KristofferC That's an interesting find; I'm seeing that as well on master. With this PR, I get a bit more:

  %fx_1 = alloca %GradientNumber, align 8
  %fx_2 = alloca %GradientNumber, align 8
  %cm_1 = alloca %GradientNumber, align 8
  %c_1 = alloca %GradientNumber, align 8
  %cp_1 = alloca %GradientNumber, align 8
  %cm_2 = alloca %GradientNumber, align 8
  %c_2 = alloca %GradientNumber, align 8
  %cp_2 = alloca %GradientNumber, align 8
  %ret = alloca %GradientNumber, align 8
  %4 = alloca %GradientNumber, align 8
  %5 = alloca %GradientNumber, align 8
  %6 = alloca %GradientNumber, align 8
  %7 = alloca %GradientNumber, align 8
  %8 = alloca %GradientNumber, align 8
  %9 = alloca %GradientNumber, align 8
  %10 = alloca %GradientNumber, align 8
  %11 = alloca %GradientNumber, align 8
.
.
.
  %60 = alloca %GradientNumber, align 8
  %61 = alloca %GradientNumber, align 8
  %62 = alloca %GradientNumber, align 8

Note that both of my checks for these had the ret optimization you recently posted. Comparing timings:

Master:

julia> @time mygetindex(itpcoefs, gi, gj)
  0.000003 seconds (17 allocations: 576 bytes)
ForwardDiff.GradientNumber{1,Float64,Tuple{Float64}}(0.7376966647789868,ForwardDiff.Partials{Float64,Tuple{Float64}}((0.0,)))

This PR:

julia> @time mygetindex(itpcoefs, gi, gj)
  0.000005 seconds (5 allocations: 192 bytes)
ForwardDiff.GradientNumber{1,Float64,Tuple{Float64}}(0.2262124251183511,ForwardDiff.Partials{Float64,Tuple{Float64}}((0.0,)))

with DualNumbers:

julia> @time mygetindex(itpcoefs, di, dj)
  0.000004 seconds (5 allocations: 192 bytes)
0.7376966647789868 + 0.0du

@KristofferC
Collaborator

I guess I was wrong that the 12 allocations come from alloca, then? I read some of the LLVM manual, and alloca seems to allocate on the stack, whereas @time reports heap allocations.

@timholy
Contributor Author

timholy commented Sep 28, 2015

I got so busy with all the work going into JuliaLang/METADATA.jl#3544 that I had to leave this for a while, but I've just put together another test script; see the perfFD3.jl file in that same gist. Here are my results:

julia> include("perfFD3.jl")
Warm up @time:
  0.000004 seconds (148 allocations: 10.151 KB)
1D linear, Interpolations
  0.016062 seconds (15 allocations: 624 bytes)
1d linear, hand implementation
  0.027657 seconds (15 allocations: 624 bytes)
1d quadratic, Interpolations
  0.103341 seconds (4.00 M allocations: 122.071 MB, 16.80% gc time)
2D linear
  0.029129 seconds (15 allocations: 624 bytes)
2D quadratic
  0.353004 seconds (16.00 M allocations: 488.282 MB, 20.86% gc time)
3D linear
  0.058516 seconds (15 allocations: 688 bytes)
3D quadratic
  1.159985 seconds (52.00 M allocations: 2.325 GB, 17.65% gc time)

In 1D, the difference between linear and quadratic interpolation is just

c*v + cp*vp

vs

cm*vm + c*v + cp*vp

so it really looks like there's some threshold in how the compiler elides allocations.

@timholy
Contributor Author

timholy commented Sep 28, 2015

Woot! Adding parentheses in various places eliminates the allocations. See the latest file added to the gist; for quadratic interpolation I now get:

julia> include("perfFD4.jl")
  0.082390 seconds (15 allocations: 624 bytes)
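To illustrate the kind of regrouping I mean (the exact placement in perfFD4.jl may differ), using the coefficients from the previous comment:

# Illustrative only: flat chain vs. explicit pairwise grouping, so each
# intermediate is a single binary operation the compiler can keep on the stack.
quad_flat(cm, c, cp, vm, v, vp)    = cm*vm + c*v + cp*vp
quad_grouped(cm, c, cp, vm, v, vp) = (cm*vm) + ((c*v) + (cp*vp))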

@jrevels
Member

jrevels commented Sep 28, 2015

I'm seeing the same on my machine. That is to say:

linear perfFD3.jl calculations ---> master beats this PR
quad perfFD3.jl calculations ---> this PR beats master
perfFD4.jl calculation ---> master beats this PR

Thanks for all the investigative work here, @timholy & @KristofferC.

timholy deleted the pull-request/e38fb297 branch September 28, 2015 22:07
@timholy
Contributor Author

timholy commented Sep 28, 2015

Cross-ref JuliaLang/julia#13350

@timholy
Contributor Author

timholy commented Sep 28, 2015

Nice, it's now within ~2x of hand-written gradients (which is perhaps not surprising since it's got 2 parameters). I'll look forward to seeing how this scales with a little more testing.
