
Disable forced inlining in arithmetic #58

Closed
timholy wants to merge 1 commit

Conversation

@timholy
Contributor

timholy commented Sep 26, 2015

I have a workload where I discovered that I got significantly better performance by deleting the @inline annotations on the arithmetic functions. The workload is a bit complicated, so I'll just show the timing info.

Using master:

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

  7.253977 seconds (138.38 M allocations: 7.238 GB, 20.80% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  5.894770 seconds (137.43 M allocations: 7.197 GB, 24.73% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  5.984184 seconds (137.43 M allocations: 7.197 GB, 24.45% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  5.896002 seconds (137.43 M allocations: 7.197 GB, 24.55% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

Using this PR:

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

  5.137296 seconds (969.36 k allocations: 1.095 GB, 2.14% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  4.066025 seconds (15.38 k allocations: 1.054 GB, 4.84% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  3.983676 seconds (14.87 k allocations: 1.054 GB, 1.79% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

julia> @time RegisterAffineAD.optimize_rigid(fixed, moving)
  4.142620 seconds (14.87 k allocations: 1.054 GB, 2.05% gc time)
([-0.26290299117443555,-0.04562260395244045,0.09707317047564838],0.0007446994302309248)

It seems there are 121 remaining uses of @inline in the code. I bet there are many other places where it would be a good idea to remove it. (I generally try not to use it unless I've checked that it actually improves performance; I don't know whether that was done here or not.)
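For the record, a rough sketch of how one might reproduce that count (the "src" path is an assumption about the package layout):

function count_inline(dir)
    n = 0
    for (root, _, files) in walkdir(dir), f in files
        endswith(f, ".jl") || continue
        # count every textual occurrence of @inline in this source file
        n += length(collect(eachmatch(r"@inline", read(joinpath(root, f), String))))
    end
    return n
end

count_inline("src")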

@mlubin
Contributor

mlubin commented Sep 26, 2015

Unfortunately the inlining instructions are necessary to get decent performance on many benchmarks. How are you using ForwardDiff? Chunk mode or vector mode?

@jrevels
Member

jrevels commented Sep 26, 2015

To elaborate, we had some benchmarks in which normal Dual numbers were beating out single-epsilon GradientNumbers for no discernible reason (they are the same theoretical construction); it turns out that inlining the arithmetic operations fixed it.

This incident prompted me to believe "if you expect the compiler to inline this, you should just manually inline it yourself as the inlining heuristics in Base can't always be relied upon." I then applied the @inline macro to a lot of places where the function body was "small." I'm now not necessarily certain this was the right strategy, and I was probably more heavy-handed in my approach than I should've been...
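The pattern in question looks roughly like this (GradNum here is a stand-in, not the package's actual GradientNumber type):

import Base: +

struct GradNum{T<:Real}
    value::T    # real part
    partial::T  # single epsilon component
end

# Force inlining of the small arithmetic method rather than trusting
# the inlining heuristics in Base.
@inline +(a::GradNum, b::GradNum) = GradNum(a.value + b.value, a.partial + b.partial)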

@timholy Would it be possible for you to link/post the code where ForwardDiff is being applied?

@KristofferC
Collaborator

The number of allocations went down by a factor of 10,000 from turning off inlining? Is it because there is some type instability and the function call acts as a barrier?

@timholy
Contributor Author

timholy commented Sep 26, 2015

@KristofferC, that was my first guess---I sometimes deliberately put @noinline on functions that act as a barrier---but I haven't yet had time to track it down. I'll see if I can post something.
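For reference, a hypothetical sketch (not code from this PR) of the @noinline barrier pattern I mean:

# The @noinline boundary keeps a type-unstable caller from dragging
# boxed values into the hot inner kernel.
@noinline kernel(v::Vector{Float64}) = sum(v)

function driver(containers::Vector{Any})
    total = 0.0
    for c in containers
        total += kernel(c)  # dynamic dispatch resolves here, once per element
    end
    return total
end

driver(Any[rand(10), rand(5)])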

@timholy
Contributor Author

timholy commented Sep 27, 2015

I've posted a reduced test case in this gist. The code in mygetindex comes from macroexpanding the @generated functions in Interpolations (I tested with the 0.3.0 release of Interpolations).

If I run with ForwardDiff master:

julia> include("perfFD.jl")
2-element Array{Float64,1}:
  123.305
 -952.366

julia> @time 1
  0.000006 seconds (148 allocations: 10.151 KB)
1

julia> @time sumA(itp.coefs, dz, irng, jrng)
  0.180838 seconds (5 allocations: 176 bytes)
2.4379931132108243e6

julia> @time gx, gy = gradsum(itp, dz, irng, jrng)
  0.223820 seconds (10 allocations: 368 bytes)
(123.30547906761386,-952.3663443995564)

julia> @time gg(dz)
  2.090166 seconds (87.67 M allocations: 2.613 GB, 21.22% gc time)
2-element Array{Float64,1}:
  123.305
 -952.366

If instead I use this PR:

julia> @time gg(dz)
  1.289657 seconds (9.74 M allocations: 297.263 MB, 3.88% gc time)
2-element Array{Float64,1}:
 -1072.05  
   -46.4687

@timholy
Contributor Author

timholy commented Sep 27, 2015

An even more simplified test is the perfFD2.jl script in that same gist. It reduces the trouble to a single (complicated) line.

@mlubin
Contributor

mlubin commented Sep 27, 2015

@timholy, could you try passing the chunk_size argument to ForwardDiff.gradient? That will likely improve performance significantly.

@timholy
Contributor Author

timholy commented Sep 27, 2015

I'm not exactly sure what chunk_size should be or what it does, but since my input is 2-dimensional (i.e., length(dz) = 2) I tried chunk_size=2.

julia> @time gg(dz)
  2.042326 seconds (87.67 M allocations: 2.613 GB, 22.40% gc time)
2-element Array{Float64,1}:
  899.534
 -347.626

julia> gg = ForwardDiff.gradient(dz->sumA(itp.coefs, dz, irng, jrng), chunk_size=2);

julia> @time gg(dz)
  2.053191 seconds (87.67 M allocations: 2.613 GB, 22.30% gc time)
2-element Array{Float64,1}:
  899.534
 -347.626

No difference.

@timholy
Contributor Author

timholy commented Sep 27, 2015

Oh, and I didn't explain the obvious (to me): gradsum corresponds to a hand-written gradient computation, and it's about 10x faster than ForwardDiff. However, with all the memory allocation, I think we're not yet at the point of finding the performance limits of ForwardDiff.

@timholy
Contributor Author

timholy commented Sep 27, 2015

Also, the dependency on Interpolations is only for comparison's sake. That gist reduces the performance problem to 6 GradientNumbers, 6 integers, 9 Float64s, 12 multiplies, and 8 adds.

@jrevels
Member

jrevels commented Sep 27, 2015

I won't have time to really dig into this code until later (I need to pull METADATA to grab Interpolations.jl), so apologies if the answer here is obvious, but for this function:

function sumA(itp, dz, irng, jrng)
    s = 0.0
    dx, dy = dz[1], dz[2]
    for j in jrng
        for i in irng
            # s += itp[i+dx,j+dy]
            s += mygetindex(itp, i+dx, j+dy)
        end
    end
    s
end

Is the final type of s going to be a GradientNumber if dz holds GradientNumbers? If so, you're going to run into type instability/inference problems there due to initializing s as a Float64 instead of writing something like s = zero(eltype(dz)).
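A minimal sketch of that fix, reusing mygetindex from the gist:

function sumA(itp, dz, irng, jrng)
    s = zero(eltype(dz))  # accumulator starts as a GradientNumber, not a Float64
    dx, dy = dz[1], dz[2]
    for j in jrng, i in irng
        s += mygetindex(itp, i + dx, j + dy)
    end
    return s
end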

P.S. There's documentation on chunk mode vs. vector mode here. One of the practical differences of the two is the use of tuples vs. vectors for partials storage.

@jrevels
Member

jrevels commented Sep 27, 2015

P.P.S. Using chunk mode here shouldn't change anything if your input dimension is only 2; ForwardDiff.jl defaults to tuple storage for low input dimensions anyway, so manually triggering chunk mode won't have any additional effect.

@timholy
Contributor Author

timholy commented Sep 27, 2015

Is the final type of s going to be a GradientNumber if dz holds GradientNumbers? If so, you're going to run into type instability/inferencing problems there due to initializing s as a float instead of saying something like s=zero(eltype(dz)).

Hmm, one would have thought that should have been obvious...

Helps, but still pretty slow. Master:

  1.698332 seconds (77.93 M allocations: 2.322 GB, 21.04% gc time)

This PR:

  1.025838 seconds (130 allocations: 5.953 KB)

Number of allocations is way down, however!

@mlubin
Contributor

mlubin commented Sep 27, 2015

Would be interesting to see the performance of dual numbers here.

@timholy
Contributor Author

timholy commented Sep 27, 2015

julia> @time sumA(itp.coefs, dzd, irng, jrng)
dz = DualNumbers.Dual{Float64}[0.302642145419342 + 1.0du,0.9309871095886213 + 0.0du]
  0.231094 seconds (54 allocations: 2.344 KB)
2.4333389979457306e6 + 298.16242299983907du

That's just for one component, so I suppose you'd have to double it. But it's still faster than ForwardDiff.

@timholy
Contributor Author

timholy commented Sep 27, 2015

If I take out my @show dz addition, then DualNumbers is non-allocating for this problem. So it's not a type-stability problem with my code.

@KristofferC
Collaborator

Isn't zero(eltype(dz)) wrong? The eltype of a GradientNumber is just the type of its real part, no?

Ex:

julia> nt
ForwardDiff.GradientNumber{2,Float64,Tuple{Float64,Float64}}

julia> zero(eltype(nt))
0.0

julia> zero(nt)
ForwardDiff.GradientNumber{2,Float64,Tuple{Float64,Float64}}(0.0,ForwardDiff.Partials{Float64,Tuple{Float64,Float64}}((0.0,0.0)))

@jrevels
Member

jrevels commented Sep 27, 2015

Isn't zero(eltype(dz)) wrong? eltype of a GradientNumber is just the type of it's real part, no?

dz is a Vector{G<:GradientNumber}, unless I'm mistaken.

@KristofferC
Collaborator

Yeah, that's right. Sorry.

@jrevels
Member

jrevels commented Sep 28, 2015

Comparing this PR with master via our feeble suite of benchmarks:

Master:

julia> ack_bench = get_benchmark(ackley)
3x4 DataFrames.DataFrame
| Row | time        | func | xlen | chunk_size |
|-----|-------------|------|------|------------|
| 1   | 0.000118116 | 'f'  | 5000 | -1         |
| 2   | 0.867284    | 'g'  | 5000 | 1          |
| 3   | 0.208063    | 'g'  | 5000 | 5          |

julia> ros_bench = get_benchmark(rosenbrock)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 1.244e-5  | 'f'  | 5000 | -1         |
| 2   | 0.0974855 | 'g'  | 5000 | 1          |
| 3   | 0.0636509 | 'g'  | 5000 | 5          |

julia> logit_bench = get_benchmark(self_weighted_logit)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 6.689e-6  | 'f'  | 5000 | -1         |
| 2   | 0.131128  | 'g'  | 5000 | 1          |
| 3   | 0.0615632 | 'g'  | 5000 | 5          |

This PR:

julia> ack_bench = get_benchmark(ackley)
3x4 DataFrames.DataFrame
| Row | time        | func | xlen | chunk_size |
|-----|-------------|------|------|------------|
| 1   | 0.000119679 | 'f'  | 5000 | -1         |
| 2   | 1.15358     | 'g'  | 5000 | 1          |
| 3   | 0.297316    | 'g'  | 5000 | 5          |

julia> ros_bench = get_benchmark(rosenbrock)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 1.2382e-5 | 'f'  | 5000 | -1         |
| 2   | 0.547275  | 'g'  | 5000 | 1          |
| 3   | 0.214978  | 'g'  | 5000 | 5          |

julia> logit_bench = get_benchmark(self_weighted_logit)
3x4 DataFrames.DataFrame
| Row | time      | func | xlen | chunk_size |
|-----|-----------|------|------|------------|
| 1   | 6.736e-6  | 'f'  | 5000 | -1         |
| 2   | 0.221926  | 'g'  | 5000 | 1          |
| 3   | 0.0869565 | 'g'  | 5000 | 5          |

Under the func column, 'f' denotes the original function evaluation time, while 'g' denotes the gradient evaluation time.

I swear, the next big addition to ForwardDiff will be a real ForwardDiffBenchmarks module...with docs.

@KristofferC
Collaborator

Splitting the ret line across three rows reduces the number of allocations by about 20 million, like so:

ret = cm_1 * (cm_2 * itp[ixm_1,ixm_2] + c_2 * itp[ixm_1,ix_2] + cp_2 * itp[ixm_1,ixp_2])
ret += c_1 * (cm_2 * itp[ix_1,ixm_2] + c_2 * itp[ix_1,ix_2] + cp_2 * itp[ix_1,ixp_2])
ret += cp_1 * (cm_2 * itp[ixp_1,ixm_2] + c_2 * itp[ixp_1,ix_2] + cp_2 * itp[ixp_1,ixp_2])

@KristofferC
Collaborator

With the above change I get 58.44e6 allocations. Doing the math:

julia> 58.44*10^6 / (length(irng) * length(jrng))
11.999159812423812

means we are allocating 12 times per call. Looking at @code_llvm on mygetindex shows

  %fx_1 = alloca %GradientNumber, align 8
  %fx_2 = alloca %GradientNumber, align 8
  %4 = alloca %GradientNumber, align 8
  %5 = alloca %GradientNumber, align 8
  %6 = alloca %GradientNumber, align 8
  %7 = alloca %GradientNumber, align 8
  %8 = alloca %GradientNumber, align 8
  %9 = alloca %GradientNumber, align 8
  %10 = alloca %GradientNumber, align 8
  %11 = alloca %GradientNumber, align 8
  %12 = alloca %GradientNumber, align 8
  %13 = alloca %GradientNumber, align 8

which are the 12 allocations.
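For reference, the inspection call is simply the following (itpcoefs, gi, and gj being the coefficient array and GradientNumber-valued indices from the gist setup; names here follow the later comments):

@code_llvm mygetindex(itpcoefs, gi, gj)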

@jrevels
Member

jrevels commented Sep 28, 2015

@KristofferC That's an interesting find; I'm seeing that as well on master. With this PR, I get a bit more:

  %fx_1 = alloca %GradientNumber, align 8
  %fx_2 = alloca %GradientNumber, align 8
  %cm_1 = alloca %GradientNumber, align 8
  %c_1 = alloca %GradientNumber, align 8
  %cp_1 = alloca %GradientNumber, align 8
  %cm_2 = alloca %GradientNumber, align 8
  %c_2 = alloca %GradientNumber, align 8
  %cp_2 = alloca %GradientNumber, align 8
  %ret = alloca %GradientNumber, align 8
  %4 = alloca %GradientNumber, align 8
  %5 = alloca %GradientNumber, align 8
  %6 = alloca %GradientNumber, align 8
  %7 = alloca %GradientNumber, align 8
  %8 = alloca %GradientNumber, align 8
  %9 = alloca %GradientNumber, align 8
  %10 = alloca %GradientNumber, align 8
  %11 = alloca %GradientNumber, align 8
.
.
.
  %60 = alloca %GradientNumber, align 8
  %61 = alloca %GradientNumber, align 8
  %62 = alloca %GradientNumber, align 8

Note that both of my checks for these had the ret optimization you recently posted. Comparing timings:

Master:

julia> @time mygetindex(itpcoefs, gi, gj)
  0.000003 seconds (17 allocations: 576 bytes)
ForwardDiff.GradientNumber{1,Float64,Tuple{Float64}}(0.7376966647789868,ForwardDiff.Partials{Float64,Tuple{Float64}}((0.0,)))

This PR:

julia> @time mygetindex(itpcoefs, gi, gj)
  0.000005 seconds (5 allocations: 192 bytes)
ForwardDiff.GradientNumber{1,Float64,Tuple{Float64}}(0.2262124251183511,ForwardDiff.Partials{Float64,Tuple{Float64}}((0.0,)))

with DualNumbers:

julia> @time mygetindex(itpcoefs, di, dj)
  0.000004 seconds (5 allocations: 192 bytes)
0.7376966647789868 + 0.0du

@KristofferC
Collaborator

I guess I was wrong that the 12 allocations come from alloca, then? I read some of the LLVM manual, and alloca seems to allocate on the stack, whereas @time reports heap allocations.

@timholy
Contributor Author

timholy commented Sep 28, 2015

I got so busy with all the work going into JuliaLang/METADATA.jl#3544 that I had to leave this for a while, but I've just put together another test script; see the perfFD3.jl file in that same gist. Here are my results:

julia> include("perfFD3.jl")
Warm up @time:
  0.000004 seconds (148 allocations: 10.151 KB)
1D linear, Interpolations
  0.016062 seconds (15 allocations: 624 bytes)
1d linear, hand implementation
  0.027657 seconds (15 allocations: 624 bytes)
1d quadratic, Interpolations
  0.103341 seconds (4.00 M allocations: 122.071 MB, 16.80% gc time)
2D linear
  0.029129 seconds (15 allocations: 624 bytes)
2D quadratic
  0.353004 seconds (16.00 M allocations: 488.282 MB, 20.86% gc time)
3D linear
  0.058516 seconds (15 allocations: 688 bytes)
3D quadratic
  1.159985 seconds (52.00 M allocations: 2.325 GB, 17.65% gc time)

In 1D, the difference between linear and quadratic interpolation is just

c*v + cp*vp

vs

cm*vm + c*v + cp*vp

so it really looks like there's some threshold in how the compiler elides allocations.

@timholy
Contributor Author

timholy commented Sep 28, 2015

Woot! Adding parentheses in various places eliminates the allocations. See the latest file added to the gist; for quadratic interpolation I now get:

julia> include("perfFD4.jl")
  0.082390 seconds (15 allocations: 624 bytes)
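To illustrate the kind of regrouping I mean (the exact placement in perfFD4.jl may differ), using the coefficients from the previous comment:

# Illustrative only: flat chain vs. explicit pairwise grouping, so each
# intermediate is a single binary operation the compiler can keep on the stack.
quad_flat(cm, c, cp, vm, v, vp)    = cm*vm + c*v + cp*vp
quad_grouped(cm, c, cp, vm, v, vp) = (cm*vm) + ((c*v) + (cp*vp))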

@jrevels
Member

jrevels commented Sep 28, 2015

I'm seeing the same on my machine. That is to say:

linear perfFD3.jl calculations ---> master beats this PR
quad perfFD3.jl calculations ---> this PR beats master
perfFD4.jl calculation ---> master beats this PR

Thanks for all the investigative work here, @timholy & @KristofferC.

timholy deleted the pull-request/e38fb297 branch September 28, 2015 22:07
@timholy
Contributor Author

timholy commented Sep 28, 2015

Cross-ref JuliaLang/julia#13350

@timholy
Contributor Author

timholy commented Sep 28, 2015

Nice, it's now within ~2x of hand-written gradients (which is perhaps not surprising since it's got 2 parameters). I'll look forward to seeing how this scales with a little more testing.
