
BlockMap enhancements #72

Merged
merged 21 commits into master from blockdiag on Jan 13, 2020

Conversation

dkarrasch (Member) commented Oct 16, 2019

  • add 5-arg mul! for BlockMaps
  • use Base.@propagate_inbounds in BlockMap-multiplication
  • add overhead-free BlockDiagonalMap
  • add 5-arg mul!(::AbstractMatrix, ::BlockMap, ::AbstractMatrix, alpha, beta)
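
For readers new to the 5-arg form, the calls this PR enables look roughly like the following (a hedged usage sketch with made-up sizes and values; the concatenation syntax is the one the package already provides):

using LinearAlgebra, LinearMaps

A = rand(3, 4); B = rand(3, 4)
L = [LinearMap(A) LinearMap(B)]   # 3×8 BlockMap via hcat
x = rand(8); y = rand(3)
mul!(y, L, x, 2.0, 0.5)           # in place: y = 2.0 * (L * x) + 0.5 * y

X = rand(8, 5); Y = rand(3, 5)
mul!(Y, L, X, true, false)        # the matrix-valued 5-arg method from this PR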

coveralls commented Oct 16, 2019

Coverage Status

Coverage increased (+1.8%) to 95.972% when pulling 0857943 on blockdiag into a0f5016 on master.

codecov bot commented Oct 16, 2019

Codecov Report

Merging #72 into master will decrease coverage by 0.18%.
The diff coverage is 96.25%.


@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
- Coverage   96.67%   96.48%   -0.19%     
==========================================
  Files          10       10              
  Lines         631      655      +24     
==========================================
+ Hits          610      632      +22     
- Misses         21       23       +2
Impacted Files Coverage Δ
src/LinearMaps.jl 89.39% <100%> (+1.7%) ⬆️
src/wrappedmap.jl 100% <100%> (ø) ⬆️
src/blockmap.jl 96.98% <95.58%> (-1.61%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0f5016...0857943. Read the comment docs.

require_one_based_indexing(y, x)
m, n = size(A)
@boundscheck (m == length(y) && n == length(x)) || throw(DimensionMismatch("A_mul_B!"))
@inline function _blockmul!(y, A::BlockMap, x, α, β)
dkarrasch (Member Author):

This used to be the A_mul_B! code, which can be easily generalized to work with α, β, and matrices instead of vectors, so I factored it out. The generic version of indexing is then selectdim(y, 1, ...).
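
In spirit (an illustrative sketch with made-up names, not the actual implementation), the factored-out kernel does something like this:

using LinearAlgebra

# Illustrative only: `blocks[i][j]` is the (i,j) block (plain matrices here),
# `rowranges`/`colranges` hold the index ranges of the block rows/columns.
# selectdim makes the same code work for vector and matrix x and y.
function blockmul_sketch!(y, blocks, rowranges, colranges, x, α, β)
    for (i, yinds) in enumerate(rowranges)
        yslice = selectdim(y, 1, yinds)
        for (j, xinds) in enumerate(colranges)
            xslice = selectdim(x, 1, xinds)
            # scale the output slice by β on the first block of the row, then accumulate
            mul!(yslice, blocks[i][j], xslice, α, j == 1 ? β : true)
        end
    end
    return y
end

For suitably sized blocks, blockmul_sketch!(y, [[A11, A12], [A21, A22]], (1:2, 3:5), (1:3, 4:5), x, true, false) then reproduces mul!(y, [A11 A12; A21 A22], x).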

require_one_based_indexing(y, x)
m, n = size(A)
@boundscheck (n == length(y) && m == length(x)) || throw(DimensionMismatch("At_mul_B!"))
@inline function _transblockmul!(y, A::BlockMap, x, α, β, transform)
dkarrasch (Member Author):

This is the corresponding multiplication code that we used to have twice, once for At_mul_B! and once for Ac_mul_B!. Otherwise, this is the analogous generalization to alpha, beta, and matrices.
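
The transposed/adjoint counterpart, in the same hypothetical terms as the sketch above: the row and column partitions swap roles, each output slice is scaled by β exactly once and then accumulated into, and transform stands for either transpose or adjoint:

function transblockmul_sketch!(y, blocks, rowranges, colranges, x, α, β, transform)
    for (j, yinds) in enumerate(colranges)
        yslice = selectdim(y, 1, yinds)
        for (i, xinds) in enumerate(rowranges)
            xslice = selectdim(x, 1, xinds)
            mul!(yslice, transform(blocks[i][j]), xslice, α, i == 1 ? β : true)
        end
    end
    return y
end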

Comment on lines +344 to +360
Base.@propagate_inbounds A_mul_B!(y::AbstractVector, A::BlockMap, x::AbstractVector) =
mul!(y, A, x)

Base.@propagate_inbounds A_mul_B!(y::AbstractVector, A::TransposeMap{<:Any,<:BlockMap}, x::AbstractVector) =
mul!(y, A, x)

Base.@propagate_inbounds At_mul_B!(y::AbstractVector, A::BlockMap, x::AbstractVector) =
mul!(y, transpose(A), x)

Base.@propagate_inbounds A_mul_B!(y::AbstractVector, A::AdjointMap{<:Any,<:BlockMap}, x::AbstractVector) =
mul!(y, A, x)

Base.@propagate_inbounds Ac_mul_B!(y::AbstractVector, A::BlockMap, x::AbstractVector) =
mul!(y, adjoint(A), x)
dkarrasch (Member Author):

Have multiplication handled by mul!s, and nothing else.

Comment on lines +160 to +175
m = 5; n = 6
M1 = 10*(1:m) .+ (1:(n+1))'; L1 = LinearMap(M1)
M2 = randn(elty, m, n+2); L2 = LinearMap(M2)
M3 = randn(elty, m, n+3); L3 = LinearMap(M3)

# Md = diag(M1, M2, M3, M2, M1) # unsupported so use sparse:
Md = Matrix(blockdiag(sparse.((M1, M2, M3, M2, M1))...))
x = randn(elty, size(Md, 2))
Bd = @inferred blockdiag(L1, L2, L3, L2, L1)
@test Matrix(@inferred blockdiag(L1)) == M1
@test Matrix(@inferred blockdiag(L1, L2)) == blockdiag(sparse.((M1, M2))...)
Bd2 = @inferred cat(L1, L2, L3, L2, L1; dims=(1,2))
dkarrasch (Member Author):

I should say that these tests are shamelessly stolen from @JeffFessler's PR #65. 😉

dkarrasch requested a review from Jutho on October 22, 2019 at 07:05
dkarrasch (Member Author):

With the uniform scaling changes in the Kronecker PR, this makes (5-arg) multiplication of BlockMaps built from AbstractMatrix and UniformScaling objects with vectors, and even with matrices, essentially allocation-free, up to the generation of views. I read somewhere that taking views might be handled by the compiler in an allocation-free way as of Julia v1.4!
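
A hedged example of the kind of call that should now be essentially allocation-free (sizes are made up, and the bracket syntax with I relies on the UniformScaling concatenation support mentioned above):

using LinearAlgebra, LinearMaps

A = rand(4, 4)
L = [LinearMap(A) I; I LinearMap(A)]   # 8×8 BlockMap mixing a matrix-backed map and I
X = rand(8, 3); Y = zeros(8, 3)
mul!(Y, L, X)                          # warm-up / compilation
@allocated mul!(Y, L, X, 1.0, 0.0)     # should stay small: essentially just the block views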

This is ready for review. I guess once #61 and this one are merged, we should release and let people enjoy. 😃

dkarrasch (Member Author) commented Oct 22, 2019

Hm, at some point we should experiment with the multi-threading features. I did some benchmarking, adopting the benchmark tests from JuliaArrays/BlockDiagonals.jl#26 (comment). EDIT: I should say that, over at BlockDiagonals.jl, there is a recent effort to speed up multiplication in JuliaArrays/BlockDiagonals.jl#26, so their runtimes below are not final.

using LinearAlgebra, LinearMaps, BlockDiagonals, BenchmarkTools, Test, SparseArrays

for nblocks in (2, 15, 20, 200)
    @show nblocks
    As = [rand(10, 10) for _ in 1:nblocks]
    L = cat(LinearMap.(As)...; dims=(1,2))
    B = BlockDiagonal(As)
    A = rand(size(B)...)
    Y = zeros(size(B))
    M = Matrix(B)
    D = Diagonal(As)
    A′ = [rand(10, nblocks*10) for _ in 1:nblocks]

    println("BlockDiag * Matrix")
    @btime @inbounds mul!($Y, $L, $A)
    @btime hcat($D * $A′)
    @btime $B * $A
    @btime $M * $A
end

The results for the PR as is are:

nblocks = 2
BlockDiag * Matrix
  1.222 μs (4 allocations: 256 bytes)
  1.089 μs (7 allocations: 3.80 KiB)
  3.955 μs (24 allocations: 10.91 KiB)
  1.065 μs (1 allocation: 3.25 KiB)
nblocks = 15
BlockDiag * Matrix
  55.998 μs (30 allocations: 1.88 KiB)
  43.004 μs (20 allocations: 178.61 KiB)
  96.291 μs (142 allocations: 535.28 KiB)
  73.585 μs (2 allocations: 175.89 KiB)
nblocks = 20
BlockDiag * Matrix
  108.166 μs (40 allocations: 2.50 KiB)
  74.857 μs (25 allocations: 315.55 KiB)
  164.571 μs (187 allocations: 946.69 KiB)
  143.115 μs (2 allocations: 312.58 KiB)
nblocks = 200
BlockDiag * Matrix
  19.944 ms (400 allocations: 25.00 KiB)
  7.109 ms (405 allocations: 30.54 MiB)
  23.854 ms (2504 allocations: 91.63 MiB)
  111.515 ms (2 allocations: 30.52 MiB)

Note the small number of allocations, exactly two per block, corresponding to the views into x and y, respectively. The clear winner (in terms of producing numbers; of course, the application case and the exact arrangement of the numbers are different) is LinearAlgebra's Diagonal. Its multiplication is defined as simply D.diag .* v, so it uses broadcasting.

I was then wondering why the hell this is so much faster, given that all we do is go through the blocks and do in-place mul! into the corresponding views. I suspected that broadcasting may use, at the very bottom, multi-threading (hm, I just tested, it's as fast when there is only one thread 🤔), so I thought I'd give it a try with multi-threading in LinearMaps.jl. And actually, it turns out to be bloody easy: simply add a Threads.@threads in front of the loop. Here are the results:

nblocks = 2
BlockDiag * Matrix
  16.870 μs (58 allocations: 3.72 KiB)
  1.093 μs (7 allocations: 3.80 KiB)
  3.702 μs (24 allocations: 10.91 KiB)
  1.078 μs (1 allocation: 3.25 KiB)
  3.699 μs (1 allocation: 3.25 KiB)
nblocks = 15
BlockDiag * Matrix
  45.045 μs (240 allocations: 9.20 KiB)
  42.618 μs (20 allocations: 178.61 KiB)
  95.217 μs (142 allocations: 535.28 KiB)
  73.985 μs (2 allocations: 175.89 KiB)
  191.856 μs (2 allocations: 175.89 KiB)
nblocks = 20
BlockDiag * Matrix
  63.837 μs (310 allocations: 11.31 KiB)
  75.688 μs (25 allocations: 315.55 KiB)
  165.250 μs (187 allocations: 946.69 KiB)
  144.712 μs (2 allocations: 312.58 KiB)
  340.868 μs (2 allocations: 312.58 KiB)
nblocks = 200
BlockDiag * Matrix
  9.702 ms (3430 allocations: 96.63 KiB)
  7.124 ms (405 allocations: 30.54 MiB)
  23.140 ms (2504 allocations: 91.63 MiB)
  110.130 ms (2 allocations: 30.52 MiB)
  39.654 ms (2 allocations: 30.52 MiB)

This is with 4 threads. So we are able to kind of catch up with the fast Diagonal multiplication, at the expense of more allocations and quite some overhead in the case of a small number of blocks on the diagonal. Anyway, I think this will be exciting to optimize over concrete use cases. @chriscoey
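
For reference, the one-line change mentioned above (Threads.@threads in front of the block loop) is of roughly this shape; the following is a self-contained sketch over plain matrices, not the package's actual code:

using LinearAlgebra

# Sketch of a threaded block-diagonal multiply: each block writes to a
# disjoint row range of Y, so the per-block mul! calls are independent.
function blockdiag_mul_threaded!(Y, blocks, X, α, β)
    rowoffsets = cumsum([0; size.(blocks, 1)])
    coloffsets = cumsum([0; size.(blocks, 2)])
    Threads.@threads for i in 1:length(blocks)
        yview = view(Y, rowoffsets[i]+1:rowoffsets[i+1], :)
        xview = view(X, coloffsets[i]+1:coloffsets[i+1], :)
        mul!(yview, blocks[i], xview, α, β)
    end
    return Y
end

With the benchmark objects from above, blockdiag_mul_threaded!(Y, As, A, true, false) computes the same product as the timed mul!(Y, L, A).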

Another aspect that I realized: when there are 16 or more LinearMaps involved, the eltype of the resulting BlockDiagonalMap is no longer inferred. I guess the same applies to LinearCombinations etc.

dkarrasch (Member Author):

Would be great if we could get this one in, as well. The README already promises a new minor release. 😂

dkarrasch (Member Author):

This one is ready for review.

dkarrasch (Member Author):

@Jutho If you find the time, can you take a look? On the other hand, there is little "controversial" stuff in here and I have reviewed it myself many times 😂, so I'd merge soon.

dkarrasch merged commit 039e35f into master on Jan 13, 2020
dkarrasch deleted the blockdiag branch on January 13, 2020 at 11:55