Supporting Tropical numbers #201
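For context (an editor's summary, not part of the original thread): tropical numbers live in the max-plus semiring, where "addition" is max and "multiplication" is +. A minimal sketch of the idea, not the actual TropicalNumbers.jl implementation:

```julia
# Toy max-plus number type, for illustration only.
# TropicalNumbers.jl's Tropical type is the real implementation.
struct MyTropical{T <: Real}
    n::T
end

Base.:+(a::MyTropical, b::MyTropical) = MyTropical(max(a.n, b.n))  # tropical + is max
Base.:*(a::MyTropical, b::MyTropical) = MyTropical(a.n + b.n)      # tropical * is +
Base.zero(::Type{MyTropical{T}}) where {T} = MyTropical(typemin(T))  # -Inf for Float64
Base.one(::Type{MyTropical{T}}) where {T} = MyTropical(zero(T))      # 0 is the * identity
```

Matrix multiplication over this semiring computes C[i,j] = max_k (A[i,k] + B[k,j]), which is the operation the thread wants to vectorize.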
The easiest way to get this to work would be to swap regular numbers with tropical numbers and leave LoopVectorization oblivious. This should be fine, because LoopVectorization works internally with SIMD types. So you'll have to make sure all the basic operations work correctly. julia> using TropicalNumbers, VectorizationBase
julia> vx = Vec(ntuple(_ -> randn(), VectorizationBase.pick_vector_width(Float64))...)
Vec{8, Float64}<-0.4542360436646172, 0.808469891292917, 0.3402188425112263, -0.9854732349028669, -0.5457866173882676, 0.5384574165868918, -0.4639713295159042, 0.7095047672139283>
julia> vy = Vec(ntuple(_ -> randn(), VectorizationBase.pick_vector_width(Float64))...)
Vec{8, Float64}<-0.9020073240658397, 0.2134128364652452, -0.2290577287345638, -0.7468730791131707, 0.5319348585990041, -0.32909273153665797, -0.7241752504699737, -0.4699665265194333>
julia> Tropical(vx) + Tropical(vy)
Tropical(Vec{8, Float64}<-0.4542360436646172, 0.808469891292917, 0.3402188425112263, -0.7468730791131707, 0.5319348585990041, 0.5384574165868918, -0.4639713295159042, 0.7095047672139283>)
julia> Tropical(vx) * Tropical(vy)
Tropical(Vec{8, Float64}<-1.3562433677304568, 1.0218827277581621, 0.11116111377666252, -1.7323463140160376, -0.01385175878926348, 0.20936468505023387, -1.188146579985878, 0.23953824069449497>) Thankfully, it looks like this is already handled correctly. So the trickier part will be getting loads/stores from memory to work. I'd recommend defining a struct TropicalStridedPointer{T,N,C,B,R,X,O,SP <: VectorizationBase.AbstractStridedPointer{T,N,C,B,R,X,O} } <: VectorizationBase.AbstractStridedPointer{T,N,C,B,R,X,O}
sp::SP
end
function VectorizationBase.stridedpointer(A::AbstractArray{Tropical{T}}) where {T}
TropicalStridedPointer(stridedpointer(reinterpret(T, A)))
end
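The reinterpret above assumes `Tropical{T}` has exactly the memory layout of `T`; a quick sanity check of that assumption (an addition, not from the thread) is:

```julia
using TropicalNumbers

# If these fail, reinterpreting a Tropical array as its scalar type is unsound.
@assert isbitstype(TropicalF64)
@assert sizeof(TropicalF64) == sizeof(Float64)

A = Tropical.(rand(4, 4))
Af = reinterpret(Float64, A)          # shares memory with A
@assert Af[2, 3] == content(A[2, 3])  # same underlying bits
```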
@inline function VectorizationBase.stridedpointer(
ptr::Ptr{Tropical{T}}, ::StaticInt{C}, ::StaticInt{B}, ::Val{R}, strd::X, offsets::O
) where {T<:NativeTypesExceptBit,C,B,R,N,X<:Tuple{Vararg{Integer,N}},O<:Tuple{Vararg{Integer,N}}}
TropicalStridedPointer(StridedPointer{T,N,C,B,R,X,O}(Base.unsafe_convert(Ptr{T}, ptr), strd, offsets))
end
and then working on overloads for A = reshape(collect(Tropical(Float64(0)):Tropical(Float64(prod(dims)-1))), dims); and still have the following load/store tests pass, and of course make sure the loaded vectors are correct. Note that LoopVectorization will also wrap these tropical vectors with various types. Once all this works, you'll also need to define this method: LoopVectorization.check_type(::Type{Tropical{T}}) where {T} = LoopVectorization.check_type(T) If this returns false, LoopVectorization will silently use a slow fallback. That's all that immediately comes to mind. From there, you can try simple problems first, like a dot product: function mydot(a,b)
s = zero(promote_type(eltype(a),eltype(b)))
@avx for i in eachindex(a,b)
s += a[i]*b[i]
end
s
end
Once this works, you can try matrix-multiply examples, e.g. simple loops, and then finally Octavian. Note that Octavian currently behaves really badly if it gets an error in multithreaded code. You'll probably have to restart Julia if that happens, and might even need to use a task manager to kill all the threads (which could keep running after you exit Julia). I hope that helps; please let me know if you have any questions or run into any issues.
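Before trying the SIMD path, a plain reference implementation is useful for checking answers; a sketch (not from the thread) that works for any semiring element type defining `zero`, `+`, and `*`:

```julia
# Reference matmul over a general semiring; for Tropical numbers this is
# C[i,j] = max_k (A[i,k] + B[k,j]).
function refmatmul(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where {T}
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("inner dimensions must match"))
    C = fill(zero(T), size(A, 1), size(B, 2))
    @inbounds for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end
```

Comparing a fast kernel against `refmatmul` is the easiest way to catch correctness bugs like the NaNs reported later in this thread.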
Hi @chriselrod, thank you very much for your help. Here is a gist following your instructions; it was easier than I expected: https://gist.github.com/GiggleLiu/a6d2bed21731fa344f4d7c1660f35952 However, the results are regular matmul rather than tropical matmul. It seems the program does not recognize the element type. Do you know how to fix this? (EDIT: I noticed that I forgot to overload the …)
I would add an @test LoopVectorization.check_args(a, b) to make sure it returns true. A couple of examples to try: vload(stridedpointer(a), (MM{4}(1),1)) # load a[1:4,1]
vload(stridedpointer(a), (1,MM{4}(1))) # load a[1,1:4]
vload(stridedpointer(a), Unroll{2,1,3,1,4}((1,1))) # load a[1:4,1:3]
For example, with a plain Float64 matrix: julia> a = rand(4,4)
4×4 Matrix{Float64}:
0.199901 0.737377 0.535227 0.423361
0.0957741 0.398867 0.574244 0.898327
0.139651 0.878405 0.316808 0.31598
0.283851 0.267324 0.975793 0.326679
julia> vload(stridedpointer(a), (MM{4}(1),1)) # load a[1:4,1]
Vec{4, Float64}<0.199900943073283, 0.0957741305262807, 0.13965061919067523, 0.2838507750142387>
julia> vload(stridedpointer(a), (1,MM{4}(1))) # load a[1,1:4]
Vec{4, Float64}<0.199900943073283, 0.7373773483950572, 0.5352265912879297, 0.42336121283806105>
julia> vload(stridedpointer(a), Unroll{2,1,3,1,4}((1,1))) # load a[1:4,1:3]
3 x Vec{4, Float64}
Vec{4, Float64}<0.199900943073283, 0.0957741305262807, 0.13965061919067523, 0.2838507750142387>
Vec{4, Float64}<0.7373773483950572, 0.39886661759802355, 0.8784048837196166, 0.2673242656533237>
Vec{4, Float64}<0.5352265912879297, 0.5742437648414183, 0.31680796319268056, 0.975792905256238>
With your …
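As a reading aid (an editor's gloss): `MM{W}(i)` marks a contiguous W-element index range starting at `i`, and `Unroll` batches several such vector loads. In plain array indexing, the three `vload` calls above correspond roughly to:

```julia
# Plain-Julia equivalents of the vload calls, for intuition only;
# the real calls return Vec / VecUnroll values, not Arrays.
a = rand(4, 4)
col   = a[1:4, 1]    # vload(stridedpointer(a), (MM{4}(1), 1))
row   = a[1, 1:4]    # vload(stridedpointer(a), (1, MM{4}(1)))
block = a[1:4, 1:3]  # vload(stridedpointer(a), Unroll{2,1,3,1,4}((1, 1)))
```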
@chriselrod Thanks for your instructions. I feel I am now very close to implementing it correctly. The most exciting thing is that the tests now pass! Here is the gist: https://gist.github.com/GiggleLiu/a6d2bed21731fa344f4d7c1660f35952 But it requires several changes:
-abstract type AbstractStridedPointer{T<:NativeTypes,N,C,B,R,X<:Tuple{Vararg{Any,N}},O<:Tuple{Vararg{Any,N}}} end
+abstract type AbstractStridedPointer{T,N,C,B,R,X<:Tuple{Vararg{Any,N}},O<:Tuple{Vararg{Any,N}}} end
# TODO: FIX!!!!!!
function Base.promote(a::Int, b::Tropical{Vec{4,Float64}})
elem = a == 0 ? -Inf : 0.0
@show a
Tropical(Vec(elem, elem, elem, elem)), b
end
function Base.promote(a::Int, b::Tropical{Vec{4,Float64}}, c::Tropical{Vec{4,Float64}})
elem = a == 0 ? -Inf : 0.0
@show a
Tropical(Vec(elem, elem, elem, elem)), b, c
end
There is no promotion relation between Int64 and Tropical{Vec{4, Float64}}. The error message is:
Error During Test at /home/leo/jcode/lab/tropicalblas.jl:88
Test threw exception
Expression: Octavian.matmul_serial(a, b) ≈ a * b
promotion of types Int64 and Tropical{Vec{4, Float64}} failed to change any arguments
Stacktrace:
[1] error(::String, ::String, ::String)
@ Base ./error.jl:42
[2] sametype_error(input::Tuple{Int64, Tropical{Vec{4, Float64}}})
@ Base ./promotion.jl:316
[3] not_sametype(x::Tuple{Int64, Tropical{Vec{4, Float64}}}, y::Tuple{Int64, Tropical{Vec{4, Float64}}})
@ Base ./promotion.jl:310
[4] promote
@ ./promotion.jl:293 [inlined]
[5] mul_fast(::Int64, ::Tropical{Vec{4, Float64}})
@ Base.FastMath ./fastmath.jl:267
[6] macro expansion
@ ~/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:526 [inlined]
[7] _avx_!(#unused#::Val{(0, 0, 0, 4, 32, 15, 64)}, #unused#::Val{(:numericconstant, Symbol("###zero###9###"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000003, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000023, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000031, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x03), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000231, 0x0000000000000003, 0x0000000000000000, 0x0000000000020301, LoopVectorization.compute, 0x00, 0x01), :LoopVectorization, :identity, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000000004, LoopVectorization.compute, 0x00, 0x01), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x06), :LoopVectorization, :mul_fast, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000708, LoopVectorization.compute, 0x00, 0x07), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000060509, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, 
:setindex!, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000003, 0x0000000000000000, 0x000000000000000a, LoopVectorization.memstore, 0x03, 0x08))}, #unused#::Val{(LoopVectorization.ArrayRefStruct{:A, Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000203, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:B, Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000301, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:C, Symbol("##vptr##_C")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000))}, #unused#::Val{(0, (), (6, 7), (), (), ((1, LoopVectorization.IntOrFloat),), ())}, #unused#::Val{(:n, :m, :k)}, _vargs::Tuple{Tuple{LoopVectorization.CloseOpen{StaticInt{0}, Int64}, LoopVectorization.CloseOpen{StaticInt{0}, Int64}, LoopVectorization.CloseOpen{StaticInt{0}, Int64}}, Tuple{VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tuple{StaticInt{8}, Int64}, Tuple{StaticInt{0}, StaticInt{0}}}, VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tuple{StaticInt{8}, Int64}, Tuple{StaticInt{0}, StaticInt{0}}}, VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tuple{StaticInt{8}, Int64}, Tuple{StaticInt{0}, StaticInt{0}}}, StaticInt{1}, StaticInt{0}}})
@ LoopVectorization ~/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:526
[8] loopmul!
@ ~/.julia/dev/Octavian/src/macrokernels.jl:83 [inlined]
[9] _matmul_serial!
@ ~/.julia/dev/Octavian/src/matmul.jl:162 [inlined]
[10] matmul_serial(A::Matrix{TropicalF64}, B::Matrix{TropicalF64})
@ Octavian ~/.julia/dev/Octavian/src/matmul.jl:103
[11] top-level scope
@ ~/jcode/lab/tropicalblas.jl:88
[12] eval
@ ./boot.jl:360 [inlined]
[13] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1090
[14] invokelatest(::Any, ::Any, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Base ./essentials.jl:707
[15] invokelatest(::Any, ::Any, ::Vararg{Any, N} where N)
@ Base ./essentials.jl:706
[16] inlineeval(m::Module, code::String, code_line::Int64, code_column::Int64, file::String; softscope::Bool)
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:192
[17] (::VSCodeServer.var"#63#67"{Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams})()
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:150
[18] withpath(f::VSCodeServer.var"#63#67"{Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams}, path::String)
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/repl.jl:135
[19] (::VSCodeServer.var"#62#66"{Bool, Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams})()
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:148
[20] hideprompt(f::VSCodeServer.var"#62#66"{Bool, Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams})
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/repl.jl:36
[21] (::VSCodeServer.var"#61#65"{Bool, Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams})()
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:116
[22] with_logstate(f::Function, logstate::Any)
@ Base.CoreLogging ./logging.jl:491
[23] with_logger
@ ./logging.jl:603 [inlined]
[24] (::VSCodeServer.var"#60#64"{VSCodeServer.ReplRunCodeRequestParams})()
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:182
[25] #invokelatest#2
@ ./essentials.jl:707 [inlined]
[26] invokelatest(::Any)
@ Base ./essentials.jl:706
[27] macro expansion
@ ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:34 [inlined]
[28] (::VSCodeServer.var"#58#59")()
@ VSCodeServer ./task.jl:406
ERROR: LoadError: There was an error during testing
in expression starting at /home/leo/jcode/lab/tropicalblas.jl:88
Thanks again!
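A width-generic version of the hard-coded `Vec{4,Float64}` promote workarounds above might look like the following sketch (an editor's guess at a cleaner form; the `AbstractSIMD{W,T}` parameter order and `vbroadcast` behavior are assumed from how they are used elsewhere in this thread):

```julia
using TropicalNumbers, VectorizationBase

# Promote the integer constants 0/1 (produced by `zero`/`one` in generated code)
# to tropical SIMD values of the matching width, for any W and T.
function Base.promote(a::Int, b::Tropical{V}) where {W,T,V<:VectorizationBase.AbstractSIMD{W,T}}
    elem = a == 0 ? typemin(T) : zero(T)  # tropical zero is -Inf, tropical one is 0
    Tropical(VectorizationBase.vbroadcast(Val(W), elem)), b
end
```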
I did some benchmarks and found some more issues. The output sometimes contains NaNs, but only for large matrices. When it does not contain NaNs, the result is correct. It does not behave consistently when I run the function multiple times: sometimes it errors and sometimes it is correct. This is so weird; what could the possible causes be? julia> Octavian.matmul_serial(a, b)
20×20 Matrix{TropicalF64}:
Tropical(NaN) Tropical(4.024269734839327) Tropical(4.26327341398722) … Tropical(4.204939545709253) Tropical(NaN) Tropical(4.776889810790789)
Tropical(NaN) Tropical(2.194497613870535) Tropical(2.058543360062661) Tropical(3.5946191821421216) Tropical(NaN) Tropical(3.0040889036978973)
Tropical(2.9011941950744964) Tropical(2.1627096262173797) Tropical(2.8782146191689657) Tropical(3.8420567150697646) Tropical(NaN) Tropical(2.152284655932344)
Tropical(NaN) Tropical(1.7681295634003424) Tropical(2.921440672411346) Tropical(2.2011623659906308) Tropical(NaN) Tropical(1.9797996146204884)
Tropical(NaN) Tropical(2.38970559790015) Tropical(3.544039189063668) Tropical(2.267337659239235) Tropical(NaN) Tropical(2.818109225827046)
Tropical(2.9752555409299384) Tropical(4.297949125446652) Tropical(3.6685467944558394) … Tropical(4.1397369630940934) Tropical(NaN) Tropical(3.352553282694595)
Tropical(2.397845997221319) Tropical(1.8569225628500599) Tropical(1.2275202318592475) Tropical(2.8172671791289803) Tropical(NaN) Tropical(1.4759796716922038)
Tropical(2.1965028088271117) Tropical(2.140797817614342) Tropical(3.1723901113113544) Tropical(2.434120168918318) Tropical(NaN) Tropical(2.230749053520497)
⋮ ⋱
Tropical(NaN) Tropical(2.730009619936266) Tropical(2.553235863336632) Tropical(3.876715789552953) Tropical(NaN) Tropical(3.153954560543511)
Tropical(NaN) Tropical(2.6111525663946153) Tropical(2.448263484633223) Tropical(2.2721453419157123) Tropical(NaN) Tropical(2.8627405231265053)
Tropical(NaN) Tropical(2.616236581455606) Tropical(2.887718641691456) … Tropical(2.833083099894104) Tropical(NaN) Tropical(2.308935514119695)
Tropical(2.3711691743062926) Tropical(1.39657464107459) Tropical(2.0919814326461355) Tropical(NaN) Tropical(NaN) Tropical(1.713597248742208)
Tropical(1.1941119905234814) Tropical(2.1598929755724314) Tropical(2.8373250789251263) Tropical(NaN) Tropical(NaN) Tropical(1.8250873513607369)
Tropical(1.8801722402476346) Tropical(2.0994643359840515) Tropical(1.6684035006197018) Tropical(NaN) Tropical(NaN) Tropical(NaN)
Tropical(1.289189787032358) Tropical(2.9596107706634043) Tropical(2.1714404638502094) Tropical(NaN) Tropical(NaN) Tropical(NaN)
BTW: I see a speedup of roughly 15x for a 1000 x 1000 matrix, even with the above patch. julia> a = Tropical.(randn(1000, 1000));
julia> @time Octavian.matmul_serial(a, a);
0.358573 seconds (2 allocations: 7.629 MiB)
julia> @time a * a;
5.164423 seconds (8 allocations: 7.630 MiB, 0.55% gc time)
This is a speed that I never imagined!
Feel free to make a PR. Taking a real quick look, instead of: using Base.Cartesian: @nexprs
@generated function VectorizationBase.fma(x::Tropical{Vec{N,T}}, y::Tropical{Vec{N,T}}, z::Tropical{Vec{N,T}}) where {N,T}
Expr(:call, :Tropical, Expr(:call, :Vec, [:(max(content(z).data[$i].value, content(x).data[$i].value+content(y).data[$i].value)) for i=1:N]...))
end
do
@inline VectorizationBase.fma(x::Tropical{V}, y::Tropical{V}, z::Tropical{V}) where {V<:AbstractSIMD} = Tropical(max(z, x + y))
Also: function VectorizationBase._vzero(in1::StaticInt{W}, ::Type{T}, in3::StaticInt{RS}) where {W,T<:Tropical{FT},RS} where FT
Tropical(VectorizationBase._vzero(in1, FT, in3))
end
This should probably be
function VectorizationBase._vzero(in1::StaticInt{W}, ::Type{T}, ::StaticInt{RS}) where {W,T<:Tropical{FT},RS} where FT
Tropical(VectorizationBase._vbroadcast(StaticInt{W}(), FT(-Inf), StaticInt{RS}()))
end
I'll hopefully look into the zeros and …
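The reason `_vzero` should broadcast -Inf rather than a bitwise zero (an editor's note): the additive identity of the max-plus semiring is -Inf, because max(-Inf, x) == x, while 0.0 is the multiplicative identity. A quick check:

```julia
using TropicalNumbers

z = zero(TropicalF64)   # Tropical(-Inf): the identity for tropical + (max)
x = Tropical(1.5)
@assert z + x == x                 # max(-Inf, 1.5) == 1.5
@assert one(TropicalF64) * x == x  # 0.0 + 1.5 == 1.5
```

Broadcasting ordinary 0.0 into the accumulator would instead clamp all results below zero, which is one way NaN-free but wrong answers can appear.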
Abig = reshape(collect(range(1e4, step=1e3,length=100^2)), (100,100)); # 100 x 100 parent
A = view(Abig, 41:60, 41:60); # A is 20 x 20
A .= rand.();
Bbig = reshape(collect(range(1e16, step=1e15,length=100^2)), (100,100)); # 100 x 100 parent, much bigger scale
B = view(Bbig, 41:60, 41:60); # B is 20 x 20
B .= rand.();
Cbig = fill(-123456.789, 100, 100);
C = view(Cbig, 41:60, 41:60);
Octavian.matmul_serial!(C, A, B)
@test C ≈ A * B # make sure answers are correct
C .= -123456.789;
@test all(==(-123456.789), Cbig) # make sure everything outside of `C` is still `-123456.789`
Also, as a heads up, LoopVectorization will be moving to using …, so some details may change. If you want to document what it takes to add support for a custom type, and add some integration tests showing it, that could be a great contribution! Finally, I think it should be a lot faster than 0.35 seconds for a 1000x1000 matrix. julia> using Octavian
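Relatedly, a quick way to confirm LoopVectorization will actually take the fast path for your arrays (a sketch, assuming the Tropical overloads from the gist are loaded):

```julia
using LoopVectorization, TropicalNumbers

a = Tropical.(rand(8, 8))
b = Tropical.(rand(8, 8))
# If this is false, @avx silently falls back to a slow scalar loop.
@assert LoopVectorization.check_args(a, b)
```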
julia> A = rand(1000,1000);
julia> @time Octavian.matmul_serial(A,A); # time to first matmul is bad
21.730979 seconds (22.93 M allocations: 1.511 GiB, 3.31% gc time)
julia> @time Octavian.matmul_serial(A,A);
0.020565 seconds (2 allocations: 7.629 MiB)
julia> @time Octavian.matmul_serial(A,A);
0.020673 seconds (2 allocations: 7.629 MiB)
julia> @time Octavian.matmul_serial(A,A);
0.020467 seconds (2 allocations: 7.629 MiB)
If performance for …
A quick update: the resulting matrix no longer contains NaNs after defining @inline function VectorizationBase.fma(x::Tropical{V}, y::Tropical{V}, z::Tropical{V}) where {V<:VectorizationBase.AbstractSIMD}
Tropical(max(content(z), content(x) + content(y)))
end
But it is still interesting to know the cause, in order to make it work in more general cases. It is also faster with the above fixes. julia> A = rand(1000,1000);
julia> @btime A * A; # mkl
34.047 ms (2 allocations: 7.63 MiB)
julia> @btime A * A; # openblas
155.493 ms (2 allocations: 7.63 MiB)
julia> @btime Octavian.matmul_serial(A,A);
39.028 ms (2 allocations: 7.63 MiB)
julia> a = Tropical.(randn(1000, 1000));
julia> @btime Octavian.matmul_serial(a, a);
138.467 ms (2 allocations: 7.63 MiB) I think it should be faster, because the naive (reordered) for loop is only 3.3x slower in the Tropical number case, but is 4.8x slower in the Float64 case. julia> function naivemm!(o::Matrix, a::Matrix, b::Matrix)
@assert size(a, 2) == size(b, 1) && size(o) == (size(a, 1), size(b, 2))
for j=1:size(b, 2)
for k=1:size(a, 2)
for i=1:size(a, 1)
@inbounds o[i,j] += a[i,k] * b[k,j]
end
end
end
return o
end
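For comparison, the LoopVectorization-accelerated version of the same triple loop would just add the macro; this only works once the Tropical overloads are loaded and `check_args` returns true (a sketch, not code from the thread):

```julia
using LoopVectorization

# Same loop order as naivemm!, but vectorized by @avx.
function avxmm!(o::Matrix, a::Matrix, b::Matrix)
    @avx for j in 1:size(b, 2), k in 1:size(a, 2), i in 1:size(a, 1)
        o[i, j] += a[i, k] * b[k, j]
    end
    return o
end
```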
julia> @btime naivemm!(zero(A), A, A);
188.712 ms (2 allocations: 7.63 MiB)
julia> @btime naivemm!(zero(a), a, a);
461.117 ms (2 allocations: 7.63 MiB)
I did a profile: julia> Profile.print(mincount=10; format=:flat)
Count Overhead File Line Function
===== ======== ==== ==== ========
469 0 @Base/Base.jl 39 eval
15 15 @Base/bool.jl 37 |
469 0 @Base/boot.jl 360 eval
469 0 @Base/essentials.jl 707 #invokelatest#2
469 0 @Base/essentials.jl 706 invokelatest(::Any)
469 0 @Base/logging.jl 603 with_logger
469 0 @Base/logging.jl 491 with_logstate(f::Function, logstate::Any)
15 0 @Base/operators.jl 328 <=
469 0 @Base/task.jl 406 (::VSCodeServer.var"#58#59")()
465 0 @LoopVectorization/src/reconstruct_loopset.jl 526 _avx_!(#unused#::Val{(0, 0, 0, 4, 32, 15, 64)}, #unused#::Val{(:numericconstant, Symbol("###z...
469 0 @LoopVectorization/src/reconstruct_loopset.jl 526 macro expansion
458 0 @Octavian/src/macrokernels.jl 83 loopmul!
458 0 @Octavian/src/macrokernels.jl 248 packaloopmul!
469 0 @Octavian/src/matmul.jl 165 _matmul_serial!
469 0 @Octavian/src/matmul.jl 120 matmul_serial!(C::Matrix{TropicalF64}, A::Matrix{TropicalF64}, B::Matrix{TropicalF64})
469 0 @Octavian/src/matmul.jl 133 matmul_serial!
57 0 @Octavian/src/matmul.jl 58 matmul_st_pack_A_and_B!(C::VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tup...
408 0 @Octavian/src/matmul.jl 60 matmul_st_pack_A_and_B!(C::VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tup...
469 0 @Octavian/src/matmul.jl 178 matmul_st_pack_dispatcher!
123 0 @VectorizationBase/src/base_defs.jl 82 +
209 0 @VectorizationBase/src/base_defs.jl 82 max
123 123 @VectorizationBase/src/llvm_intrin/binary_ops.jl 62 macro expansion
123 0 @VectorizationBase/src/llvm_intrin/binary_ops.jl 62 vadd
330 0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 269 _vfmadd_fast
209 209 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 192 macro expansion
330 0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 252 vfma_fast
330 0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 271 vfmadd_fast
209 0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 192 vmax
10 10 @VectorizationBase/src/llvm_intrin/memory_addr.jl 530 macro expansion
11 0 @VectorizationBase/src/llvm_intrin/memory_addr.jl 530 vload
102 0 @VectorizationBase/src/llvm_intrin/vbroadcast.jl 33 _vbroadcast
102 102 @VectorizationBase/src/llvm_intrin/vbroadcast.jl 65 macro expansion
102 0 @VectorizationBase/src/llvm_intrin/vbroadcast.jl 69 vbroadcast
11 0 @VectorizationBase/src/strided_pointers/stridedpointers.jl 139 vload
469 0 @VSCodeServer/src/eval.jl 34 macro expansion
469 0 @VSCodeServer/src/repl.jl 93 (::VSCodeServer.var"#44#46"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
469 0 @VSCodeServer/src/repl.jl 92 (::VSCodeServer.var"#45#47"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
469 0 @VSCodeServer/src/repl.jl 117 repleval(m::Module, code::Expr, #unused#::String)
330 0 /home/leo/jcode/lab/tropicalblas.jl 41 fma
102 0 /home/leo/jcode/lab/tropicalblas.jl 31 vbroadcast
11 0 /home/leo/jcode/lab/tropicalblas.jl 27 vload
Total snapshots: 938
Comparing with Float64: julia> Profile.print(mincount=10; format=:flat)
Count Overhead File Line Function
===== ======== ==== ==== ========
155 0 @Base/Base.jl 39 eval
14 14 @Base/bool.jl 37 |
155 0 @Base/boot.jl 360 eval
155 0 @Base/essentials.jl 707 #invokelatest#2
155 0 @Base/essentials.jl 706 invokelatest(::Any)
155 0 @Base/logging.jl 603 with_logger
155 0 @Base/logging.jl 491 with_logstate(f::Function, logstate::Any)
14 0 @Base/operators.jl 328 <=
155 0 @Base/task.jl 406 (::VSCodeServer.var"#58#59")()
154 0 @LoopVectorization/src/reconstruct_loopset.jl 526 _avx_!(#unused#::Val{(0, 0, 0, 4, 32, 15, 64)}, #unused#::Val{(:numericconstant, Symbol("###z...
155 0 @LoopVectorization/src/reconstruct_loopset.jl 526 macro expansion
153 0 @Octavian/src/macrokernels.jl 83 loopmul!
153 0 @Octavian/src/macrokernels.jl 248 packaloopmul!
155 0 @Octavian/src/matmul.jl 165 _matmul_serial!
155 0 @Octavian/src/matmul.jl 120 matmul_serial!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
155 0 @Octavian/src/matmul.jl 133 matmul_serial!
36 0 @Octavian/src/matmul.jl 58 matmul_st_pack_A_and_B!(C::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{S...
118 0 @Octavian/src/matmul.jl 60 matmul_st_pack_A_and_B!(C::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{S...
155 0 @Octavian/src/matmul.jl 178 matmul_st_pack_dispatcher!
10 0 @VectorizationBase/src/base_defs.jl 82 add_fast
10 0 @VectorizationBase/src/base_defs.jl 236 vfma_fast
10 10 @VectorizationBase/src/llvm_intrin/binary_ops.jl 63 macro expansion
10 0 @VectorizationBase/src/llvm_intrin/binary_ops.jl 63 vadd_fast
82 0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 269 _vfmadd_fast
72 72 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 237 macro expansion
72 0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 237 vfma_fast
82 0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl 271 vfmadd_fast
11 11 @VectorizationBase/src/llvm_intrin/memory_addr.jl 530 macro expansion
11 0 @VectorizationBase/src/llvm_intrin/memory_addr.jl 530 vload
28 0 @VectorizationBase/src/llvm_intrin/vbroadcast.jl 33 _vbroadcast
28 28 @VectorizationBase/src/llvm_intrin/vbroadcast.jl 65 macro expansion
28 0 @VectorizationBase/src/llvm_intrin/vbroadcast.jl 69 vbroadcast
11 0 @VectorizationBase/src/strided_pointers/stridedpointers.jl 139 vload
155 0 @VSCodeServer/src/eval.jl 34 macro expansion
155 0 @VSCodeServer/src/repl.jl 93 (::VSCodeServer.var"#44#46"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
155 0 @VSCodeServer/src/repl.jl 92 (::VSCodeServer.var"#45#47"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
155 0 @VSCodeServer/src/repl.jl 117 repleval(m::Module, code::Expr, #unused#::String)
Total snapshots: 310
It looks like the Tropical … (EDIT: the latest benchmark of tropical GEMM is halved.)
Will do; I am also planning to write a blog post about it. Looking forward to the …
Hi @chriselrod, to make the tropical BLAS more accessible to the public, I created a repo here and have invited you as a collaborator: https://github.com/TensorBFS/TropicalGEMM.jl
Weird: in the profile, the … julia> @code_native vbroadcast(Val(4), Tropical(2.0))
.text
; ┌ @ gemm.jl:28 within `vbroadcast'
movq %rdi, %rax
; │ @ gemm.jl:29 within `vbroadcast' @ vbroadcast.jl:77
; │┌ @ vbroadcast.jl:41 within `_vbroadcast'
; ││┌ @ vbroadcast.jl:73 within `macro expansion'
vbroadcastsd (%rsi), %ymm0
; │└└
; │ @ gemm.jl:29 within `vbroadcast'
vmovaps %ymm0, (%rdi)
vzeroupper
retq
; └
julia> @code_native vbroadcast(Val(4), 2.0)
.text
; ┌ @ vbroadcast.jl:77 within `vbroadcast'
movq %rdi, %rax
; │┌ @ vbroadcast.jl:41 within `_vbroadcast'
; ││┌ @ vbroadcast.jl:73 within `macro expansion'
vbroadcastsd %xmm0, %ymm0
; │└└
vmovaps %ymm0, (%rdi)
vzeroupper
retq
; └
I checked it many times; this behavior is consistent.
I'll take a look and profile soon now that LoopVectorization 0.12 is out, and Octavian has been updated to use it.
Note that it should be exactly the same in the actual function, but will be a little different when called from the REPL in this way. One version does vbroadcastsd (%rsi), %ymm0 while the other does vbroadcastsd %xmm0, %ymm0, so it also wouldn't really be accurate to benchmark them this way. That also means that when called inside a function, this line: vmovaps %ymm0, (%rdi) won't be there, because the result will be kept in a register instead of stored to memory. If called inside a function, like Octavian's gemm, …
Not sure what you mean here. vbroadcastsd is an assembly instruction that basically copies a value to fill the entirety of a register.
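In scalar terms (an editor's gloss), the broadcast being discussed just fills every SIMD lane with one value:

```julia
using VectorizationBase

# vbroadcast fills all lanes of a SIMD vector with a single value;
# vbroadcastsd is the x86 instruction that does this for a 64-bit float.
v = VectorizationBase.vbroadcast(Val(4), 2.0)
# v is a Vec{4, Float64} with every lane equal to 2.0
```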
Just to summarize this discussion a bit before closing it with TropicalGEMM: the final performance on my host is close to the theoretical optimum (it halves the benchmark above: #201 (comment)). julia> using TropicalNumbers, Octavian, TropicalGEMM, BenchmarkTools
julia> a = Tropical.(randn(1000, 1000));
julia> @benchmark Octavian.matmul_serial!($(zero(a)), $a, $a)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 64.938 ms (0.00% GC)
median time: 65.147 ms (0.00% GC)
mean time: 65.371 ms (0.00% GC)
maximum time: 67.139 ms (0.00% GC)
--------------
samples: 77
evals/sample: 1
The theoretical optimum is … (I will summarize this project a bit later in a blog post, or maybe in the documentation.) Cheers!
julia> using Octavian, TropicalGEMM
julia> a = Tropical.(randn(1000, 1000)); c = similar(a);
julia> @benchmark Octavian.matmul_serial!($c, $a, $a) # single thread
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 31.516 ms (0.00% GC)
median time: 31.618 ms (0.00% GC)
mean time: 31.619 ms (0.00% GC)
maximum time: 31.972 ms (0.00% GC)
--------------
samples: 159
evals/sample: 1
julia> @benchmark Octavian.matmul!($c, $a, $a) # multiple threads
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.888 ms (0.00% GC)
median time: 1.965 ms (0.00% GC)
mean time: 1.982 ms (0.00% GC)
maximum time: 2.280 ms (0.00% GC)
--------------
samples: 2522
evals/sample: 1
julia> versioninfo()
Julia Version 1.7.0-DEV.707
Commit d1d0646390* (2021-03-14 18:11 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
I have a few different CPUs I test on; those results were on the i9 10980XE, which has full-speed AVX512 support. Clock speeds are probably similar, i.e. if you didn't change any settings in the BIOS, your CPU will boost much higher than the 2.9 GHz base. Thus the 16/8 vs 8/4 operations per instruction (AVX512 vs AVX2, single/double precision) is why there's roughly a 2x difference in per-core performance. Also, while the new Rocket Lake CPUs have AVX512, it'll be half rate (meaning instructions/cycle = 1, not 2) for these core arithmetic instructions, just like it is on my laptop with a Tiger Lake CPU: julia> using Octavian, TropicalGEMM
julia> a = Tropical.(randn(1000, 1000)); c = similar(a);
julia> @benchmark Octavian.matmul_serial!($c, $a, $a) # single thread
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 61.200 ms (0.00% GC)
median time: 62.718 ms (0.00% GC)
mean time: 62.722 ms (0.00% GC)
maximum time: 63.368 ms (0.00% GC)
--------------
samples: 80
evals/sample: 1
julia> @benchmark Octavian.matmul!($c, $a, $a) # multiple threads
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 22.217 ms (0.00% GC)
median time: 28.867 ms (0.00% GC)
mean time: 28.826 ms (0.00% GC)
maximum time: 57.157 ms (0.00% GC)
--------------
samples: 174
evals/sample: 1
julia> versioninfo()
Julia Version 1.7.0-DEV.713
Commit ae26fc6d5f* (2021-03-16 06:50 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
JULIA_NUM_THREADS = 8
This CPU does the same number of operations/instruction as my 10980XE, but half the instructions/cycle. While I haven't tested an M1 Mac, I believe they will have 4 instructions/cycle, but with only 2 operations/instruction, meaning the same speed per clock cycle as the i5 10400 or the i7-1165G7, but half the 10980XE (for these sorts of linear algebra operations).
julia> @benchmark Octavian.matmul!($c, $a, $a) # multiple threads
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.888 ms (0.00% GC)
median time: 1.965 ms (0.00% GC)
mean time: 1.982 ms (0.00% GC)
maximum time: 2.280 ms (0.00% GC)
--------------
samples: 2522
evals/sample: 1
Well, you win over the Titan V GPU: using TropicalNumbers, CUDA, BenchmarkTools
julia> a = Tropical.(CUDA.randn(1000, 1000));
julia> @benchmark CUDA.@sync LinearAlgebra.mul!($(zero(a)), $a, $a)
BenchmarkTools.Trial:
memory estimate: 2.19 KiB
allocs estimate: 68
--------------
minimum time: 2.045 ms (0.00% GC)
median time: 2.200 ms (0.00% GC)
mean time: 2.385 ms (0.00% GC)
maximum time: 6.760 ms (0.00% GC)
--------------
samples: 2095
evals/sample: 1
Thank you for your very detailed explanation; I do not envy your machines at all. Actually, I have two questions:
|
You could buy three 10980XEs for the price of one Titan V ;). Although the Titan V will probably beat a single 10980XE for large enough arrays (e.g. 10_000 x 10_000), because it has 7.45 peak theoretical TFLOPS for Float64.
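For scale, a back-of-envelope peak estimate for the 10980XE (an editor's assumed numbers, not from the thread: 18 cores, 2 AVX512 FMA units per core, 8 Float64 lanes, 2 flops per FMA, ~3 GHz sustained):

```julia
# Rough Float64 peak for a 10980XE under the assumptions above.
cores, fma_units, lanes, flops_per_fma, clock = 18, 2, 8, 2, 3.0e9
cpu_peak_tflops = cores * fma_units * lanes * flops_per_fma * clock / 1e12  # ≈ 1.7
titanv_fp64_tflops = 7.45  # figure quoted in this thread
```

Under those assumptions the GPU's raw Float64 peak is roughly 4x one CPU, which is why the GPU should pull ahead for large enough matrices.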
Ideally, I'd use whatever clock speed it actually runs at.

julia> using LinuxPerf # needs the master branch
julia> foreachf(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf (generic function with 1 method)
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(Octavian.matmul_serial!, 30, c, a, a)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.95e+09 49.9% # 4.1 cycles per ns
┌ instructions 1.02e+10 75.0% # 2.6 insns per cycle
│ branch-instructions 1.41e+08 75.0% # 1.4% of instructions
└ branch-misses 2.00e+05 75.0% # 0.1% of branch instructions
┌ task-clock 9.71e+08 100.0% # 970.5 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 5.95e+08 25.0% # 28.3% of dcache loads
│ L1-dcache-loads 2.10e+09 25.0%
└ L1-icache-load-misses 7.57e+04 25.0%
┌ dTLB-load-misses 1.15e+04 25.0% # 0.0% of dTLB loads
└ dTLB-loads 2.10e+09 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Note that it says 4.1 cycles per ns, meaning the CPU averaged about 4.1 GHz while running the benchmark.

For comparison, this is my laptop:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(Octavian.matmul_serial!, 30, c, a, a)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 7.68e+09 50.0% # 4.2 cycles per ns
┌ instructions 1.02e+10 75.1% # 1.3 insns per cycle
│ branch-instructions 1.41e+08 75.1% # 1.4% of instructions
└ branch-misses 1.86e+05 75.1% # 0.1% of branch instructions
┌ task-clock 1.81e+09 100.0% # 1.8 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 5.89e+08 75.1% # 28.1% of dcache loads
│ L1-dcache-loads 2.10e+09 75.1%
└ L1-icache-load-misses 3.92e+04 75.1%
┌ dTLB-load-misses 4.61e+05 24.9% # 0.0% of dTLB loads
└ dTLB-loads 2.10e+09 24.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

It averaged 4.2 GHz, and performed the same number of total instructions (1.02e+10), but at only 1.3 instructions per cycle instead of 2.6, so it took roughly twice as long.
uops.info is a great resource.
"Throughput" actually means reciprocal throughput, i.e. if you're doing a lot of them, how many clock cycles does it take on average per instruction. So "0.5" means that every 10 clock cycles, it does 10/0.5 = 20 of them. You can think of it as an average time per instruction. But for matrix multiplication, it's just the throughput or average time per instruction that matters. Tiger Lake (my laptop) and Rocket Lake are both architecturally very similar to Ice Lake, and should have essentially all the same numbers (main differences are that Rocket Lake is backported to 14 nm, while Tiger Lake features upgraded caches). The reason why Cascadelake is twice as fast is given by the "Port usage" line:
On Cascadelake, the 512-bit FMA can execute on either port 0 or port 5 (1*p05), for a reciprocal throughput of 0.5, while Ice/Tiger Lake client chips can only send it to port 0 (1*p0), for a reciprocal throughput of 1. You can look up other instructions, like the max and add instructions a tropical kernel would need, in the same way. For your i5 10400, I'd look up the Skylake results for the 256-bit (ymm) versions of the instructions, since it doesn't have AVX-512.
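The reciprocal-throughput arithmetic above can be made concrete with a tiny worked example (Python for illustration; `insns_completed` is a hypothetical helper, not anything from uops.info):

```python
# "Throughput" on uops.info is reciprocal throughput: the steady-state
# average number of cycles each instruction costs when many independent
# instances are in flight.
def insns_completed(cycles, recip_throughput):
    """Independent instructions retired over `cycles` cycles."""
    return cycles / recip_throughput

print(insns_completed(10, 0.5))  # two eligible ports -> 20.0 per 10 cycles
print(insns_completed(10, 1.0))  # single port       -> 10.0 per 10 cycles
```

The factor of two between 0.5 and 1.0 is exactly the per-clock gap between the 10980XE and the i7-1165G7 discussed above.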
WOW, I totally underestimated how hard it is to measure a CPU's computing power; I learnt a lot from you. I also noticed your answer is much clearer than those on Stack Overflow. In case you are interested in reposting your answer there, here is the link to the question:
Nice project! I tried Octavian, the BLAS package built on top of LoopVectorization, and the performance is amazing.
Now I wish to make a BLAS package for Tropical algebra (a simple semiring algebra obtained by replacing + with max and * with +) and more complicated algebras (see the CountingTropical type below). I am wondering how difficult that may be.
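The semiring substitution described here is tiny to state. A minimal scalar Python sketch of tropical (max-plus) matrix multiplication, just to pin down the semantics (hypothetical helper names; the real package is TropicalNumbers.jl, and a fast version would of course be SIMD-vectorized):

```python
# Max-plus semiring: semiring "+" is max, semiring "*" is +,
# and the additive identity (tropical zero) is -infinity.
NEG_INF = float("-inf")

def tropical_add(a, b):
    return max(a, b)

def tropical_mul(a, b):
    return a + b

def tropical_matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[NEG_INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = NEG_INF  # accumulator starts at the tropical zero
            for l in range(k):
                acc = tropical_add(acc, tropical_mul(A[i][l], B[l][j]))
            C[i][j] = acc
    return C

A = [[0.0, 1.0], [2.0, 3.0]]
print(tropical_matmul(A, A))  # -> [[3.0, 4.0], [5.0, 6.0]]
```

The inner loop is the same multiply-accumulate shape as ordinary matmul, which is why a generic kernel like Octavian's could in principle be reused with the scalar operations swapped out.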
Currently, if I try using the Tropical type, the error message shows:
References
https://github.com/TensorBFS/TropicalNumbers.jl/blob/547edebf9abd3247299975c712e59e08eff2df75/src/base.jl#L38
These functions define counting-tropical numbers (L10-21):
https://github.com/TensorBFS/TropicalNumbers.jl/blob/547edebf9abd3247299975c712e59e08eff2df75/src/counting_tropical.jl#L10
Tropical BLAS is useful for solving combinatorial optimization problems: https://arxiv.org/abs/2008.06888 . In this work, we used the generic matmul on GPU, and it pushed the state of the art in solving spin glasses, the Potts model, 2-SAT counting, etc. I hope the BLAS on CPU can also reach a proper speed.
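As a toy illustration of why tropical matmul shows up in such problems (this is not the cited paper's method, just the standard max-plus path interpretation; `tmatmul` and the graph are made up):

```python
# In the max-plus semiring, (W^k)[i][j] is the maximum total weight over
# all length-k walks i -> j, so repeated tropical matmul solves
# longest-walk style problems.
NEG_INF = float("-inf")  # -inf marks "no edge" (the tropical zero)

def tmatmul(A, B):
    n, m = len(A), len(B[0])
    return [[max(A[i][l] + B[l][j] for l in range(len(B)))
             for j in range(m)] for i in range(n)]

# Edge weights of a 3-node directed graph.
W = [
    [NEG_INF, 1.0,     NEG_INF],
    [NEG_INF, NEG_INF, 2.0],
    [4.0,     NEG_INF, NEG_INF],
]
W2 = tmatmul(W, W)
print(W2[0][2])  # best 2-step walk 0 -> 1 -> 2 has weight 1 + 2 = 3.0
```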
@chriselrod