Added some methods to help support LoopVectorization.jl #61
README.md (outdated):

@@ -112,6 +112,21 @@ If `step` of instances of type `T` are known at compile time, return that step.
Otherwise, returns `nothing`. For example, `known_step(UnitRange{Int})` returns
`one(Int)`.

## is_cpu_column_major(::Type{T})
why cpu?
Do we also want to consider accommodating other memory layouts that have been used in improving performance of array operations like these?
@ChrisRackauckas "cpu" because I wanted it to be false for GPUArrays. Given that AbstractDeviceArrays are `<: DenseArray` in order to opt into the StrideArray interface (even though LinearAlgebra routines presumably don't work), I wanted to make it clear they shouldn't opt into this.
@Tokazama The `stridelayout` function can efficiently handle row-major, or increasing vs decreasing strides, e.g.:
julia> using ArrayInterface: stridelayout
julia> is_row_major(x) = stridelayout(x) === (2,1,(2,1))
is_row_major (generic function with 1 method)
julia> B = rand(3,4);
julia> @code_typed is_row_major(B')
CodeInfo(
1 ─ return true
) => Bool
julia> @code_typed is_row_major(B)
CodeInfo(
1 ─ return false
) => Bool
But perhaps I should add density. Knowing that certain dims are dense could potentially allow subsets of loops to undergo CartesianIndexing -> LinearIndexing style loop fusion, even if on the whole the IndexStyle is IndexCartesian().
Or, as a correction for

julia> IndexStyle(typeof(B))
IndexLinear()

julia> IndexStyle(typeof(B'))
IndexCartesian()

Maybe all arrays are transposed (and thus IndexCartesian()), but if they're still dense, we could use linear indexing anyway.
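For instance, a minimal sketch (using only Base and LinearAlgebra; `sum_dense_linear` is a hypothetical helper, not part of this PR) of exploiting density to index linearly through an IndexCartesian wrapper:

using LinearAlgebra: Adjoint, Transpose

# The wrapper is IndexCartesian(), but its parent owns the same dense memory
# and is IndexLinear(), so an order-independent reduction can traverse the
# parent linearly.
function sum_dense_linear(A::Union{Adjoint,Transpose})
    P = parent(A)
    s = zero(eltype(A))
    @inbounds for i in eachindex(P)
        s += P[i]
    end
    return s
end

B = rand(3, 4);
sum_dense_linear(B') ≈ sum(B') # true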
Yeah, I was specifically thinking of dense arrays but didn't want to chance overlooking something else. Using IndexStyle to accomplish this is pretty clever.
I added density info on a per-stride basis.
julia> A = rand(3,4,5);
julia> stridelayout(PermutedDimsArray(A,(3,1,2)))
(2, 1, (3, 1, 2), (true, true, true))
julia> stridelayout(@view(PermutedDimsArray(A,(3,1,2))[2,1:2,:]))
(1, 1, (1, 2), (true, false))
julia> stridelayout(@view(PermutedDimsArray(A,(3,1,2))[2:3,1:2,:]))
(2, 1, (3, 1, 2), (false, false, true))
Unfortunately, we can't tell from type info whether a slice of an array was partial or complete:
julia> s2 = Base.Slice(Base.OneTo(2))
Base.Slice(Base.OneTo(2))
julia> view(A,s2,s2,s2)
2×2×2 view(::Array{Float64,3}, :, :, :) with eltype Float64:
[:, :, 1] =
0.329658 0.543774
0.350255 0.0817058
[:, :, 2] =
0.310211 0.126501
0.587191 0.884039
as `Base.Slice{Base.OneTo{Int64}}` is the type of a `:` slice. This means we have to be very conservative with base SubArrays, but packages (like PaddedMatrices.jl) might want to provide more compile-time info for their types.
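Concretely, the partial slice above produces the very same SubArray type as a complete `:` slice:

julia> typeof(view(A, s2, s2, s2)) === typeof(view(A, :, :, :))
true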
The finer granularity would allow for the partial linear-vs-cartesian indexing.
Other reasons to have this information:
- Using linear indexing to decide what to do for something like a `vec` implementation is a little goofy.
- LoopVectorization often spills scalar integer registers (i.e., `r*` registers). Many of these registers are used for holding array strides, and integer multiples of these strides. E.g., with an AVX512 column-major matmul kernel for `C = A * B`, it wants to hold `stride(A,2), 3stride(A,2), 5stride(A,2), stride(C,2), 3stride(C,2), 5stride(C,2)`. So if we knew that `stride(A,2) == stride(C,2)`, we could free 3 registers, save a couple calculations, and avoid a bunch of integer loads/stores from the stack thanks to fewer spills. I'm not sure how to implement this optimization / let a user guarantee stride equality, because code like the following is 100% valid:

A = rand(4,5);
C = rand(40,50);
for m in axes(A,1), n in axes(A,2)
    C[m,n] = 2A[m,n]
end
So knowing that `A` and `C` are both dense still isn't enough to guarantee that the strides are equal.
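For the example above:

julia> strides(A), strides(C)
((1, 4), (1, 40))

so the column strides differ even though both arrays are dense.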
Not particularly relevant to this issue, but I wonder if there'd be a good way to communicate size equality. Users will often check for equal sizes in front of a loop, and `eachindex` will guarantee this as well:

for I in eachindex(A, C) # checks for equivalent size
end

Because `eachindex` isn't always appropriate (e.g., when you don't have exact correspondence between axes of the arrays), I think it'd be good to have something like an `eachindexalongaxis`:
function mymatmulkernel!(C, A, B)
    @avx for m in eachindexalongaxis((C,A),(1,1)), n in eachindexalongaxis((C,B),(2,2)), k in eachindexalongaxis((A,B),(2,1))
        C[m,n] += A[m,k] * B[k,n]
    end
end
where it checks sizes. `eachindexalongaxis` or `each_index_along_axis` doesn't really roll off the tongue, and I don't know where such a function should live.
I'd also have to think a little more about how to use it and actually pass the information of dimension equality (to be combined with density guarantees) to deduce stride equality. The hacky way would be to just pattern match `eachindex` and `eachindexalongaxis` in the macro, handling dimension equality through the expression, while handling density info based on types.
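For reference, a minimal sketch of what such a function might look like (hypothetical; the name, signature, and error type are all up for debate):

# Check that the chosen axis of every array matches, then return the shared range.
function eachindexalongaxis(As::Tuple, dims::Tuple)
    axs = map(axes, As, dims) # axes(A, d) for each array/dimension pair
    ax1 = first(axs)
    all(==(ax1), axs) || throw(DimensionMismatch("mismatched axes: $axs"))
    return ax1
end

`eachindexalongaxis((C, A), (1, 1))` would then return `axes(C, 1)` after checking that it equals `axes(A, 1)`.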
I'll make a separate issue about the axes specific indexing.
The only thing I have left to say about this one is that you may want to include some explanation or a link so people understand the "cpu" part. It's obviously not a huge deal, but I anticipate it becoming a common question that becomes irksome to continually explain.
How about a comment simply saying that it should return false for any sort of GPUArray?
That would work.
README.md (outdated):

The elements of the tuple include
- `contig`: The axis with contiguous elements. `contig == -1` indicates no axis is contiguous. `striderank[contig]` does not necessarily equal `1`.
- `batch`: indicates the number of contiguous elements. That is, if `batch == 16`, then axis `contig` will contain batches of 16 contiguous elements interleaved with axis `findfirst(isone.(striderank))`.
- `striderank`: indicates the rank of the given stride with respect to the others. If for `A::T` we have `striderank[i] > striderank[j]`, then `stride(A,i) > stride(A,j)`.
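For intuition (my reading of the description above; the row-major value matches the `is_row_major` example earlier in this thread):

julia> stridelayout(rand(3,4))   # column major: axis 1 contiguous and lowest rank
(1, 1, (1, 2))

julia> stridelayout(rand(3,4)')  # row major: axis 2 contiguous
(2, 1, (2, 1))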
add can_avx?
This would probably be good to merge, but I am currently rewriting VectorizationBase and SIMDPirates (combining them into one library, just VectorizationBase) and the parts of LoopVectorization that will need to change to properly support this and a few other modifications. They'll all support/make use of ArrayInterface. Once all that is done, I'll finish the fast convolution example. Another change I'm working on (that contributes a lot to the overall code churn) is representing unrolled vector operations with a type, rather than successive expressions. This will have a few advantages, e.g.

@avx for i in 1:length(x) >>> 1
    a = x[2i-1]
    b = x[2i]
    # do something with `a` and `b`
end

which currently results in 2 …
Personally I'd be tempted to orthogonalize traits as much as possible. So something like … Back to the orthogonality, for the trait depot you can do something like this:

myalgorithm(A::AbstractArray) = _myalgorithm((Device(A), StridedLayout(A), some_other_trait(A)), A)
_myalgorithm(::Tuple{CPU,Contiguous{1},<:Any}, A) = # implementation for is_cpu_columnmajor
_myalgorithm(::Tuple{CPU,Contiguous{2},<:Any}, A) = # implementation for row-major
...

or if you want to do some things using constant propagation, of course there's the lovely tuple-destructuring syntax:

_myalgorithm((dev, layout, other)::Tuple{CPU,<:Any,<:Any}, A) = if layout === (1, (whatever)) ... else ... end

though it would be worth double-checking whether constant-propagation works flawlessly here.
src/ArrayInterface.jl (outdated):

    is_cpu_column_major(::Type{T})

Does an Array of type `T` point to column major memory in the cpu's address space?
If `is_cpu_column_major(typeof(A))` return `true` and the element type is a primite
Is "primite" supposed to be "primitive"?
src/ArrayInterface.jl (outdated):

"""
    stridelayout(::Type{T}) -> (contig, batch, striderank, axesdense)

Descrive the memory layout of a strided container of type `T`. If unknown or not strided, returns `nothing`.
`Descrive` -> `Describe`
@timholy, are you proposing that stride rank, batch, and contiguousness be separate traits, or just device and stride layout?
More the general principle that …
I definitely agree with replacing … I'll think more about … I'll try to start implementing some functions using this tonight. That does mean I'll likely want to make all of these types.

tdot(a::Tuple{A}, b::Tuple{B}) where {A,B} = first(a) * first(b)
tdot(a::Tuple, b::Tuple) = first(a) * first(b) + tdot(Base.tail(a), Base.tail(b))

with modifications for each … There were special … So a goal is to replace all of these with a single …

julia> struct Contiguous{N} end; Base.@pure Contiguous(N::Int) = Contiguous{N}.instance
Contiguous
julia> totuple(::Contiguous{N}, ::Val{D}) where {N,D} = ntuple(d -> Val(d == N), Val(D))
totuple (generic function with 1 method)
julia> @code_warntype totuple(Contiguous(2), Val(4))
Variables
#self#::Core.Const(totuple, false)
#unused#@_2::Core.Const(Contiguous{2}(), false)
#unused#@_3::Core.Const(Val{4}(), false)
#5::var"#5#6"{2}
Body::Tuple{Val{false},Val{true},Val{false},Val{false}}
1 ─ %1 = Core.apply_type(Main.:(var"#5#6"), $(Expr(:static_parameter, 1)))::Core.Const(var"#5#6"{2}, false)
│ (#5 = %new(%1))
│ %3 = #5::Core.Const(var"#5#6"{2}(), false)
│ %4 = Main.Val($(Expr(:static_parameter, 2)))::Core.Const(Val{4}(), false)
│ %5 = Main.ntuple(%3, %4)::Core.Const((Val{false}(), Val{true}(), Val{false}(), Val{false}()), false)
└── return %5

The … Now I'd be able to do something like

tdot(a::Tuple{A}, b::Tuple{B}, ::Tuple{Val{false}}) where {A,B} = first(a) * first(b)
tdot(a::Tuple, b::Tuple, c::Tuple{Val{false},Vararg}) = first(a) * first(b) + tdot(Base.tail(a), Base.tail(b), Base.tail(c))
tdot(a::Tuple{A}, b::Tuple{B}, ::Tuple{Val{true}}) where {A,B} = mul_no_promote_MM(first(a), first(b))
tdot(a::Tuple, b::Tuple, c::Tuple{Val{true},Vararg}) = mul_no_promote_MM(first(a), first(b)) + tdot(Base.tail(a), Base.tail(b), Base.tail(c))

I may as well separate …
I had some quick things I tested when developing the …
It's a little goofy that … What I'm going to use it for is just to know whether or not …
Most of what follows was motivated by the comment you had under … If we index something with …

is_element(x) = is_element(typeof(x))
is_element(::Type{T}) where {T<:Number} = true
is_element(::Type{T}) where {T<:CartesianIndex} = true
is_element(::Type{T}) where {T} = false
@generated is_element(::Type{T}) where {T<:Tuple} = (map(is_element, I.parameters)...,)

A generic way to move to the parent dimension space:

to_parent_dim(::Type{T}, ::Val{dim}) where {T,dim} = to_parent_dim(T, dim)
to_parent_dim(::Type{T}, dim::Integer) where {T<:AbstractArray} = dim
to_parent_dim(::Type{T}, dim::Integer) where {T<:Union{<:Adjoint,<:Transpose}} = (2, 1)[dim]
to_parent_dim(::Type{<:PermutedDimsArray{T,N,I}}, i::Integer) where {T,N,I} = I[i]
function to_parent_dim(::Type{<:SubArray{T,N,A,I}}, dim::Integer) where {T,N,A,I}
# if a view is created with something that maps to a single value then a dimension
# is dropped. We add all dimensions dropped up to i
return i + accumulate(+, is_element(I))[dim]
end

Then:

stride_rank(::Type{T}, ::Val{dim}) where {T,dim} = stride_rank(T, dim)
stride_rank(::Type{T}, dim::Integer) where {T} = nothing
stride_rank(::Type{T}, dim::Integer) where {T<:Array} = i
function stride_rank(::Type{T}, i::Integer) where {T<:AbstractArray}
if parent_type(T) <: T # assume no parent
return nothing
else
return stride_rank(parent_type(T), to_parent_dim(T, dim))
end
end

I'm not sure if this is what you had in mind as an easier/better way for getting at those values, so if it's not helpful feel free to ignore it.
That implementation (with a couple fixes below) is a bit more elegant:

is_element(x) = is_element(typeof(x))
is_element(::Type{T}) where {T<:Number} = true
is_element(::Type{T}) where {T<:CartesianIndex} = true
is_element(::Type{T}) where {T} = false
@generated is_element(::Type{T}) where {T<:Tuple} = (map(is_element, T.parameters)...,)
to_parent_dim(::Type{T}, ::Val{dim}) where {T,dim} = to_parent_dim(T, dim)
to_parent_dim(::Type{T}, dim::Integer) where {T<:AbstractArray} = dim
to_parent_dim(::Type{T}, dim::Integer) where {T<:Union{<:Adjoint,<:Transpose}} = (2, 1)[dim]
to_parent_dim(::Type{<:PermutedDimsArray{T,N,I}}, i::Integer) where {T,N,I} = I[i]
function to_parent_dim(::Type{<:SubArray{T,N,A,I}}, dim::Integer) where {T,N,A,I}
# if a view is created with something that maps to a single value then a dimension
# is dropped. We add all dimensions dropped up to i
return dim + accumulate(+, is_element(I))[dim]
end
stride_rank(::Type{T}, ::Val{dim}) where {T,dim} = stride_rank(T, dim)
stride_rank(::Type{T}, dim::Integer) where {T} = nothing
stride_rank(::Type{T}, dim::Integer) where {T<:Array} = dim
function stride_rank(::Type{T}, dim::Integer) where {T<:AbstractArray}
if parent_type(T) <: T # assume no parent
return nothing
else
return stride_rank(parent_type(T), to_parent_dim(T, dim))
end
end
stride_rank(::Type{T}) where {D, T <: AbstractArray{<:Any,D}} = ntuple(d -> stride_rank(T, d), Val{D}())
stride_rank(A) = stride_rank(typeof(A))

The difference in behavior is whether stride ranks should map to the original parents, or whether …

julia> A = rand(2,3,4,5);
julia> stride_rank(@view(PermutedDimsArray(A, (3,2,4,1))[:,1,:,:]))
(3, 4, 1)
julia> ArrayInterface.stride_rank(@view(PermutedDimsArray(A, (3,2,4,1))[:,1,:,:]))
(2, 3, 1)
julia> stride_rank(@view(PermutedDimsArray(@view(PermutedDimsArray(A, (3,2,4,1))[:,1,:,:]), (2,3,1))[:,:,1])')
(1, 4)
julia> ArrayInterface.stride_rank(@view(PermutedDimsArray(@view(PermutedDimsArray(A, (3,2,4,1))[:,1,:,:]), (2,3,1))[:,:,1])')
(1, 2)

While the … With all that in mind, I decided I like that behavior better and updated the PR.
I'd be fine with that.
> I'd be fine with that.

I hope that eventually we can just have …
Agreed this is the right way forward in the longer term. See also JuliaArrays/StaticArrays.jl#807. Speaking of static integers, did you consider using StaticNumbers.jl?
> juliamaster -O3 -e "@time using ArrayInterface"
0.099558 seconds (126.48 k allocations: 8.313 MiB)
> juliamaster -O3 -e "@time using StaticNumbers"
0.301221 seconds (635.21 k allocations: 35.458 MiB)
> juliamaster -O3 -e "@time using StaticNumbers, ArrayInterface"
0.808027 seconds (1.47 M allocations: 82.250 MiB, 1.13% gc time)
> julia -O3 -e "@time using ArrayInterface"
0.177640 seconds (294.34 k allocations: 16.299 MiB, 2.82% gc time)
> julia -O3 -e "@time using StaticNumbers"
0.177957 seconds (358.17 k allocations: 20.333 MiB, 2.80% gc time)
> julia -O3 -e "@time using StaticNumbers, ArrayInterface"
0.925970 seconds (3.28 M allocations: 162.273 MiB, 2.31% gc time)

The load time seems relatively high, although much of it is probably … But I wouldn't be opposed to us more broadly getting behind a package like StaticNumbers.jl.
Additionally, supporting other static numbers than …
That time added by …
It doesn't look bad. Constructor of …
There are a couple invalidations in ArrayInterface itself:

julia> using SnoopCompile
julia> invalidations = @snoopr using ArrayInterface;
julia> trees = invalidation_trees(invalidations)
2-element Vector{SnoopCompile.MethodInvalidations}:
inserting promote_rule(::Type{Bool}, ::Type{S}) where S<:ArrayInterface.Static in ArrayInterface at /home/chriselrod/.julia/dev/ArrayInterface/src/static.jl:34 invalidated:
backedges: 1: superseding promote_rule(::Type{Bool}, ::Type{T}) where T<:Number in Base at bool.jl:4 with MethodInstance for promote_rule(::Type{Bool}, ::Type{var"#s431"} where var"#s431"<:Integer) (2 children)
inserting convert(::Type{T}, ::ArrayInterface.Static{N}) where {T<:Number, N} in ArrayInterface at /home/chriselrod/.julia/dev/ArrayInterface/src/static.jl:18 invalidated:
backedges: 1: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{Int8}, ::Integer) (1 children)
2: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{UInt8}, ::Integer) (1 children)
3: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{Int64}, ::Integer) (9 children)
4: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{UInt64}, ::Integer) (27 children)

Can we do anything about them? I removed the …
I'd like to set the goal of merging this on Friday. I'm open to …
I'm not completely against StaticNumbers.jl (I've even brought it up in the past), but I'm not 100% sold on it. Even if we decide the niche use case for other static numbers is sufficient to justify supporting them, I'm not sure we're ready to handle things here like … If we do decide to support all static numbers, I think that's going to have to be a separate, large PR. Once you support different types of numbers for ranges that have different step sizes, it gets pretty complicated (e.g., safely handling code with smaller byte sizes, handling high-precision number types, accounting for numbers with units attached, etc.). It seems like the most straightforward approach is to just use …
FWIW I think it's completely fine to use …
Perhaps we should make a StaticNumbersBase.jl to define only the core functionality. @perrutquist what are your thoughts?
I think a StaticNumbersBase.jl package would be a good idea. I've been thinking for a while that I need to split up that package. (It started out as a generalization of Zeros.jl, and hence with real numbers, but the integer use case is fairly orthogonal to that.) Maybe even better if we could have a StaticBase.jl which also contains stubs and abstract types for static arrays and iterators? That way it would be easier to define interaction without using Requires. However, it seems that most of the load time (and post-load slowdown) in StaticNumbers.jl comes from a single method definition: …
I don't really have any experience here, but this by itself at least doesn't seem to take very long on Julia 1.6:

julia> using SnoopCompile
julia> invalidations = @snoopr begin
struct Static{N} <: Integer end
(::Type{T})(::Static{N}) where {T <: Number, N} = T(N)
end;
julia> invalidation_trees(invalidations)
1-element Vector{SnoopCompile.MethodInvalidations}:
inserting (::Type{T})(::Static{N}) where {T<:Number, N} in Main at REPL[2]:3 invalidated:
mt_backedges: 1: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for findnext(::Pkg.Types.var"#22#23"{String}, ::AbstractString, ::Integer) (0 children)
2: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _array_for(::Type{T}, ::Any, ::Base.HasLength) where T (0 children)
3: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for to_shape(::Integer) (0 children)
4: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _array_for(::DataType, ::Any, ::Base.HasLength) (0 children)
5: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _similar_for(::Vector{_A} where _A, ::Type, ::Base.Generator{_A, Base.var"#811#813"} where _A, ::Base.HasLength) (0 children)
6: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _similar_for(::Vector{_A} where _A, ::DataType, ::Base.Generator{_A, Base.var"#811#813"} where _A, ::Base.HasLength) (0 children)
7: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _array_for(::Type{Base.RefValue{Any}}, ::Any, ::Base.HasLength) (0 children)
8: signature Tuple{Type{UInt16}, Integer} triggered MethodInstance for connect!(::Sockets.TCPSocket, ::Sockets.IPv4, ::Integer) (0 children)
9: signature Tuple{Type{UInt16}, Integer} triggered MethodInstance for connect!(::Sockets.TCPSocket, ::Sockets.IPv6, ::Integer) (0 children)
10: signature Tuple{Type{UInt32}, Integer} triggered MethodInstance for VersionNumber(::Integer, ::Int64, ::Int64, ::Tuple{}, ::Tuple{}) (1 children)
11: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for manage(::Distributed.SSHManager, ::Integer, ::Distributed.WorkerConfig, ::Symbol) (1 children)
12: signature Tuple{Type{UInt64}, Integer} triggered MethodInstance for convert(::Type{UInt64}, ::Integer) (7 children)
13: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for prevind(::String, ::Integer) (11 children)
14: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for full_va_len(::Core.SimpleVector) (784 children)

Although 784 sounds like a lot of children (for …). It seems odd to me that

julia> struct Static{N} <: Integer end
julia> @which Int(Static{5}())
ERROR: no unique matching method found for the specified argument types

there are no methods, yet we can't insert one without invalidating others? I'd be in favor of a … While I like the brevity of …
I'm only speculating, but I think the invalidations come from methods that were defined when … If you're thinking of renaming …
I played with static numbers pretty extensively a while ago. There are all sorts of quirky things that will get you if you're not careful. I recall getting tripped up on ensuring type stability with a certain type of log method.
You have to be careful with …
I'm assuming you mean the difference between …
I meant:

julia> using StaticArrays, ArrayInterface
julia> A = @SMatrix rand(4,3);
julia> @code_warntype getindex(ArrayInterface.size(A), 1)
Variables
#self#::Core.Compiler.Const(getindex, false)
t::Core.Compiler.Const((ArrayInterface.StaticInt{4}(), ArrayInterface.StaticInt{3}()), false)
i::Int64
Body::Union{ArrayInterface.StaticInt{4}, ArrayInterface.StaticInt{3}}
1 ─ nothing
│ %2 = Base.getfield(t, i, $(Expr(:boundscheck)))::Union{ArrayInterface.StaticInt{4}, ArrayInterface.StaticInt{3}}
└── return %2
julia> @code_warntype getindex(ArrayInterface.size(A), ArrayInterface.StaticInt(1))
Variables
#self#::Core.Compiler.Const(getindex, false)
t::Core.Compiler.Const((ArrayInterface.StaticInt{4}(), ArrayInterface.StaticInt{3}()), false)
i::Core.Compiler.Const(ArrayInterface.StaticInt{1}(), false)
Body::ArrayInterface.StaticInt{4}
1 ─ nothing
│ %2 = Base.convert(Base.Int, i)::Core.Compiler.Const(1, false)
│ %3 = Base.getfield(t, %2, $(Expr(:boundscheck)))::Core.Compiler.Const(ArrayInterface.StaticInt{4}(), false)
└── return %3

You need to make sure the indices into these tuples const-prop so that the result can be inferred.
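One way to guarantee inference by construction (a sketch of the general pattern, not necessarily how ArrayInterface implements it) is to dispatch on the static index so the position is a type parameter rather than relying on constant propagation:

julia> struct StaticInt{N} <: Integer end  # stand-in for ArrayInterface.StaticInt

julia> Base.getindex(t::Tuple, ::StaticInt{N}) where {N} = t[N]  # N is known at compile time

julia> (StaticInt{4}(), StaticInt{3}())[StaticInt{1}()] === StaticInt{4}()
true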
The methods are:
- `canavx(f)`, a function whitelist for supporting `@avx`.
- `is_cpu_column_major`. The name should be explanatory; figured it'd be best to go with something unambiguous.
- `stridelayout`. Provides a description of the memory layout.

Tagging @LaurentPlagne; we discussed the intention of `stridelayout` earlier. `stridelayout` is meant to decouple the axis that is good for SIMD (indicated by `contig`) from cache-friendliness (lower `rank`s). Our goal additionally is to decouple these from the array's interface, so that we can easily optimize the memory layouts of our arrays without having to touch any of the code. This is facilitated by LoopVectorization, which will adjust the order of loops as necessary, making our code generic yet still optimal with respect to memory layout.
If you think of a `Vector{ComplexF64}` as a 2 x N `Matrix{Float64}`, then a StructOfArrays is essentially just a `transpose`, to allow you to interact with the array as before even though the memory layout is now essentially an N x 2 `Matrix{Float64}`, which should generally be much friendlier to SIMD.
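To make the analogy concrete (a quick Base-only illustration, not part of the PR):

julia> x = rand(ComplexF64, 4);

julia> M = reshape(reinterpret(Float64, x), 2, :); # 2 x N: M[1,k] == real(x[k]), M[2,k] == imag(x[k])

julia> M[1, 3] == real(x[3])
true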
However, the memory accesses for `re` and `im` numbers may now be far apart. While unlikely to be a problem in this example, this isn't necessarily the case for more general transformations of large arbitrary-dimensional arrays.
In this example, `contig, batch = 2, 4` yield a memory layout of 4 x 2 x N/4. That is, with the AVX instruction set, we could easily SIMD-load 4 contiguous elements of real and imaginary numbers at a time, yet these subsequent loads would still be contiguous and thus likely to trigger the DCU hardware prefetcher, even in more complicated examples.
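To sketch the addressing implied by `contig, batch = 2, 4` (my arithmetic, for illustration only): logical element `(c, i)` of the 2 x N matrix (`c = 1` for `re`, `c = 2` for `im`) sits at the 0-based linear offset

offset(c, i) = (i - 1) % 4 + 4 * (c - 1) + 8 * ((i - 1) ÷ 4)  # batch of 4 along i, then re/im, then block

so each run of 4 consecutive `re` (or `im`) values is contiguous, and successive 4 x 2 batches are themselves adjacent in memory.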
As another example, for AVX matrix multiplication `A * B` with contiguous-access packing, our `m_c` x `k_c` packed blocks of `A` would be reshaped into 8 x `k_c` x `m_c/8`. That is, it would correspond to `stridelayout(packed_A) == (1,8,(2,1))`: `m_c` is contiguous with batches of size 8, but has a higher stride than `k_c`.

Being able to simply try out a bunch of different memory layouts without needing to change the code for correctness (thanks to API-memory layout decoupling) or performance (thanks to LoopVectorization's loop reordering) should make this critical aspect of performance optimization much simpler, and even open the door to automation.