Issues with LinearAlgebra.transpose() #20803
@ghbrown — Sorry for the delayed response on this one, it's been a busy week. Jumping to the most interesting bit: [@vasslitvinov : Tagging you here because the following question relates to recent discussions]
We've also been wrestling with whether this is appropriate or not in #18215 and #17122; e.g., see my third bullet in #18215 (comment). I don't think the historical rationale here is a good one. Basically, I think we were making domains similarly defaulted as ranges, where the idxType defaults to `int`. Your intuition to switch from arguments like […]
I'm not aware of any reason that that should be the case... though I'm also not really clear on whether / why there should be a special case here at all. Given that the L.A. library supports a vector type (right?), it would seem odd to special-case degenerate matrices in this routine.

Interestingly, if I go back in time on that routine, circa 2017, it had special cases for either dimension being 1 element. Aha, I see what happened (and it's my fault!). In this commit, we were changing from 1- to 0-based indices for tuples, but I made a mistake and used 0/0 rather than 0/1. And then later someone dead-code eliminated the second special case.

That said, today my instinct would be to remove both special cases rather than restoring the other one. Am I missing something, or would you agree that the general case should handle both of the special cases and generate the same result? (I guess the fact that testing didn't fail when we made these changes either suggests it does, or that we don't have tests covering this case.) I'm also having trouble thinking of anything about the use of reshape that would make it faster...
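To make that question concrete, here is a minimal sketch (with hypothetical names, not the module's actual code) of what "the general case handles the degenerate shapes" would mean: a fully generic element-wise transpose applied to a 1 x n matrix, which should produce the same n x 1 result the reshape special case does.

```chapel
// Hypothetical generic transpose (not LinearAlgebra's implementation):
// swap the dimensions of the domain and copy element-wise.
proc genericTranspose(A: [?D] ?eltType) where D.rank == 2 {
  var B: [D.dim(1), D.dim(0)] eltType;
  forall (i, j) in D do
    B[j, i] = A[i, j];
  return B;
}

// A 1 x 4 "row vector in disguise"; the general loop should yield a 4 x 1 result.
var row: [1..1, 1..4] int;
for j in 1..4 do row[1, j] = 10*j;
writeln(genericTranspose(row));
```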
Thanks for the insights, @bradcray. I think consistency is probably most important between the two. But here's my (still somewhat newbie) opinion on the matter:
Admittedly, I don't know the arguments for letting them be totally concrete and then using […]. As for the […]
@ghbrown: I think that my thinking on ranges, domains, and genericity has become similar to yours over the years. I think the reason we installed these defaults to begin with was that they felt like "the common case" (so wouldn't that be convenient?), but I think in practice, people tend to lean on the type inference capabilities so heavily that it's more of a wash. And meanwhile, cases like […]
That would be awesome. L.A. is a module that's been needing someone to come give it some love and attention for a while now, and anything you can help out with would certainly be appreciated. I/We can help on our end by running testing suites for you using our machine cycles. I don't think range/domain changes are imminent, but I am hoping they'll be coming along in the next few months. I don't know that I'd necessarily hold off over that, and we'll help make those updates as the changes come online, whatever state the module is in, to keep things working. I think we should also probably ask the broader team about other ways of writing a 2D-but-otherwise-generic domain, to make sure I'm not missing a form that would work, since it'd be nice to write it in a closed form rather than relying on a where clause (if nothing else, it'd make future edits easier).
See also a Discourse thread: https://chapel.discourse.group/t/matrix-transpose/18488
Before I reply here: is the following definition, which uses only array primitive methods, identical in meaning to the slightly more convoluted version in LinearAlgebra.chpl?

proc isDenseMatrix(a : []) param
{
  return a.rank == 2 && a.isRectangular() && a.hasSingleLocalSubdomain();
}
The transpose routine within LinearAlgebra has parallel/forall overhead when used with smallish matrices, i.e. those which are smaller than, say, 64x64. In that case, a serial/for as used in the reshape primitive might be faster. I am only talking single-locale stuff here.

Has any extensive testing been done of the transpose operation in terms of when parallel is faster than serial, optimal blocking size for large matrices, and so on? I have access to some Xeon workstations (up to 52 cores), but I would be curious as to the performance on ARM, e.g. Fujitsu A64FX (as in an HPE Apollo), Apple M1 Pro/Max, or even ARM Cortex A72 (in a Raspberry Pi).

Is Chapel's recursion low or medium overhead? I was surprised at how little it seemed in the testing I did while writing that Discourse discussion mentioned by @vasslitvinov.

As a matter of interest, what size matrices is it envisaged that LinearAlgebra should deal with? Also, what sort of application do you have, @ghbrown, which wants the transpose of an array object that has a strided range?
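For what it's worth, a rough harness for probing the serial/parallel crossover being asked about might look like the sketch below (the sizes and the naive loop bodies are placeholders, not the LinearAlgebra routine itself):

```chapel
use Time;

for n in [16, 32, 64, 128, 256, 512] {
  var A, AT: [0..<n, 0..<n] real;
  forall (i, j) in A.domain do A[i, j] = i*n + j;   // arbitrary fill

  var sw: stopwatch;

  // naive serial transpose
  sw.start();
  for (i, j) in A.domain do AT[j, i] = A[i, j];
  sw.stop();
  const tSerial = sw.elapsed();

  // naive parallel transpose
  sw.clear(); sw.start();
  forall (i, j) in A.domain do AT[j, i] = A[i, j];
  sw.stop();

  writeln(n, "x", n, ": serial ", tSerial, " s, parallel ", sw.elapsed(), " s");
}
```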
@damianmoz I don't have any particular application in mind; I'm just working towards completion of JuliaLang/julia#17624.
I personally have not done anything like this for my contributions to […]. My condensed opinion on the subject is that […]. As for that Discourse post, I'm happy to post it on Discourse or as an issue here on GitHub for discussion, but I figured I would keep quiet until I could help stabilize the LinearAlgebra module.
Do you have a project plan for your work on JuliaLang/julia#17624? Where is OpenBLAS in all of this? This is the first time I have seen mention of it, but maybe I am looking in the wrong places. I would be interested in your posting, just to see what the current internal thinking is. Here is as good a place as any, rather than Discourse.
Feel free to use any of the following ideas that you find useful in a new version of transpose. If not, bin them. This work came out of the use of Chapel within other things: the Jacobi Singular Value Decomposition (SVD) of a smallish matrix, and some use of the transpose operation within some finite element program(s).

Looking at the current transpose, any revamp should look at addressing the poor performance in both (a) the case of small matrices and (b) the case of much larger matrices. The former is due to the excessive parallel overhead of its internal implicit forall in the small matrix case. The latter is caused by page faults (cache misses) when accessing the transpose by columns (while concurrently accessing the original by rows). We might discuss this revamp and how it differs from the current transpose.

While transpose's internal use of reshape on what is an Nx1 array seeks to address problem (a), it is only part of the solution to it. The reshape routine uses an array assignment construct which, even in 1.29.0, has serious performance issues, so we will avoid it and instead use a simple for loop to handle small matrix cases. Testing on Xeons says that a 16x16 matrix can be regarded as small. Further testing across other hardware might refine that size estimate. For now, we will parameterise this as some blocksize, or bs, which we will set to a default of 16.

The problem with (b) can be addressed by blocking the matrix into hypermatrices which are themselves small matrices, a long-known approach to accessing a large matrix.

We are only interested in matrices with unit stride. Anything else is of little value to the way most of us attack Linear Algebra algorithms. If we come across an array with a stride which is not one (1), we push it into a unit-stride matrix, work on that temporary copy, and then copy it back. That might not be what others outside our research team and our customers want.

We also need to provide an option to do all the work in serial mode, in case we want to use the revamp within a task which is itself a subtask of some parallel construct, where trying to go parallel within this subtask complicates things big time and cripples performance.

So, given a matrix A, we want a matrix AT which is its transpose. It will also use a blocksize bs and have a way to turn parallelism on or off by, say, a switch pa. Its prototype would then look like:

proc patranspose(AT : [?D] ?R, const A : [?S] R, bs = 16, pa = true)

where D is the destination domain of AT and S is the source domain of A. Those arrays default to a ref intent. Unlike transpose, which creates its own space for the transpose, we wanted to use space that had already been allocated. The index-neutral routine that we devised follows:

proc givenLinearAlgebraMatrix(A : []) param
{
return A.rank == 2 && A.isRectangular() && A.hasSingleLocalSubdomain() && !A.stridable;
}
// cache optimized matrix transpose - parallel
// Divide the matrix into virtual blocksize*blocksize hypermatrices, and
// transpose each such submatrix. This is probably slower than 'cotranspose'
// for matrices which are smaller than 100x100 with blocksizes of 16 or 32.
proc patranspose(AT : [?D] ?R, const A : [?S] R, bs = 32, pa = true) where
givenLinearAlgebraMatrix(A)
{
inline proc hyperRange(k) // range for hypermatrices
{
return 0 .. (k + bs - 1) / bs - 1;
}
inline proc blockRange(lo, i, hi) // range within a hypermatrix
{
const _lo = lo + i * bs;
return _lo .. min(_lo + bs - 1, hi);
}
inline proc core(rows : range, columns : range, v : [] R, u : [] R)
{
for (r, c) in { rows, columns }
{
v[c, r] = u[r, c];
}
}
inline proc trn(rows : range, columns : range, v : [] R, u : [] R)
{
const (m, rlo, rhi) = (rows.size, rows.low, rows.high);
const (n, clo, chi) = (columns.size, columns.low, columns.high);
forall (r, c) in { hyperRange(m), hyperRange(n) }
{
core(blockRange(rlo, r, rhi), blockRange(clo, c, chi), v, u);
}
}
inline proc trnserial(rows : range, columns : range, v : [] R, u : [] R)
{
const (m, rlo, rhi) = (rows.size, rows.low, rows.high);
const (n, clo, chi) = (columns.size, columns.low, columns.high);
for (r, c) in { hyperRange(m), hyperRange(n) }
{
core(blockRange(rlo, r, rhi), blockRange(clo, c, chi), v, u);
}
}
param failure = (-1), success = 0;
const (rows, columns) = S.dims();
const (_rows, _columns) = D.dims();
var endState = success;
if rows.size == _columns.size && _rows.size == columns.size
{
ref t = AT.reindex(columns, rows);
if S.size < bs * bs then
{
core(rows, columns, t, A);
}
else if pa then
{
trn(rows, columns, t, A);
}
else
{
trnserial(rows, columns, t, A);
}
}
else
{
endState = failure;
}
return endState;
}

The above is built around a core routine which handles the small matrix case. It returns zero on success, and non-zero otherwise. The (a) problem disappears because the small case is done quickly with no fancy parallelism. The (b) problem disappears by using a hypermatrix approach. The above is about 3 or 4 (or more) times faster than LinearAlgebra's transpose for matrices from 10x10 up to 400x400.

The idea of using OpenBLAS seems suboptimal. Technically BLIS could be a better choice if the idea is to use some low-level C library. But the choice of either ignores the fact that Chapel has the potential to do BLAS operations natively. It should be possible to write such functionality in 4 to 8 times less code than something like OpenBLAS or BLIS needs. That benefit can then be one of the selling points of Chapel. But the performance of that Chapel code needs to be within spitting distance of the performance of what something like BLIS (or OpenBLAS or whatever) provides. There are still some issues before that is achievable with the current Chapel compiler, but it needs to be an end-goal of Chapel. In fact, I would probably deem Chapel a failure if it cannot get to the point where reaching for a low-level C library for something like BLAS operations no longer needs to be considered. Our 2c.
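A minimal usage sketch, assuming the patranspose above is available as written (the sizes and fill below are arbitrary): allocate the destination with swapped dimensions and check the returned status.

```chapel
config const m = 1000, n = 600;

var A: [0..<m, 0..<n] real;
var AT: [0..<n, 0..<m] real;

// arbitrary fill so the copy is observable
forall (i, j) in A.domain do
  A[i, j] = i*n + j;

// the destination's shape must be the transpose of the source's shape
if patranspose(AT, A, bs=32, pa=true) != 0 then
  halt("patranspose: dimension mismatch between A and AT");
```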
My chain of planned PRs started when I tried to help complete a PR for the pseudoinverse (#17393). If I remember correctly, the generality I wanted to provide there was not supported by […]
OpenBLAS has not been discussed much here yet, at least by me. However, when surveying other successful linear algebra libraries (NumPy and Julia's `LinearAlgebra`) […]

Interestingly, there are a few cases in Julia (including transpose) where they have rolled their own BLAS-like routines; here they compare OpenBLAS to native Julia routines. It appears they do benefit from using Julia-native routines in some cases, including transpose on square matrices. I'm not sure if/how they handle platform optimization of their own routines, or if they choose some reasonable and fixed blocking sizes and let performance be what it will on different architectures.
My intended changes to the transpose routine were only going to be dead code elimination ([…]).
We implemented cache-oblivious alternatives in our work on transpose that worked wonderfully in serial mode, but they did not parallelize. Because of that, we did not pursue this approach. But feel free to chase that line if you can make it work.
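For reference, a minimal serial cache-oblivious sketch (not the code referred to above; the name and cutoff are made up): recursively split the longer dimension until the block is small, then transpose the leaf block directly.

```chapel
// e.g. cotransposeSerial(AT, A, A.domain.dim(0), A.domain.dim(1));
proc cotransposeSerial(ref AT: [] ?R, const ref A: [] R,
                       rows: range, cols: range, cutoff = 16) {
  if rows.size <= cutoff && cols.size <= cutoff {
    // leaf: small enough to transpose directly
    for (r, c) in {rows, cols} do
      AT[c, r] = A[r, c];
  } else if rows.size >= cols.size {
    // split the row range in half and recurse on each half
    const mid = rows.low + rows.size/2 - 1;
    cotransposeSerial(AT, A, rows.low..mid, cols, cutoff);
    cotransposeSerial(AT, A, mid+1..rows.high, cols, cutoff);
  } else {
    // split the column range in half and recurse on each half
    const mid = cols.low + cols.size/2 - 1;
    cotransposeSerial(AT, A, rows, cols.low..mid, cutoff);
    cotransposeSerial(AT, A, rows, mid+1..cols.high, cutoff);
  }
}
```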
Some miscellaneous points:

- Please mention what sized matrices you used for testing. Mine were 2x2 -> 1000x1000.
- At least for the finite element stuff with which I mess in C, where the matrices are not huge, BLIS and OpenBLAS really do not buy much against well-written C code. If there is enough interest, I might publish the incomplete BLAM stuff we use in place of BLIS. But it needs a lot more love and attention.
- @bradcray was a bit worried from a performance perspective that our cache-oblivious routine used recursion, but Chapel just ate it up, so his concerns were misplaced. It is dominated by data movement issues anyway. For our sized matrices, this routine was actually good enough.
- Note that we did not look at an in-place transpose, if that is what you need.
- You can argue that our core routine should have the option to totally unroll, say, the 8x8 or even the 16x16 cases.
- Code like the following could be used in the core routine,

      ref Acore = A[rows, columns];
      ref ATcore = AT[columns, rows];
      AT = for a in A do a; // copy to the transpose without needing indices

  but it runs slowly in 1.29.0. However, I believe work will be done on the performance of this sort of expression in the near future.
- For BLIS, see also […]
A serial implementation of cache-oblivious transpose holds its own against the current naive parallel transpose routine up until about a 200x200 matrix. After that, the serial code does not compete with the parallel code. Interesting.
Issue 1
Similar to JuliaLang/julia#20789, the transpose routine fails for strided ranges. I dug into it a bit, and believe I found the cause, which ends up being in line with other similar conflicts between `LinearAlgebra` and strided ranges. Here's the copied source code for the transpose of dense matrices:

Failure for strided ranges comes in the third line, when transpose gets called on a (possibly strided) domain, while the transpose function for domains only supports unit-stride domains (`inline proc transpose(D: domain(2))`).

An outstanding question I've had for a while is this: why does `domain(2)` impose unit stride? If `domain` is completely generic, then my intuition is that specifying only the order of the array (2 here) shouldn't constrain other properties of the domain, like stridedness, etc. (I'm sure there's a compelling compiler reason, but nevertheless it seems odd to me.)

Otherwise it appears this version of the transpose routine should be robust against strided ranges, which I've confirmed via testing.
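As a concrete illustration of the question (a sketch only, with a hypothetical name, not a proposed change to the module): constraining just the rank via a where clause accepts strided domains, since `dims()` preserves the striding of each range.

```chapel
// A rank-2 domain "transpose" that does not care about stride or idxType.
inline proc transposeDom(D: domain) where D.rank == 2 {
  const (r, c) = D.dims();
  return {c, r};
}

const Dstrided = {1..9 by 2, 0..20 by 5};
writeln(transposeDom(Dstrided));   // {0..20 by 5, 1..9 by 2}
```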
Issue 2
Notice in the source code that the first logical check is to determine whether the input array is really a 1-D array in disguise (as a row vector). Why special case only row vectors and not column vectors with the reshape?
Given that stridedness should probably be supported as a part of JuliaLang/julia#17624, I suspect we'll want to make a similar change to the one tested by Brad in the linked issue, where `range -> range(?)` (until ranges are made generic); here we would take `domain(2) -> domain` or `domain(2,?)`, or whatever the more permissive syntax is.

Configuration Information
`chpl --version`: