
Issues with LinearAlgebra.transpose() #20803

Open · ghbrown opened this issue Oct 9, 2022 · 13 comments

@ghbrown (Contributor) commented Oct 9, 2022

Issue 1

Similar to #20789, the transpose routine fails for strided ranges. I dug into it a bit and believe I found the cause, which ends up being in line with other similar conflicts between LinearAlgebra and strided ranges.

Here's the copied source code for the transpose of dense matrices:

proc copy_transpose(A: [?Dom] ?eltType) where isDenseMatrix(A) {
  if Dom.shape(0) == 1 {
    return reshape(A, transpose(Dom));
  } else {
    const rDom = {Dom.dim(1), Dom.dim(0)};
    var C: [rDom] eltType;

    [(i, j) in Dom] C[j, i] = A[i, j];

    return C;
  }
}

Failure for strided ranges comes in the third line, where transpose gets called on a (possibly strided) domain, while the transpose overload for domains only supports unit-stride domains (inline proc transpose(D: domain(2))).
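
For concreteness, here is a minimal sketch of the failure mode (the array shape is illustrative, and the exact diagnostics will differ):

use LinearAlgebra;

var A: [1..9 by 2, 1..4] real; // first dimension is strided
var B = transpose(A);          // resolves to copy_transpose, which calls
                               // transpose(Dom); the only domain overload,
                               // transpose(D: domain(2)), is unit-stride only,
                               // so compilation fails here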

An outstanding question I've had for a while is this: why does domain(2) impose unit stride? If domain is completely generic, then my intuition is that specifying only the order of the array (2 here) shouldn't constrain other properties of the domain, like stridedness, etc. (I'm sure there's a compelling compiler reason, but nevertheless it seems odd to me).

Otherwise, the rest of this transpose routine appears robust against strided ranges, which I've confirmed via testing.

Issue 2

Notice in the source code that the first logical check determines whether the input array is really a 1-D array in disguise (as a row vector). Why special-case only row vectors, and not column vectors, with the reshape?

Given that stridedness should probably be supported as part of #17624, I suspect we'll want a change similar to the one Brad tested in the linked issue, where range -> range(?) (until ranges are made generic); here we'd take domain(2) -> domain or domain(2,?), or whatever the more permissive syntax is.

Configuration Information

  • Output of chpl --version:
chpl version 1.29.0 pre-release (8c86ba7079)
  built with LLVM version 14.0.6
@bradcray (Member) commented

@ghbrown — Sorry for the delayed response on this one, it's been a busy week. Jumping to the most interesting bit:

[@vasslitvinov : Tagging you here because the following question relates to recent discussions]

An outstanding question I've had for a while is this: why does domain(2) impose unit stride?

We've also been wrestling with whether this is appropriate or not in #18215 and #17122 (e.g., see my third bullet in #18215 (comment)). I don't think the historical rationale here is a good one. Basically, I think we were making domains defaulted similarly to ranges, where the idxType defaults to int and the stride-ability to "unit only". Only the rank is non-defaulted on rectangular domains, on the argument that there isn't an obvious default (Vass has argued it could be 1 and we could make both types fully-defaulted for consistency; my preference would be to go the other way and make them both either completely, or at least similarly, non-defaulted).

Your intuition to switch from arguments like domain(2) to domain(2,?) seems correct to me, though I don't think that syntax works today (but arguably it should). The way I was able to get this to go through was to use a formal type of D: domain (which is a completely generic domain) and then constrain it with where D.rank == 2.
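
To make that concrete, here's a minimal sketch of that workaround (the proc name and body are illustrative, not the library routine):

proc transposeDom(D: domain) where D.rank == 2 {
  return {D.dim(1), D.dim(0)};
}

const Dstrided = {1..9 by 2, 1..4}; // strided domains are accepted too
const Dt = transposeDom(Dstrided);  // {1..4, 1..9 by 2}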

Why special case only row vectors and not column vectors with the reshape?

I'm not aware of any reason that that should be the case... though I'm also not really clear on whether/why there should be a special case here at all. Given that the L.A. library supports a vector type (right?), it would seem odd to special-case degenerate matrices in this routine. Interestingly, if I go back in time on that routine, circa 2017, it had special cases for either dimension being 1 element. Aha, I see what happened (and it's my fault!). In this commit, we were changing from 1- to 0-based indices for tuples, but I made a mistake and used 0/0 rather than 0/1:

fd62047

And then later someone dead-code eliminated the second special-case:

#16005

That said, today my instinct would be to remove both special cases rather than restoring the other one. Am I missing something, or would you agree that the general case should handle both of the special cases and generate the same result? (I guess the fact that testing didn't fail when we made these changes either suggests it does, or that we don't have tests covering this case.) I'm also having trouble thinking of anything about the use of reshape that would make it faster...

@ghbrown (Contributor, Author) commented Oct 14, 2022

Thanks for the insights, @bradcray. I think consistency between the two is probably most important. But here's my (still somewhat newbie) opinion on the matter:

  • I'd prefer range and domain to be generic.
  • It seems unnatural that those words would carry connotations like implicit/preset values of stridedness and other properties (especially domain), or that I'd have to (in a way) erase the values of those properties to get a generic version of these types.
  • Treating even range as (partially?) concrete has caused multiple surprising (and I'd say somewhat subtle) problems, like the ones that led to my recent LinearAlgebra issues.

Admittedly, I don't know the arguments for letting them be totally concrete and then using <range/domain>(?) everywhere. Though the probability of tripping up new users (like me) is higher, the number of characters is about the same and it's otherwise not a hassle (unless you want to do something with constraints, which would appear to need where as Brad suggested, and that's rather complicated and wordy, especially for beginners).

As for the transpose-related stuff: what an amusing history that function has; yes, I think removing the special cases is the way to go, especially if there aren't clear performance implications. Since I finally grasp what most of the stridedness issues are about and I'd like to get LinearAlgebra in better shape, I'm planning a series of PRs to rectify things (unless range/domain changes are imminent), at least for Vector, Matrix (see #20810), and transpose, so I can handle this change if you want.

@bradcray (Member) commented

@ghbrown: I think that my thinking on ranges, domains, and genericity has become similar to yours over the years. I think the reason we installed these defaults to begin with was that they felt like "the common case" (so wouldn't that be convenient?), but in practice, people tend to lean on the type inference capabilities so heavily that it's more of a wash. And meanwhile, cases like var r: range = 1..10 by 2; have been annoying and confusing to folks. In the prototype that Vass implemented, the main downside seemed to be that for users who really prefer to strongly type everything (for documentation's sake? paranoia? old habits dying hard?), it's more typing in that case. But I've recently been arguing that if their goal is to type everything to document it, then that shouldn't bother them too much, because their documentation is just getting that much more precise. That said, we haven't made a final call here yet, though input like yours is definitely valuable as we wrestle with it.
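
To illustrate the annoyance (a sketch using the 1.29-era stridable param; the error behavior is my paraphrase):

var r: range = 1..10 by 2;                 // error: default 'range' is unit-stride
var s: range(stridable=true) = 1..10 by 2; // OK, but the opt-in is easy to forget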

I'm planning a series of PRs to rectify things

That would be awesome. L.A. is a module that's been needing someone to come give it some love and attention for a while now, and anything you can help out with would certainly be appreciated. I/we can help on our end by running testing suites for you using our machine cycles. I don't think range/domain changes are imminent, but I am hoping they'll come along in the next few months. I don't know that I'd hold off over that necessarily, and we'll help make those updates as the changes come online, whatever state the module is in, to keep things working. I think we should also probably ask the broader team about other ways of writing a 2D-but-otherwise-generic domain, to make sure I'm not missing a form that would work, since it'd be nice to write it in a closed form rather than relying on a where clause (if nothing else, it'd make future edits easier).

@vasslitvinov (Member) commented

See also a discourse thread: https://chapel.discourse.group/t/matrix-transpose/18488

@damianmoz commented

Before I reply here: is the following definition, which uses only primitive array methods, identical in meaning to the slightly more convoluted version in LinearAlgebra.chpl?

proc isDenseMatrix(a : []) param
{
     return a.rank == 2 && a.isRectangular() && a.hasSingleLocalSubdomain();
}

@damianmoz commented

The transpose routine within LinearAlgebra has parallel/forall overhead when used with smallish matrices, i.e., those smaller than, say, 64x64. In that case, a serial for, as used in the reshape primitive, might be faster (a size-based dispatch is sketched below). I am only talking single-locale stuff here.
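
A sketch of that dispatch (reusing Dom and eltType from the copy_transpose source above; the 64x64 threshold is this comment's guess, not a measured constant):

const rDom = {Dom.dim(1), Dom.dim(0)};
var C: [rDom] eltType;

if A.size < 64*64 then
  for (i, j) in Dom do C[j, i] = A[i, j];    // serial: avoid task overhead
else
  forall (i, j) in Dom do C[j, i] = A[i, j]; // parallel: pays off when large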

Has any extensive testing been done of the transpose operation in terms of when parallel is faster than serial, optimal blocking size for large matrices, and so on? I have access to some Xeon workstations (up to 52 cores), but I would be curious about the performance on ARM, e.g. Fujitsu A64FX (as in an HPE Apollo), Apple M1 Pro/Max, or even ARM Cortex-A72 (as in a Raspberry Pi).

Is Chapel's recursion low- or medium-overhead? I was surprised at how little it seemed in the testing I did while writing the Discourse discussion mentioned by @vasslitvinov.

As a matter of interest, what size matrices is it envisaged that LinearAlgebra should deal with?

Also, what sort of application do you have, @ghbrown, that wants the transpose of an array object with a strided range?

@ghbrown (Contributor, Author) commented Jan 6, 2023

@damianmoz I don't have any particular application in mind; I'm just working towards completion of #17624.

Has any extensive testing been done of the transpose operation in terms of when parallel is faster than serial, optimal blocking size for large matrices, and so on?

I personally have not done anything like this for my contributions to LinearAlgebra. I think the community should agree on the intent and capabilities of LinearAlgebra, which might determine how important these kinds of optimizations are to the package. I have a somewhat lengthy Discourse post queued up on this subject ("what should LinearAlgebra be, what audience does it have, how should we implement it to provide these features, how much should we rely on existing tools like OpenBLAS, etc."), including comparisons to other languages and so on.

My condensed opinion is that LinearAlgebra should provide a general (in the sense of features, types of arrays, kinds of indices) but not necessarily performant interface to a wide range of linear algebra operations. I believe this opinion is somewhat in line with other members of the community, but it's also been a while. Certainly this is no excuse to write slow code, but I don't think LinearAlgebra is ready for these kinds of lower-level optimizations right now, partly because they begin to step on the toes of OpenBLAS (which could be a very long road of duplicated effort) and would significantly complicate the code.


As for that Discourse post, I'm happy to post it on Discourse or as an issue here on GitHub for discussion, but I figured I would keep quiet until I could help stabilize the LinearAlgebra landscape a bit (hopefully via completion of #15621 and #17624, and perhaps a few others).

@damianmoz commented

Do you have a project plan for your work on #17624? Where does OpenBLAS fit into all of this? This is the first time I have seen it mentioned, but maybe I am looking in the wrong places.

I would be interested in your posting it, just to see what the current internal thinking is. Here is as good a place as any, rather than Discourse.

@damianmoz commented

Feel free to use any of the following ideas that you find useful in a new version of transpose. If not, bin them. This work came out of the use of Chapel within other things: the Jacobi singular value decomposition (SVD) of a smallish matrix, and some use of the transpose operation within some finite element programs.

Looking at the current transpose, any revamp should address its poor performance for both (a) small matrices and (b) much larger ones. The former is due to the excessive parallel overhead of its internal implicit forall in the small-matrix case. The latter is caused by page faults (cache misses) when accessing the transpose by columns (while concurrently accessing the original by rows). We might discuss this revamp and how it differs from the current transpose.

While transpose's internal use of reshape on what is really an Nx1 array seeks to address problem (a), it is only part of the solution. The reshape routine uses an array assignment construct which, even in 1.29.0, has serious performance issues, so we will avoid it and instead use a simple for loop to handle the small-matrix case. Testing on Xeons says that a 16x16 matrix can be regarded as small; further testing across other hardware might refine that size estimate. For now, we will parameterise this as some blocksize, bs, which we will set to a default of 16.

Problem (b) can be addressed by blocking the matrix into hypermatrices which are themselves small matrices, a long-known approach to accessing a large matrix.

We are only interested in matrices with unit stride. Anything else is of little value to the way most of us attack Linear Algebra algorithms. If we come across an array with a stride which is not one (1), we push it into a unit stride matrix, work on that temporary copy, and then copy it back. That might not be what others outside our research team and our customers want.
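
For clarity, a hedged sketch of that "push into unit stride" step (the helper name is mine; whole-array assignment between same-shaped arrays copies element-wise):

proc toUnitStride(const ref A : [?D] ?t) where D.rank == 2
{
        var B : [0..#D.dim(0).size, 0..#D.dim(1).size] t;

        B = A; // same shape, different (unit-stride) indices

        return B;
}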

We also need to provide an option to do all the work in serial mode, in case we want to use the revamp within a task which is itself a subtask of some parallel construct, where trying to go parallel within this subtask complicates things big time and cripples performance.

So, given a matrix A, we want a matrix AT which is its transpose. The routine will also use a blocksize bs and have a way to turn parallelism on or off via, say, a switch pa. Its prototype would then look like:

proc patranspose(AT : [?D] ?R, const A : [?S] R, bs = 16, pa = true)

where D is the destination domain of AT and S is the source domain of A. Those arrays default to a ref intent.

Unlike transpose, which creates its own space for the transpose, we wanted to use space that had already been allocated.

The index-neutral routine that we devised follows:

proc givenLinearAlgebraMatrix(A : []) param
{
        return A.rank == 2 && A.isRectangular() && A.hasSingleLocalSubdomain() && !A.stridable;
}

// cache optimized matrix transpose - parallel

// Divide the matrix into virtual blocksize*blocksize hypermatrices, and
// transpose each such submatrix. This is probably slower than 'cotranspose'
// for matrices which are smaller than 100x100 with blocksizes of 16 or 32.

proc patranspose(AT : [?D] ?R, const A : [?S] R, bs = 32, pa = true) where
        givenLinearAlgebraMatrix(A)
{
        inline proc hyperRange(k) // range for hypermatrices
        {
                return 0 .. (k + bs - 1) / bs - 1;
        }
        inline proc blockRange(lo, i, hi) // range within a hypermatrix
        {
                const _lo = lo + i * bs;

                return _lo .. min(_lo + bs - 1, hi);
        }
        inline proc core(rows : range, columns : range, v : [] R, u : [] R)
        {
                for (r, c) in { rows, columns }
                {
                        v[c, r] = u[r, c];
                }
        }
        inline proc trn(rows : range, columns : range, v : [] R, u : [] R)
        {
                const (m, rlo, rhi) = (rows.size, rows.low, rows.high);
                const (n, clo, chi) = (columns.size, columns.low, columns.high);

                forall (r, c) in { hyperRange(m), hyperRange(n) }
                {
                        core(blockRange(rlo, r, rhi), blockRange(clo, c, chi), v, u);
                }
        }
        inline proc trnserial(rows : range, columns : range, v : [] R, u : [] R)
        {
                const (m, rlo, rhi) = (rows.size, rows.low, rows.high);
                const (n, clo, chi) = (columns.size, columns.low, columns.high);

                for (r, c) in { hyperRange(m), hyperRange(n) }
                {
                        core(blockRange(rlo, r, rhi), blockRange(clo, c, chi), v, u);
                }
        }
        param failure = (-1), success = 0;
        const (rows, columns) = S.dims();
        const (_rows, _columns) = D.dims();
        var endState = success;

        if rows.size == _columns.size && _rows.size == columns.size
        {
                ref t = AT.reindex(columns, rows);

                if S.size < bs * bs then
                {
                        core(rows, columns, t, A);
                }
                else if pa then
                {
                        trn(rows, columns, t, A);
                }
                else
                {
                        trnserial(rows, columns, t, A);
                }
        }
        else
        {
                endState = failure;
        }
        return endState;
}

The above is built around a core routine which handles the small-matrix case. It returns zero on success and non-zero otherwise.
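
A hypothetical usage sketch (the shapes are illustrative):

var A : [0..#800, 0..#300] real;
var AT : [0..#300, 0..#800] real;

// ... fill A ...

if patranspose(AT, A) != 0 then
        halt("AT's dimensions do not match the transpose of A");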

Problem (a) disappears because the small case is done quickly with no fancy parallelism. Problem (b) disappears by using a hypermatrix approach. The above is about 3 or 4 (or more) times faster than LinearAlgebra's transpose for matrices from 10x10 up to 400x400.

The idea of using OpenBLAS seems suboptimal. Technically, BLIS could be a better choice if the idea is to use some low-level C library. But choosing either ignores the fact that Chapel has the potential to do BLAS operations natively. It should be possible to write such functionality in 4 to 8 times less code than something like OpenBLAS or BLIS needs; that benefit can then be one of the selling points of Chapel. But the performance of that Chapel code needs to be within spitting distance of what something like BLIS (or OpenBLAS or whatever) provides. There are still some issues before that is achievable with the current Chapel compiler, but it needs to be an end goal of Chapel. In fact, I would probably deem Chapel a failure if it cannot get to the point where it can stand in for a low-level C library for something like BLAS operations. Our 2c.

@ghbrown (Contributor, Author) commented Jan 8, 2023

Do you have a project plan for your work on #17624?

My chain of planned PRs started when I tried to help complete a PR for the pseudoinverse (#17393). If I remember correctly, the generality I wanted to provide there was not supported by Transpose, which turned out to be a Matrix issue. I've since made a PR "fixing" Matrix, so next is Vector, then Transpose. It may be wise to add tests for each of the tasks to see what's really independent and what's not, since I believe Matrix was checked off before it was actually index independent, but this would be quite a bit of work. Beyond this, my general plan is just to start at the lower-level routines and work my way up.

Where is OpenBLAS in all of this?

OpenBLAS has not been discussed much here yet, at least by me. However, the other successful linear algebra libraries I've surveyed (NumPy and Julia's LinearAlgebra) provide an abstraction that directs calls to BLAS/LAPACK (and possibly other appropriate libraries for sparsity, etc.). Depending on the sorts of overheads involved (I'm unsure what they are), this approach can be quite attractive, since it has a simple interface and platform-optimized routines (if linked to OpenBLAS). Doing much more than setting a platform-independent blocking size in Chapel's LinearAlgebra would likely greatly complicate the code, as I don't think there is yet a way to easily access the processor specifications that would allow us to branch to or set different performance constants based on cache sizes, instruction sets, etc. I should say I would quite like such a hardware-introspection library, and it could help in writing a quality Chapel BLAS/LAPACK implementation.
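
A sketch of the kind of dispatch layer being described (everything here is hypothetical: the config param, the helper names, and the native fallback are illustrative, not Chapel's actual BLAS module API):

config param useBlas = true;

proc matVec(A : [?DA] real, x : [?Dx] real) {
  var y : [DA.dim(0)] real;

  if useBlas {
    // hypothetical: hand the call to a BLAS binding (e.g., a gemv wrapper);
    // the binding's name and signature below are assumptions
    // blasGemv(A, x, y);
  } else {
    // native fallback: plain Chapel, simple but unlikely to match tuned BLAS
    forall i in DA.dim(0) do
      y[i] = + reduce (A[i, ..] * x);
  }
  return y;
}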

Interestingly, there are a few cases in Julia (including transpose) where they have rolled their own BLAS-like routines; here they compare OpenBLAS to native Julia routines. It appears they do benefit from Julia-native routines in some cases, including transpose on square matrices. I'm not sure if or how they handle platform optimization of their own routines, or whether they choose some reasonable fixed blocking sizes and let performance be what it will on different architectures.

Feel free to use any of the following ideas that you find useful in a new version of transpose.

My intended changes to the transpose routine were only going to be dead-code elimination (the reshape case) and ensuring index neutrality. However, as far as performance is concerned, I think your approach is quite elegant, and I wouldn't mind having such a routine so long as we could use the aforementioned fixed blockings, etc. Alternatively, I've heard the Julia folks discussing cache-oblivious algorithms, which would be nice as a performant and low-maintenance option.

@damianmoz commented

We implemented cache-oblivious alternatives in our work on transpose that worked wonderfully in serial mode, but they did not parallelize. Because of that, we did not pursue this approach. But feel free to chase that line if you can make it work.
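
For concreteness, a minimal serial sketch of that cache-oblivious approach (the proc name and cutoff are illustrative, and unit stride is assumed; this is the classic recursive halving, not our production code):

proc coTranspose(rows : range, cols : range, ref AT : [] ?R, const ref A : [] R)
{
        param cutoff = 16; // assumed base-case size, not a tuned value

        if rows.size <= cutoff && cols.size <= cutoff
        {
                // base case: the block is small enough to transpose directly
                for (r, c) in { rows, cols } do
                        AT[c, r] = A[r, c];
        }
        else if rows.size >= cols.size
        {
                // recursively halve the longer dimension
                const mid = rows.low + rows.size / 2;

                coTranspose(rows.low..mid-1, cols, AT, A);
                coTranspose(mid..rows.high, cols, AT, A);
        }
        else
        {
                const mid = cols.low + cols.size / 2;

                coTranspose(rows, cols.low..mid-1, AT, A);
                coTranspose(rows, mid..cols.high, AT, A);
        }
}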

@damianmoz commented Jan 9, 2023

Some miscellaneous points:

Please mention what sized matrices you used for testing. Mine were 2x2 up to 1000x1000.

At least for the finite element stuff I mess with in C, where the matrices are not huge, BLIS and OpenBLAS really do not buy much over well-written C code. If there is enough interest, I might publish the incomplete BLAM stuff we use in place of BLIS. But it needs a lot more love and attention.

@bradcray was a bit worried from a performance perspective that our cache-oblivious routine used recursion, but Chapel just ate it up, so his concerns were misplaced. It is dominated by data-movement issues anyway. For our sized matrices, this routine was actually good enough.

Note that we did not look at an in-place transpose, if that is what you need.

You could argue that our core routine should have the option to totally unroll, say, the 8x8 or even the 16x16 cases.
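
A sketch of what that unrolling might look like (a hypothetical helper; param loops ask the compiler to fully unroll the body):

inline proc core8(rlo : int, clo : int, ref v : [] ?R, const ref u : [] R)
{
        for param i in 0..7 do
                for param j in 0..7 do
                        v[clo + j, rlo + i] = u[rlo + i, clo + j];
}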

Code like the following, which could be used in the core routine, runs slowly in 1.29.0 (though I believe work will be done on the performance of this sort of expression in the near future):

ref Acore = A[rows, columns];
ref ATcore = AT[columns, rows];

ATcore = for a in Acore do a; // copy to the transpose without needing indices

For BLIS, see also

Van Zee, Field; van de Geijn, Robert (2015). "BLIS: A Framework for Rapidly Instantiating BLAS Functionality". ACM Transactions on Mathematical Software. 41 (3): 1–33. doi:10.1145/2764454.

Van Zee, Field; Smith, Tyler; Igual, Francisco; Smelyanskiy, Mikhail; Zhang, Xiangyi; Kistler, Michael; Austel, Vernon; Gunnels, John; Low, Tze Meng; Marker, Bryan; Killough, Lee; van de Geijn, Robert (2016). "The BLIS Framework: Experiments in Portability". ACM Transactions on Mathematical Software. 42 (2): 1–19. doi:10.1145/2755561.

@damianmoz commented

A serial implementation of cache-oblivious transpose holds its own against the current naive parallel transpose routine up until about a 200x200 matrix. After that, the serial code does not compete with the parallel code. Interesting.
