Parallel linalg #67

Open · certik opened this issue Jan 2, 2020 · 12 comments
Labels: topic: mathematics (linear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ...)

Comments

@certik (Member) commented Jan 2, 2020

The modern Fortran API for serial linear algebra (#10) seems natural.

How would that be extended to work in parallel using co-arrays? If there is a similarly "natural" parallel API for linear algebra using modern Fortran, it would be a good candidate for inclusion in stdlib. We could then have different backends that do the work (ScaLAPACK, ..., perhaps even our own simpler reference implementation using co-arrays directly). That way, if somebody writes a faster third-party library, it could be plugged in as a backend and user codes would not need to change, because they would already be using the stdlib API for parallel linear algebra.
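As a rough, purely illustrative sketch of the pluggable-backend idea (the module name `stdlib_linalg_parallel`, the procedure `psolve`, and the block-row layout are all invented for discussion, not an actual proposal):

```fortran
! Hypothetical sketch only: names and data layout are invented.
! Each image holds a contiguous block of rows of A and the matching
! slice of b; the work is forwarded to whichever backend (ScaLAPACK,
! a simple coarray reference implementation, ...) was selected.
module stdlib_linalg_parallel
  implicit none
  private
  public :: psolve
contains
  subroutine psolve(a_local, b_local, x_local)
    real(8), intent(in)  :: a_local(:,:)  ! local block of rows of A
    real(8), intent(in)  :: b_local(:)    ! local slice of b
    real(8), intent(out) :: x_local(:)    ! local slice of x
    ! ... dispatch to the selected backend here ...
    x_local = 0  ! placeholder so the sketch compiles
  end subroutine psolve
end module stdlib_linalg_parallel
```

User code would call `psolve` the same way regardless of which backend is doing the communication.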

@certik (Member, Author) commented Jan 2, 2020

@zbeekman you have a lot of experience with co-arrays, is there a way to do this?

@jvdp1 (Member) commented Jan 2, 2020

> The modern Fortran API for serial linear algebra (#10) seems natural.

Would this API also include shared-memory parallelization, especially if it is based on BLAS/LAPACK?

@certik (Member, Author) commented Jan 2, 2020

> Would this API also include shared-memory parallelization?

In the above I was thinking of distributed memory parallelization (MPI, co-arrays, ...).

What are the options for shared-memory parallelization in Fortran? I am aware of `do concurrent` and OpenMP. It seems to me, and I could be wrong, that in terms of utility, distributed memory is the most useful: it can still be run on a shared-memory computer (i.e., a single node), but it can also be run on an HPC cluster. Most of the codes that I have been working with use MPI, but they rarely use OpenMP. That being said, I did write an OpenMP version of CSR matmul in my code and it gives only about a 2x to 4x speedup on 32 cores. Terrible performance, but expected, since the operation is memory bound. I do not have an MPI version of CSR matmul, but I would expect it to run faster, since the memory would be distributed (local to each core).
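For reference, a minimal sketch of an OpenMP-parallel CSR matrix-vector product (a generic textbook formulation, not the code from the application mentioned above):

```fortran
! y = A*x for A in CSR format, with the row loop parallelized.
subroutine csr_matvec(n, row_ptr, col_idx, val, x, y)
  implicit none
  integer, intent(in)  :: n                 ! number of rows
  integer, intent(in)  :: row_ptr(n+1)      ! start of each row in val
  integer, intent(in)  :: col_idx(:)        ! column index of each nonzero
  real(8), intent(in)  :: val(:), x(:)
  real(8), intent(out) :: y(n)
  integer :: i, j
  real(8) :: s
  !$omp parallel do private(j, s)
  do i = 1, n
    s = 0
    do j = row_ptr(i), row_ptr(i+1) - 1
      s = s + val(j) * x(col_idx(j))
    end do
    y(i) = s
  end do
  !$omp end parallel do
end subroutine csr_matvec
```

Every element of `val` is streamed from memory exactly once per product, so the loop saturates memory bandwidth long before it saturates 32 cores, which is consistent with the modest speedup reported above.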

@zbeekman (Member) commented Jan 2, 2020

In general, one should be able to implement parallel LA algorithms using coarrays. The coarray implementation may be shared memory, distributed memory, hybrid, etc.; the standard doesn't specify. Part of the point of coarrays is to have a simpler API and programming model that can be divorced from the underlying implementation.

The trickier question is, perhaps, what should the interface look like? How much ownership and control should the client code have over the objects? Should the user create and pass coarrays? Or should there be a global array view that makes it appear as though you're working with normal arrays?

Last I checked, there were some non-trivial issues with the coarray specification in the standard that make them challenging or impossible to use in some applications, especially computations on unstructured meshes and some other graph and graph-like algorithms. I don't recall the details, but I believe Salvatore Filippone (the PSBLAS author) submitted a proposal to J3 to resolve it, or at least to highlight the issue in the standard.

If I remember correctly, Intel provides a shared-memory coarray implementation on some platforms: without Parallel Studio Cluster Edition, the Intel Fortran compiler ships a shared-memory coarray implementation, and a Cluster Edition license unlocks the MPI back end (or at least the SDK/compile-time pieces).

Using coarrays is nice because it abstracts away the backend. OpenCoarrays' main backend is MPI, but we have an experimental/partial one based on OpenSHMEM, and at one point in the past we were using GASNet. So I think coarrays are a natural and good choice for parallelism, but a few issues remain:

  • Support from compiler vendors, especially for features such as events and collectives that didn't make it into the 2008 standard
  • Outstanding issues with asymmetric coarrays, as I alluded to above

OpenMP is nice because of its built-in conditional compilation and support for GPUs and accelerators. Thread affinity and other threading issues are certainly tricky to get right, however.
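To make the "programming model divorced from the implementation" point concrete, here is a minimal self-contained sketch of a distributed dot product in the coarray style (`co_sum` is one of the Fortran 2018 collectives of the kind mentioned in the first bullet above); whether the images map to threads on one node or processes across a cluster is left to the implementation:

```fortran
! Each image holds a slice of x and y; co_sum reduces the partial
! sums across all images. The same source runs on shared or
! distributed memory, depending on the compiler's coarray backend.
program codot
  implicit none
  integer, parameter :: nloc = 1000   ! local slice length per image
  real(8) :: x(nloc), y(nloc), s
  call random_number(x)
  call random_number(y)
  s = dot_product(x, y)   ! local partial sum
  call co_sum(s)          ! global reduction (Fortran 2018 collective)
  if (this_image() == 1) print *, 'dot =', s
end program codot
```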

@ivan-pi (Member) commented Jan 2, 2020

The book by Numrich, *Parallel Programming with Co-arrays*, discusses an API for both sparse and dense linear algebra using co-arrays.

I know that for PSBLAS they recently developed a co-array backend. A recent article discusses the topic (a draft is available somewhere on GitHub).

@zbeekman (Member) commented Jan 2, 2020

If we can use or adopt parts of PSBLAS, that would be nice; better than reinventing the wheel.

@certik (Member, Author) commented Jan 2, 2020

@zbeekman I was led to believe at the latest J3 meeting that co-arrays can be used today with GFortran, Intel, and Cray for anything that MPI can be used for, including unstructured meshes (that was my first question to them). But I haven't used co-arrays myself yet.

My understanding is also that you can mix and match co-arrays with MPI, is that correct?

I would go ahead and try to figure out what the API should look like using co-arrays, and if we like it, we can work towards putting it into stdlib. If we can't agree on a good design due to fundamental limitations of co-arrays, then let's submit proposals to the J3 committee to fix them.

I would think exposing co-arrays directly to the user would be the natural lowest-level API, similar to how the serial linalg API just operates on arrays. Then we can always see if there is a good optional higher-level API, whether object-oriented or built around some global object (state?), similar to how there can be an optional OO API on top of the serial linalg. Let's brainstorm this more on some example.
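As one possible concrete example to brainstorm around, here is a sketch of a lowest-level routine operating directly on a caller-owned coarray (the name `pmatvec` and the replicated-x / block-of-rows layout are invented purely for illustration):

```fortran
! y_local = A_local * x, where each image owns a block of rows of A
! (stored in a coarray the caller created) and a full copy of x.
subroutine pmatvec(a, x, y)
  real(8), intent(in)  :: a(:,:)[*]  ! this image's block of rows of A
  real(8), intent(in)  :: x(:)       ! input vector, replicated on every image
  real(8), intent(out) :: y(:)       ! this image's slice of the result
  y = matmul(a, x)  ! purely local work in this simple layout
  sync all          ! make all slices available before any remote access
end subroutine pmatvec
```

This mirrors the serial API (plain procedures on arrays), with the distribution made explicit through the coarray argument.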

@ivan-pi thanks for the pointers --- both links contain very useful info. They have done a lot of thinking about this, so we should see if we can use their API.

@jvdp1 (Member) commented Jan 2, 2020

> Most of the codes that I have been working with use MPI, but they rarely use OpenMP. That being said, I did write an OpenMP version of CSR matmul in my code and it gives only about a 2x to 4x speedup on 32 cores.

@certik I usually rely on Sparse BLAS for such operations (http://www.netlib.org/utk/people/JackDongarra/etemplates/node381.html), mainly with the MKL version.

@jvdp1 (Member) commented Jan 2, 2020

> The book by Numrich, *Parallel Programming with Co-arrays*, discusses an API for both sparse and dense linear algebra using co-arrays.

I think it would be a good start.
@ivan-pi Do you know if the library on which the book is based is available somewhere? Many articles by Numrich mention it, but I am not sure if it has ever been released.

@ivan-pi (Member) commented Jan 2, 2020

> @ivan-pi Do you know if the library on which the book is based is available somewhere? Many articles by Numrich mention it, but I am not sure if it has ever been released.

I have not found the library anywhere, and the book does not offer a link either. The book mostly contains the subroutine prototypes, a description of the variables, and some discussion of the API design.

@zbeekman (Member) commented Jan 6, 2020

> @zbeekman I was led to believe at the latest J3 meeting that co-arrays can be used today with GFortran, Intel, and Cray for anything that MPI can be used for, including unstructured meshes (that was my first question to them). But I haven't used co-arrays myself yet.

Yes, this is more or less true. I don't remember the particular issue, but I recall that @sfilippone found a subtlety in the standard that caused a large headache/impediment in realizing the more complex data structures/machinery needed for unstructured meshes. I cannot immediately recall the details; maybe the OpenCoarrays repo has issues discussing this, or maybe Salvatore can remind me here.

> My understanding is also that you can mix and match co-arrays with MPI, is that correct?

Yes, in theory this should be true. One complication is that if coarrays are implemented via MPI, the compiler-provided Fortran runtime is responsible for initializing MPI, which may not be ideal in certain situations. I think we implemented a configure-time option in OpenCoarrays to return the global communicator to the user, or to delay `MPI_Init()` and let the user call it; I'd have to double-check.
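A minimal sketch of this kind of mixing, assuming an MPI-backed coarray runtime (e.g., OpenCoarrays over MPI) that has already initialized MPI before user code runs; the rank-to-image mapping noted in the comment is typical but not guaranteed by the standard:

```fortran
! Coarrays and MPI side by side. We query MPI_Initialized instead of
! calling MPI_Init unconditionally, since the coarray runtime may
! have initialized MPI already.
program mix
  use mpi
  implicit none
  integer :: ierr, rank
  logical :: inited
  call MPI_Initialized(inited, ierr)
  if (.not. inited) call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  ! With OpenCoarrays over MPI, rank == this_image() - 1 is the usual
  ! mapping, but no standard guarantees it.
  print *, 'image', this_image(), 'of', num_images(), '-> MPI rank', rank
end program mix
```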

@sfilippone commented Jan 7, 2020 via email

jvdp1 added the label "topic: mathematics" on Jan 18, 2020