Commit 8c11590

committed
Updates to H-P Python, Fortran, and C++ short courses
1 parent 6979619 commit 8c11590

15 files changed

+162
-54
lines changed

content/courses/cpp-introduction/setting_up.md

+5-4
Original file line numberDiff line numberDiff line change
@@ -61,13 +61,14 @@ Recently, Microsoft has released the Windows Subsystem for Linux ([WSL](https://
6161
A drawback to both Cygwin and the WSL is portability of executables. Cygwin executables must be able to find the Cygwin DLL and are therefore not standalone.
6262
WSL executables only run on the WSL. For standalone, native binaries a good choice is _MinGW_. MinGW is derived from Cygwin.
6363

64-
MinGW provides a free distribution of gcc/g++/gfortran. The standard MinGW distribution is updated fairly rarely and generates only 32-bit executables. We will describe [MinGW-w64](http://mingw-w64.org/doku.php), a fork of the original project.
64+
MinGW provides a free distribution of gcc/g++/gfortran. The standard MinGW distribution is updated fairly rarely and generates only 32-bit executables. We will describe [MinGW-w64](https://www.mingw-w64.org/), a fork of the original project.
6565
{{< figure src="/courses/cpp-introduction/img/MinGW1.png" width=500px >}}
6666

67-
MinGW-w64 can be installed beginning from the [MSYS2](https://www.msys2.org/) project. MSYS2 provides a significant subset of the Cygwin tools.
68-
Download and install it.
67+
MinGW-w64 can be installed beginning from the [MSYS2](https://www.msys2.org/) project. MSYS2 provides a significant subset of the Cygwin tools. Download and install it.
6968
{{< figure src="/courses/cpp-introduction/img/MSYS2.png" width=500px >}}
70-
Once it has been installed, follow the [instructions](https://www.msys2.org/) to open a command-line tool, update the distribution, then install the compilers and tools.
69+
Once it has been installed, follow the [instructions](https://www.msys2.org/) to open a command-line tool, update the distribution, then install the compilers and tools.
70+
71+
Microsoft has posted a [guide](https://code.visualstudio.com/docs/cpp/config-mingw) to installing the MinGW-w64 compilers for use with VSCode.
7172

7273
_Intel oneAPI_
7374
First install [Visual Studio](https://visualstudio.microsoft.com/vs/community/).

content/courses/fortran-introduction/setting_up.md

+5-4
@@ -54,13 +54,14 @@ Recently, Microsoft has released the Windows Subsystem for Linux ([WSL](https://
5454
A drawback to both Cygwin and the WSL is portability of executables. Cygwin executables must be able to find the Cygwin DLL and are therefore not standalone.
5555
WSL executables only run on the WSL. For standalone, native binaries a good choice is _MinGW_. MinGW is derived from Cygwin.
5656

57-
MinGW provides a free distribution of gcc/g++/gfortran. The standard MinGW distribution is updated fairly rarely and generates only 32-bit executables. We will describe [MinGW-w64](http://mingw-w64.org/doku.php), a fork of the original project.
57+
MinGW provides a free distribution of gcc/g++/gfortran. The standard MinGW distribution is updated fairly rarely and generates only 32-bit executables. We will describe [MinGW-w64](https://www.mingw-w64.org/), a fork of the original project.
5858
{{< figure src="/courses/fortran-introduction/img/MinGW1.png" width=500px >}}
5959

60-
MinGW-w64 can be installed beginning from the [MSYS2](https://www.msys2.org/) project. MSYS2 provides a significant subset of the Cygwin tools.
61-
Download and install it.
60+
MinGW-w64 can be installed beginning from the [MSYS2](https://www.msys2.org/) project. MSYS2 provides a significant subset of the Cygwin tools. Download and install it.
6261
{{< figure src="/courses/fortran-introduction/img/MSYS2.png" width=500px >}}
63-
Once it has been installed, follow the [instructions](https://www.msys2.org/) to open a command-line tool, update the distribution, then install the compilers and tools.
62+
Once it has been installed, follow the [instructions](https://www.msys2.org/) to open a command-line tool, update the distribution, then install the compilers and tools. For Fortran users, the `mingw64` repository may be preferable to the `ucrt64` repo. To find packages, visit their [repository](https://packages.msys2.org/package/).
63+
64+
Microsoft has posted a [guide](https://code.visualstudio.com/docs/cpp/config-mingw) to installing the MinGW-w64 compilers for use with VSCode. To use mingw64 rather than ucrt64, simply substitute the text string. Fortran users should install both the C/C++ and Fortran extensions for VSCode.
6465

6566
_Intel oneAPI_
6667
Download and install the basic toolkit and, for Fortran, the HPC toolkit.

content/courses/parallel-computing-introduction/codes/mpi_twod_exchange.py

+26-16
@@ -6,40 +6,44 @@
66
rank = comm.Get_rank()
77
nprocs = comm.Get_size()
88

9-
N = 500
10-
M = 500
9+
N = 400
10+
M = 600
11+
12+
#This example exchanges data among four rectangular domains with halos.
13+
#Most real codes use squares, but we want to illustrate how to use different
14+
#dimensions.
1115

1216
#Divide up the processes. Either we require a perfect square, or we
1317
#must specify how to distribute by row/column. In a realistic program,
1418
#the process distribution (either the total, for a perfect square, or
1519
#the rows/columns) would be read in and we would need to check that the number
1620
#of processes requested is consistent with the decomposition.
1721

18-
nproc_rows=5
19-
nproc_cols=5
22+
nproc_rows=2
23+
nproc_cols=3
2024

2125
if nproc_rows*nproc_cols != nprocs:
2226
print("Number of rows times columns does not equal nprocs")
2327
sys.exit()
2428

2529
#Strong scaling
26-
if N%nprocs==0 and M%nprocs==0:
30+
if N%nproc_rows==0 and M%nproc_cols==0:
2731
nrl = N//nproc_rows
2832
ncl = M//nproc_cols
2933
else:
3034
print("Number of ranks should divide the number of rows evenly.")
3135
sys.exit()
3236

33-
#Weak scaling
34-
#nrl = N
35-
#ncl = M
36-
3737
w = np.zeros((nrl+2, ncl+2), dtype=np.double)
3838

3939
#Set up the topology assuming processes numbered left to right by row
4040

41-
my_row=rank%nproc_rows
42-
my_col=rank%nproc_cols
41+
print("Layout ",nproc_rows,nproc_cols)
42+
43+
my_row=rank//nproc_cols
44+
my_col=rank%nproc_cols
45+
46+
print("Topology ",rank,my_row,my_col)
4347

4448
#Set up boundary conditions
4549
if my_row == 0:
@@ -54,33 +58,36 @@
5458
if my_col == nproc_cols-1:
5559
w[:,ncl+1] = 100. # right
5660

57-
#Arbitarary value for interior that may speed up convergence somewhat.
61+
#Arbitrary value for interior that may speed up convergence somewhat.
5862
#Be sure not to overwrite boundaries.
5963
w[1:nrl+1,1:ncl+1] = 50.
6064

6165
# setting up the up and down rank for each process
6266
if my_row == 0 :
6367
up = MPI.PROC_NULL
6468
else :
65-
up = rank - ncols
69+
up = rank - nproc_cols
6670

6771
if my_row == nproc_rows - 1 :
6872
down = MPI.PROC_NULL
6973
else :
70-
down = rank + ncols
74+
down = rank + nproc_cols
7175

7276
if my_col == 0 :
7377
left = MPI.PROC_NULL
7478
else:
7579
left = rank-1
7680

77-
if my_col == ncols-1:
81+
if my_col == nproc_cols-1:
7882
right = MPI.PROC_NULL
7983
else:
8084
right = rank+1
8185

86+
print("Upsie downsie ",rank,my_row,my_col,up,down)
87+
8288
# set up MPI vector type for column
83-
column=MPI.DOUBLE.Create_vector(
89+
column=MPI.DOUBLE.Create_vector(nrl,1,ncl+2) # stride is the full row length of w, halos included
90+
column.Commit()
8491

8592
tag=0
8693

@@ -90,6 +97,9 @@
9097
comm.Sendrecv([w[nrl,1:ncl+1],MPI.DOUBLE], down, tag, [w[0,1:ncl+1],MPI.DOUBLE], up, tag)
9198

9299
# sending right and left.
100+
comm.Sendrecv((w[0,ncl+1:ncl+2],1,column), right, tag, (w[0:,1],MPI.DOUBLE), left, tag)
101+
102+
comm.Sendrecv((w[0,0:1],1,column), left, tag, (w[0:,ncl],1,MPI.DOUBLE), left, tag)
93103

94104
# Spot-check result
95105
for n in range(nprocs):

content/courses/parallel-computing-introduction/distributed_mpi_types.md

+23-2
@@ -12,17 +12,22 @@ Modern programming languages provide data structures that may be called "structs
1212

1313
MPI also provides a general type that enables programmer-defined datatypes. Unlike arrays, which must be adjacent in memory, MPI derived datatypes may consist of elements in noncontiguous locations in memory.
1414

15-
While more general derived MPI datatypes are available, one of the most commonly used is the `MPI_TYPE_VECTOR`. This creates a group of elements separated by a constant interval, called the _stride_, in memory. Examples would be generating a type for columns in a row-major-oriented language, or rows in a column-major-oriented language.
15+
While more general derived MPI datatypes are available, one of the most commonly used is the `MPI_TYPE_VECTOR`. This creates a group of elements of size _blocklength_ separated by a constant interval, called the _stride_, in memory. Examples would be generating a type for columns in a row-major-oriented language, or rows in a column-major-oriented language.
16+
17+
{{< figure src="/courses/parallel-computing-introduction/img/mpi_vector_type.png" caption="Layout in memory for vector type. In this example, the blocklength is 4, the stride is 6, and the count is 3." >}}
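The relationship among count, blocklength, and stride can be checked without MPI. The following is only an illustrative sketch in plain Python; the helper name `vector_offsets` is hypothetical and is not part of MPI or mpi4py. It lists the element offsets, relative to the starting position, that a vector type with the given parameters would cover.

```python
def vector_offsets(count, blocklength, stride):
    """Return element offsets (relative to the start) covered by a vector type.

    Hypothetical helper for illustration only; it mimics the layout rule
    of MPI_Type_vector but does not call MPI.
    """
    return [block * stride + k
            for block in range(count)
            for k in range(blocklength)]


# Same parameters as the figure: 3 blocks of 4 elements, blocks start 6 apart.
offsets = vector_offsets(count=3, blocklength=4, stride=6)
print(offsets)
```

With a blocklength of 1 and a stride equal to the row length, the same rule selects one column of a row-major array.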
1618

1719
C++
1820
```c++
21+
MPI_Datatype newtype;
1922
MPI_Type_vector(ncount, blocklength, stride, oldtype, &newtype);
2023
```
2124
Fortran
2225
```fortran
26+
integer newtype
27+
!code
2328
call MPI_TYPE_VECTOR(ncount, blocklength, stride, oldtype, newtype, ierr)
2429
```
25-
For both C++ and Fortran, `ncount`, `blocklength`, and `stride` must be integers. The `oldtype` is a pre-existing type, usually a built-in MPI Type such as MPI_FLOAT or MPI_REAL. For C++ it would be declared as an `MPI_Datatype`, but if built-ins are used that would be automatic. For Fortran `oldtype` would be an integer if not a built-in type. The `newtype` is a name chosen by the programmer.
30+
For both C++ and Fortran, `ncount`, `blocklength`, and `stride` must be integers. The `oldtype` is a pre-existing type, usually a built-in MPI Type such as MPI_FLOAT or MPI_REAL. For C++ the new type would be declared as an `MPI_Datatype`, unless it corresponds to an existing built-in type. For Fortran `oldtype` would be an integer if not a built-in type. The `newtype` is a name chosen by the programmer.
2631

2732
Python
2833
```python
@@ -43,4 +48,20 @@ Python
4348
newtype.Commit()
4449
```
4550

51+
To use our newly committed type in an MPI communication function, we must pass it the starting position of the data to be placed into the type.
52+
53+
C++
54+
```c++
55+
MPI_Send(&a[0][i],1,newtype,i,tag,MPI_COMM_WORLD);
56+
//We need to pass the first element by reference because an array element
57+
//is not a pointer
58+
```
59+
60+
Fortran
61+
```fortran
62+
call MPI_Send(a(1,i),1,newtype,i,tag,MPI_COMM_WORLD,ierr)
63+
```
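Why the starting position matters can be seen with NumPy alone. This is a minimal sketch, not MPI code: the array shape and the helper `column_from_start` are assumptions made for illustration. It treats a row-major array as one flat buffer and picks out a column by giving a start offset, a count, and a stride, which is the same information a committed vector type and a starting address supply to a send.

```python
import numpy as np

nrows, ncols = 3, 4  # illustrative sizes, not taken from the course code
a = np.arange(nrows * ncols, dtype=np.double).reshape(nrows, ncols)
flat = a.ravel()  # for a C-contiguous array this is a view of the same buffer


def column_from_start(flat_buffer, start, count, stride):
    """Select `count` elements spaced `stride` apart, beginning at `start`.

    Hypothetical helper: blocklength is fixed at 1, as for a single column.
    """
    stop = start + (count - 1) * stride + 1
    return flat_buffer[start:stop:stride]


j = 2  # the column we want; its first element sits at flat offset j
col = column_from_start(flat, start=j, count=nrows, stride=ncols)
print(col)
```

Changing only `start` moves the same pattern to a different column, so one committed type can serve every column of the array.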
64+
65+
66+
4667
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
---
2+
title: "MPI Vector Type Example"
3+
toc: true
4+
type: docs
5+
weight: 230
6+
menu:
7+
parallel_programming:
8+
parent: Distributed-Memory Programming
9+
---
10+
11+
Our example will construct an $N \times M$ array of floating-point numbers. In C++ and Python we will exchange the "halo" columns using the MPI type, and the rows in the usual way. In Fortran we will exchange "halo" rows with MPI type and columns with ordinary Sendrecv.
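Before any exchange, each process needs its local block size and the ranks of its neighbors. The sketch below shows one way to compute them, assuming processes are numbered left to right by row. It does not call MPI: `PROC_NULL` is a placeholder standing in for `MPI.PROC_NULL`, and the function names are hypothetical.

```python
PROC_NULL = -1  # placeholder for MPI.PROC_NULL (assumption for this sketch)


def local_shape(N, M, nproc_rows, nproc_cols):
    """Rows and columns owned by each process, excluding halos."""
    if N % nproc_rows != 0 or M % nproc_cols != 0:
        raise ValueError("process grid must divide the array evenly")
    return N // nproc_rows, M // nproc_cols


def neighbors(rank, nproc_rows, nproc_cols):
    """Grid position and neighbor ranks for row-by-row process numbering."""
    my_row, my_col = divmod(rank, nproc_cols)
    up = rank - nproc_cols if my_row > 0 else PROC_NULL
    down = rank + nproc_cols if my_row < nproc_rows - 1 else PROC_NULL
    left = rank - 1 if my_col > 0 else PROC_NULL
    right = rank + 1 if my_col < nproc_cols - 1 else PROC_NULL
    return {"row": my_row, "col": my_col,
            "up": up, "down": down, "left": left, "right": right}


# A 2 x 3 process grid over a 400 x 600 array, as in mpi_twod_exchange.py
print(local_shape(400, 600, 2, 3))
for rank in range(6):
    print(rank, neighbors(rank, 2, 3))
```

Processes on an edge get the null process as a neighbor, so the same Sendrecv calls can be issued everywhere without special cases.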

content/courses/python-high-performance/codes/cupy_example.py

+3-2
@@ -8,8 +8,9 @@
88
l2_cpu = np.linalg.norm(x_cpu)
99
l2_gpu = cp.linalg.norm(x_gpu)
1010

11-
print("Using Numpy: ", l2_cpu)
12-
print("\nUsing Cupy: ", l2_gpu)
11+
print("Norm output using Numpy: ", l2_cpu)
12+
print("Norm output using Cupy: ", l2_gpu)
13+
print()
1314

1415
print("Setting up arrays on host and device")
1516
s = time.time()

content/courses/python-high-performance/codes/dask-scratch-space/global.lock

Whitespace-only changes.

content/courses/python-high-performance/codes/dask-scratch-space/purge.lock

Whitespace-only changes.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
# Copyright 2008-2021 Andreas Kloeckner
2+
# Copyright 2021 NVIDIA Corporation
3+
4+
from numba import cuda
5+
6+
import pycuda.driver as pycuda
7+
# We use autoprimaryctx instead of autoinit because Numba can only operate on a
8+
# primary context
9+
import pycuda.autoprimaryctx # noqa
10+
import pycuda.gpuarray as gpuarray
11+
12+
import numpy
13+
14+
15+
# Create a PyCUDA gpuarray
16+
a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
17+
print("original array:")
18+
print(a_gpu)
19+
20+
21+
# A standard Numba kernel that doubles its input array
22+
@cuda.jit
23+
def double(x):
24+
i, j = cuda.grid(2)
25+
26+
if i < x.shape[0] and j < x.shape[1]:
27+
x[i, j] *= 2
28+
29+
30+
# Call the Numba kernel on the PyCUDA gpuarray, using the CUDA Array Interface
31+
# transparently
32+
double[(4, 4), (1, 1)](a_gpu)
33+
print("doubled with numba:")
34+
print(a_gpu)

content/courses/python-high-performance/compiled_code.md

+22-9
@@ -14,14 +14,27 @@ Broadly speaking, interpreted languages tend to be slow, but are relatively easy
1414

1515
Function libraries can be written in C/C++/Fortran and converted into Python-callable modules.
1616

17-
### Fortran
17+
In order to wrap code written in a compiled language, you must have a compiler for the appropriate language installed on your system.
1818

19-
* If you have Fortran source code you can use f2py
20-
* Part of NumPy
21-
* Can work for C as well
22-
* Extremely easy to use
23-
* Can wrap legacy F77 and some newer F90+ (modules are supported)
24-
* Must be used from the command line
19+
#### Windows
20+
21+
If you will not use Fortran, you can install Microsoft Visual Studio. A community edition is available free for personal use and includes C and C++ compilers. If you might use Fortran, a good option is [MinGW-w64](https://www.mingw-w64.org/). This may also provide good compatibility with Anaconda even if you do not expect to use Fortran. MinGW-w64 provides several options for builds of `gcc` (the GNU Compiler Collection). The `ucrt` build is recommended but may be a little rough around the edges, at least for Fortran users. The older `mingw64` build may be more suitable. Either or both can be installed on the same system; the path will select the compiler used by Python or the IDE. A nice tutorial on installing MinGW-w64 and using it with the free [VSCode IDE](https://code.visualstudio.com/) is [here](https://code.visualstudio.com/docs/cpp/config-mingw). You must install VSCode extensions for C/C++ and, if appropriate, Fortran. To install the mingw64 version, simply substitute that name for ucrt in the `pacman` instructions. For Fortran, after the basic toolchain is installed, run
22+
```no-highlight
23+
pacman -S mingw-w64-x86_64-gcc-fortran
24+
```
25+
Now go to Settings and edit your system environment variables to add `C:\msys2\mingw64\bin` to `path`. Once that is done, you can use a command line or the Anaconda PowerShell prompt to run f2py as shown below for Linux. After that, move the resulting library to an appropriate location in your PYTHONPATH.
26+
27+
#### Mac OS
28+
29+
Install XCode from the Mac App Store for the C/C++ compilers, then if appropriate install gfortran from the [Wiki](https://gcc.gnu.org/wiki/GFortranBinaries). MinGW-64 is also an option for Mac OS. Once installed you can run commands in a Terminal shell. In newer Mac OS versions the shell is `zsh` and not `bash`, but the commands shown for Linux should work without modification.
30+
31+
#### Linux
32+
33+
The gcc compiler should be installed by default but you may have to add the corresponding g++ and gfortran compilers. Refer to the documentation for your Linux distribution and package manager.
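On any of these systems it can help to confirm which compilers are visible on your path before trying to wrap code. A minimal sketch using only the Python standard library follows; the function name `find_compilers` and the default list of compiler names are assumptions for illustration, so adjust them to your installation.

```python
import shutil


def find_compilers(names=("gcc", "g++", "gfortran")):
    """Map each compiler name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_compilers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a compiler you installed is reported as not found, the usual cause is that its `bin` folder has not been added to the path.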
34+
35+
### Wrapping Fortran
36+
37+
* If you have Fortran source code you can use f2py. It is included as part of NumPy. It can work for C as well, but requires some knowledge of Fortran interfaces to do so. It can wrap nearly all legacy Fortran 77 and some of the newer Fortran 90 constructs, in particular, modules. It must be used from a command line, which is simple on Linux and Mac OS but a little more complicated on Windows.
2538

2639
http://docs.scipy.org/doc/numpy-dev/f2py/
2740

@@ -50,7 +63,7 @@ One significant weakness of f2py is limited support of the Fortran90+ standards,
5063

5164
It is also possible to wrap the Fortran code in C by various means, such as the F2003 ISO C binding features, then to use the Python-C interface packages, such as ctypes and CFFI, for Fortran. More details are available at [fortran90.org](https://fortran90.org) for interfacing with [C](https://www.fortran90.org/src/best-practices.html#interfacing-with-c) and [Python](https://www.fortran90.org/src/best-practices.html#interfacing-with-python).
5265

53-
### C
66+
### Wrapping C
5467

5568
The [CFFI](https://cffi.readthedocs.io/en/latest/overview.html) package can be used to wrap C code. CFFI (C Foreign Function Interface) wraps C _libraries_ into Python code. To use it, prepare a shared (dynamic) library of functions. This requires a C compiler, and the exact steps vary depending on your operating system. Windows compilers produce a file called a _DLL_, Unix/Linux shared libraries end in `.so`, and Mac OS shared libraries end in `.dylib`.
5669

@@ -98,7 +111,7 @@ CFFI supports more advanced features. For example, structs can be wrapped into
98111

99112
CFFI does not support C++ directly. Any C++ must be "C-like" and contain an `extern "C"` declaration.
100113

101-
### C++
114+
### Wrapping C++
102115

103116
One of the most popular packages that deals directly with C++ is [PyBind11](https://pybind11.readthedocs.io/en/stable/). Setting up the bindings is more complex than is the case for ctypes or CFFI, however, and the bindings are written in C++, not Python. Pybind11 will have to be installed through `conda` or `pip`.
104117

content/courses/python-high-performance/gpu_acceleration.md

+21-7
@@ -27,9 +27,17 @@ conda install -c conda-forge cupy
2727
You can also use pip to install CuPy.
2828
Alternatively, use a Docker [container](https://hub.docker.com/r/cupy/cupy/).
2929

30-
You must set the `CUDA_PATH` environment variable for CuPy to be able to accelerate your code properly.
30+
You must set the `CUDA_PATH` environment variable for CuPy to be able to accelerate your code properly. If you are working with your own computer, CUDA is installed from NVIDIA packages.
31+
32+
For example, on a local Linux workstation, NVIDIA installs into `/usr/local/cuda` so you should set this as your CUDA_PATH.
33+
```bash
34+
export CUDA_PATH=/usr/local/cuda
35+
```
36+
Refer to NVIDIA's instructions for other operating systems.
37+
38+
On a system such as UVA's HPC environment, the CUDA module will set the CUDA_PATH environment variable.
3139
```bash
32-
export CUDA_PATH=/usr/local/cuda/bin
40+
module load cuda
3341
```
3442

3543
Methods invoked through the CuPy module will be carried out on the GPU. Corresponding NumPy methods will be processed by the CPU as usual. Data transfer happens through _streams_. The null stream is the default.
@@ -51,14 +59,18 @@ Like CuPy, it is available through conda-forge.
5159
conda install -c conda-forge pycuda
5260
```
5361

54-
On Linux the PATH variable must include the location of the `nvcc` compiler.
62+
On Linux the PATH variable must include the location of the `nvcc` compiler. If you have your own Linux workstation you must first locate nvcc. It should be in the folder indicated by the CUDA_PATH variable, with a "bin" appended.
5563
```bash
56-
export PATH=/usr/local/cuda/bin:$PATH
64+
ls $CUDA_PATH/bin
65+
```
66+
Then add this location to your path, e.g.
67+
```bash
68+
export PATH=$CUDA_PATH/bin:$PATH
5769
```
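The same check can be made from Python before importing PyCUDA. This is a sketch under the assumption stated above, that `nvcc` lives in the `bin` folder under `CUDA_PATH`; the helper name `locate_nvcc` is hypothetical and uses only the standard library.

```python
import os
from pathlib import Path


def locate_nvcc(env=None):
    """Return the path to nvcc under $CUDA_PATH/bin, or None if not found.

    `env` defaults to the current environment; passing a dict makes the
    function easy to try out without changing real settings.
    """
    env = os.environ if env is None else env
    cuda_path = env.get("CUDA_PATH")
    if not cuda_path:
        return None
    exe = "nvcc.exe" if os.name == "nt" else "nvcc"
    candidate = Path(cuda_path) / "bin" / exe
    return candidate if candidate.is_file() else None


print(locate_nvcc())
```

A result of `None` means either `CUDA_PATH` is unset or the compiler is installed somewhere else, in which case locate it manually and adjust the path.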
5870

5971
**Example**
6072

61-
This script is copied directly from PyCUDA's examples.
73+
This script is copied directly from PyCUDA's [examples](https://github.com/berlinguyinca/pycuda/tree/master/examples).
6274
{{% code-download file="/courses/python-high-performance/codes/pycuda_example.py" lang="python" %}}
6375

6476
Much as we saw when discussing using [compiled code](/courses/python-high-performance/compiled_code), we must define our function in C style. This block of code to be executed on the device is called a _kernel_. PyCUDA compiles the kernel, uses its interface with NumPy to allocate memory on the device, copy the Ndarrays, carry out the computation, then copy the result from the device to the `dest` array.
@@ -73,6 +85,9 @@ conda install cudatoolkit
7385
```
7486
If you must use pip, you must also install the [NVIDIA CUDA SDK](https://numba.readthedocs.io/en/stable/user/installing.html).
7587

88+
Numba can be used with PyCUDA so adding it to the PyCUDA environment, which should already contain cudatoolkit, might be advisable. This example is from the PyCUDA [tutorial](https://github.com/berlinguyinca/pycuda/blob/master/doc/source/tutorial.rst).
89+
{{% code-download file="/courses/python-high-performance/codes/pycuda_numba.py" lang="python" %}}
90+
7691
### Numba Vectorization
7792

7893
Numba CUDA can "vectorize" a universal function (ufunc) by compiling it and running it on the GPU. Vectorization is implemented through a decorator.
@@ -84,8 +99,7 @@ For best performance, the signature of the function arguments must be specified.
8499
From the Numba documentation:
85100
{{% code-download file="/courses/python-high-performance/codes/numba_vectorize.py" lang="python" %}}
86101

87-
You may ignore the deprecation warning.
88-
The run may also emit a warning about underutilization:
102+
The run may emit a warning about underutilization:
89103
```no-highlight
90104
Grid size (1) < 2 * SM count (40) will likely result in GPU under utilization due to low occupancy.
91105
```
