uvarc
diff --git a/‎content/courses/cpp-introduction/setting_up.md
+5-4 b/‎content/courses/cpp-introduction/setting_up.md
+5-4
diff --git a/‎content/courses/fortran-introduction/setting_up.md
+5-4 b/‎content/courses/fortran-introduction/setting_up.md
+5-4
diff --git a/‎content/courses/parallel-computing-introduction/codes/mpi_twod_exchange.py
+26-16 b/‎content/courses/parallel-computing-introduction/codes/mpi_twod_exchange.py
+26-16
diff --git a/‎content/courses/parallel-computing-introduction/distributed_mpi_types.md
+23-2 b/‎content/courses/parallel-computing-introduction/distributed_mpi_types.md
+23-2
diff --git a/‎content/courses/parallel-computing-introduction/distributed_mpi_types_example.md
+11 b/‎content/courses/parallel-computing-introduction/distributed_mpi_types_example.md
+11
diff --git a/‎content/courses/parallel-computing-introduction/img/mpi_vector_type.png
9.4 KB b/‎content/courses/parallel-computing-introduction/img/mpi_vector_type.png
9.4 KB
diff --git a/‎content/courses/python-high-performance/codes/cupy_example.py
+3-2 b/‎content/courses/python-high-performance/codes/cupy_example.py
+3-2
diff --git a/‎content/courses/python-high-performance/codes/dask-scratch-space/global.lock b/‎content/courses/python-high-performance/codes/dask-scratch-space/global.lock
diff --git a/‎content/courses/python-high-performance/codes/dask-scratch-space/purge.lock b/‎content/courses/python-high-performance/codes/dask-scratch-space/purge.lock
diff --git a/‎content/courses/python-high-performance/codes/pycuda_numba.py
+34 b/‎content/courses/python-high-performance/codes/pycuda_numba.py
+34
diff --git a/‎content/courses/python-high-performance/codes/pimpi.sh renamed to ‎content/courses/python-high-performance/codes/pympi.slurm b/‎content/courses/python-high-performance/codes/pimpi.sh renamed to ‎content/courses/python-high-performance/codes/pympi.slurm
diff --git a/‎content/courses/python-high-performance/compiled_code.md
+22-9 b/‎content/courses/python-high-performance/compiled_code.md
+22-9
diff --git a/‎content/courses/python-high-performance/gpu_acceleration.md
+21-7 b/‎content/courses/python-high-performance/gpu_acceleration.md
+21-7
@@ -61,13 +61,14 @@ Recently, Microsoft has released the Windows Subsystem for Linux ([WSL](https://
 A drawback to both Cygwin and the WSL is portability of executables.  Cygwin executables must be able to find the Cygwin DLL and are therefore not standalone.
 WSL executables only run on the WSL.  For standalone, native binaries a good choice is _MinGW_.  MinGW is derived from Cygwin.
 
-MinGW provides a free distribution of gcc/g++/gfortran.  The standard MinGW distribution is updated fairly rarely and generates only 32-bit executables.  We will describe [MinGW-w64](http://mingw-w64.org/doku.php), a fork of the original project.
+MinGW provides a free distribution of gcc/g++/gfortran.  The standard MinGW distribution is updated fairly rarely and generates only 32-bit executables.  We will describe [MinGW-w64](https://www.mingw-w64.org/), a fork of the original project.
 {{< figure src="/courses/cpp-introduction/img/MinGW1.png" width=500px >}}
 
-MinGW-w64 can be installed beginning from the [MSYS2](https://www.msys2.org/) project.  MSYS2 provides a significant subset of the Cygwin tools.
-Download and install it.
+MinGW-w64 can be installed beginning from the [MSYS2](https://www.msys2.org/) project.  MSYS2 provides a significant subset of the Cygwin tools.  Download and install it.
 {{< figure src="/courses/cpp-introduction/img/MSYS2.png" width=500px >}}
-Once it has been installed, follow the [instructions](https://www.msys2.org/) to open a command-line tool, update the distribution, then install the compilers and tools.
+Once it has been installed, follow the [instructions](https://www.msys2.org/) to open a command-line tool, update the distribution, then install the compilers and tools. 
+
+A discussion of installing MinGW-w64 compilers for use with VSCode has been posted by Microsoft [here](https://code.visualstudio.com/docs/cpp/config-mingw).
 
 _Intel oneAPI_
 First install [Visual Studio](https://visualstudio.microsoft.com/vs/community/).
 
@@ -54,13 +54,14 @@ Recently, Microsoft has released the Windows Subsystem for Linux ([WSL](https://
 A drawback to both Cygwin and the WSL is portability of executables.  Cygwin executables must be able to find the Cygwin DLL and are therefore not standalone.
 WSL executables only run on the WSL.  For standalone, native binaries a good choice is _MinGW_.  MinGW is derived from Cygwin.
 
-MinGW provides a free distribution of gcc/g++/gfortran.  The standard MinGW distribution is updated fairly rarely and generates only 32-bit executables.  We will describe [MinGW-w64](http://mingw-w64.org/doku.php), a fork of the original project.
+MinGW provides a free distribution of gcc/g++/gfortran.  The standard MinGW distribution is updated fairly rarely and generates only 32-bit executables.  We will describe [MinGW-w64](https://www.mingw-w64.org/), a fork of the original project.
 {{< figure src="/courses/fortran-introduction/img/MinGW1.png" width=500px >}}
 
-MinGW-w64 can be installed beginning from the [MSYS2](https://www.msys2.org/) project.  MSYS2 provides a significant subset of the Cygwin tools.
-Download and install it.
+MinGW-w64 can be installed beginning from the [MSYS2](https://www.msys2.org/) project.  MSYS2 provides a significant subset of the Cygwin tools.  Download and install it.
 {{< figure src="/courses/fortran-introduction/img/MSYS2.png" width=500px >}}
-Once it has been installed, follow the [instructions](https://www.msys2.org/) to open a command-line tool, update the distribution, then install the compilers and tools.
+Once it has been installed, follow the [instructions](https://www.msys2.org/) to open a command-line tool, update the distribution, then install the compilers and tools. For Fortran users, the `mingw64` repository may be preferable to the `ucrt64` repository. To find packages, search the MSYS2 [package repository](https://packages.msys2.org/package/).
+
+A discussion of installing MinGW-w64 compilers for use with VSCode has been posted by Microsoft [here](https://code.visualstudio.com/docs/cpp/config-mingw). To use mingw64 rather than ucrt64, substitute that name wherever ucrt64 appears in those instructions. Fortran users should install both the C/C++ and Fortran extensions for VSCode.
 
 _Intel oneAPI_
 Download and install the basic toolkit and, for Fortran, the HPC toolkit.
 
@@ -6,40 +6,44 @@
 rank = comm.Get_rank()
 nprocs = comm.Get_size()
 
-N = 500
-M = 500
+N = 400
+M = 600
+
+#This example exchanges data among six rectangular domains (a 2x3 process grid) with halos.
+#Most real codes use squares, but we want to illustrate how to use different
+#dimensions.
 
 #Divide up the processes.  Either we require a perfect square, or we
 #must specify how to distribute by row/column.  In a realistic program,
 #the process distribution (either the total, for a perfect square, or
 #the rows/columns) would be read in and we would need to check that the number
 #of processes requested is consistent with the decomposition.
 
-nproc_rows=5
-nproc_cols=5
+nproc_rows=2
+nproc_cols=3
 
 if nproc_rows*nproc_cols != nprocs:
     print("Number of rows times columns does not equal nprocs")
     sys.exit()
 
 #Strong scaling
-if N%nprocs==0 and M%nprocs==0:
+if N%nproc_rows==0 and M%nproc_cols==0:
     nrl = N//nproc_rows
     ncl = M//nproc_cols
 else:
     print("Number of ranks should divide the number of rows evenly.")
     sys.exit()
 
-#Weak scaling
-#nrl = N
-#ncl = M
-
 w = np.zeros((nrl+2, ncl+2), dtype=np.double)
 
 #Set up the topology assuming processes numbered left to right by row
 
-my_row=rank%nproc_rows
-my_col=rank%nproc_cols
+print("Layout ",nproc_rows,nproc_cols)
+
+my_row=rank//nproc_cols
+my_col=rank%nproc_cols
+
+print("Topology ",rank,my_row,my_col)
 
 #Set up boundary conditions
 if my_row == 0:
@@ -54,33 +58,36 @@
 if my_col == nproc_cols-1:
     w[:,ncl+1] = 100.   # right
 
-#Arbitarary value for interior that may speed up convergence somewhat.
+#Arbitrary value for interior that may speed up convergence somewhat.
 #Be sure not to overwrite boundaries.
 w[1:nrl+1,1:ncl+1] = 50.
 
 # setting up the up and down rank for each process
 if my_row == 0 :
     up = MPI.PROC_NULL
 else :
-    up = rank - ncols
+    up = rank - nproc_cols
 
 if my_row == nproc_rows - 1 :
     down = MPI.PROC_NULL
 else :
-    down = rank + ncols
+    down = rank + nproc_cols
 
 if my_col == 0 :
     left = MPI.PROC_NULL
 else:
     left = rank-1
 
-if my_col == ncols-1:
+if my_col == nproc_cols-1:
     right = MPI.PROC_NULL
 else:
     right = rank+1
 
+print("Upsie downsie ",rank,my_row,my_col,up,down)
+
 # set up MPI vector type for column
-column=MPI.DOUBLE.Create_vector(
+column=MPI.DOUBLE.Create_vector(nrl+2,1,ncl+2)
+column.Commit()
 
 tag=0
 
@@ -90,6 +97,9 @@
 comm.Sendrecv([w[nrl,1:ncl+1],MPI.DOUBLE], down, tag, [w[0,1:ncl+1],MPI.DOUBLE], up, tag)
 
 # sending right and left.
+comm.Sendrecv([w.ravel()[ncl:],1,column], right, tag, [w.ravel(),1,column], left, tag)
+
+comm.Sendrecv([w.ravel()[1:],1,column], left, tag, [w.ravel()[ncl+1:],1,column], right, tag)
 
 # Spot-check result
 for n in range(nprocs):
 
@@ -12,17 +12,22 @@ Modern programming languages provide data structures that may be called "structs
 
 MPI also provides a general type that enables programmer-defined datatypes. Unlike arrays, which must be adjacent in memory, MPI derived datatypes may consist of elements in noncontiguous locations in memory.
 
-While more general derived MPI datatypes are available, one of the most commonly used is the `MPI_TYPE_VECTOR`. This creates a group of elements separated by a constant interval, called the _stride_, in memory. Examples would be generating a type for columns in a row-major-oriented language, or rows in a column-major-oriented language.  
+While more general derived MPI datatypes are available, one of the most commonly used is the `MPI_TYPE_VECTOR`. This creates a type consisting of `ncount` blocks, each containing _blocklength_ contiguous elements, with the starts of successive blocks separated in memory by a constant interval called the _stride_. Examples would be generating a type for columns in a row-major-oriented language, or rows in a column-major-oriented language.
+
+{{< figure src="/courses/parallel-computing-introduction/img/mpi_vector_type.png" caption="Layout in memory for vector type. In this example, the blocklength is 4, the stride is 6, and the count is 3." >}}
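
To make the roles of the count, blocklength, and stride concrete, the short sketch below lists the element offsets that a vector type would select. It is plain Python with no MPI calls, and the helper name `vector_offsets` is hypothetical, introduced only for illustration.

```python
def vector_offsets(count, blocklength, stride):
    """Offsets (in units of the old type) selected by a vector type."""
    return [block * stride + k
            for block in range(count)
            for k in range(blocklength)]

# The figure's example: 3 blocks of 4 elements whose starts are 6 apart.
offsets = vector_offsets(3, 4, 6)
print(offsets)

# A single column of a row-major array with 5 rows and 7 columns:
# one element per row, with the row length as the stride.
column_offsets = vector_offsets(5, 1, 7)
print(column_offsets)
```

The second call shows why a column in a row-major language needs a derived type: its elements are one full row length apart rather than adjacent.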
 
 C++
 ```c++
+MPI_Datatype newtype;
 MPI_Type_vector(ncount, blocklength, stride, oldtype, &newtype);
 ```
 Fortran
 ```fortran
+integer newtype
+!code
 call MPI_TYPE_VECTOR(ncount, blocklength, stride, oldtype, newtype, ierr)
 ```
-For both C++ and Fortran, `ncount`, `blocklength`, and `stride` must be integers. The `oldtype` is a pre-existing type, usually a built-in MPI Type such as MPI_FLOAT or MPI_REAL. For C++ it would be declared as an `MPI_Datatype`, but if built-ins are used that would be automatic.  For Fortran `oldtype` would be an integer if not a built-in type. The `newtype` is a name chosen by the programmer.
+For both C++ and Fortran, `ncount`, `blocklength`, and `stride` must be integers. The `oldtype` is a pre-existing type, usually a built-in MPI type such as `MPI_FLOAT` or `MPI_REAL`. In C++ the new type must be declared as an `MPI_Datatype` and is passed by address.  In Fortran the new type is declared as an integer, as is `oldtype` if it is not a built-in type. The `newtype` is a name chosen by the programmer.
 
 Python
 ```python
@@ -43,4 +48,20 @@ Python
 newtype.Commit()
 ```
 
+To use our newly committed type in an MPI communication function, we must pass it the starting position of the data to be placed into the type.
+
+C++
+```c++
+MPI_Send(&a[0][i],1,newtype,i,tag,MPI_COMM_WORLD);
+//We pass the address of the first element of the column because an
+//array element is not itself a pointer
+```
+
+Fortran
+```fortran
+call MPI_Send(a(i,1),1,newtype,i,tag,MPI_COMM_WORLD,ierr)
+```
+
+
+
 
@@ -0,0 +1,11 @@
+---
+title: "MPI Vector Type Example"
+toc: true
+type: docs
+weight: 230
+menu:
+    parallel_programming:
+        parent: Distributed-Memory Programming
+---
+
+Our example will construct an $N \times M$ array of floating-point numbers.  In C++ and Python we will exchange the "halo" columns using the MPI type, and the rows in the usual way.  In Fortran we will exchange "halo" rows with the MPI type and columns with an ordinary Sendrecv.
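
Before any halo exchange, each process must know which ranks sit above, below, left, and right of it. The sketch below works this out without MPI, assuming ranks are numbered left to right by row. The function name `neighbors` and the use of `None` for a missing neighbor are illustrative choices; an MPI code would use `MPI.PROC_NULL` instead.

```python
def neighbors(rank, nproc_rows, nproc_cols):
    """Return (up, down, left, right) for a row-major process grid.

    None marks a missing neighbor at a physical boundary; an MPI code
    would use MPI.PROC_NULL there instead.
    """
    my_row, my_col = divmod(rank, nproc_cols)
    up = rank - nproc_cols if my_row > 0 else None
    down = rank + nproc_cols if my_row < nproc_rows - 1 else None
    left = rank - 1 if my_col > 0 else None
    right = rank + 1 if my_col < nproc_cols - 1 else None
    return up, down, left, right

# A 2 x 3 grid of ranks:
#   0 1 2
#   3 4 5
for rank in range(6):
    print(rank, neighbors(rank, 2, 3))
```

Note that the row index comes from integer division by the number of process columns and the column index from the remainder; both use `nproc_cols`.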
@@ -8,8 +8,9 @@
 l2_cpu = np.linalg.norm(x_cpu)
 l2_gpu = cp.linalg.norm(x_gpu)
 
-print("Using Numpy: ", l2_cpu)
-print("\nUsing Cupy: ", l2_gpu)
+print("Norm output using NumPy: ", l2_cpu)
+print("Norm output using CuPy: ", l2_gpu)
+print()
 
 print("Setting up arrays on host and device")
 s = time.time()
 
@@ -0,0 +1,34 @@
+# Copyright 2008-2021 Andreas Kloeckner
+# Copyright 2021 NVIDIA Corporation
+
+from numba import cuda
+
+import pycuda.driver as pycuda
+# We use autoprimaryctx instead of autoinit because Numba can only operate on a
+# primary context
+import pycuda.autoprimaryctx  # noqa
+import pycuda.gpuarray as gpuarray
+
+import numpy
+
+
+# Create a PyCUDA gpuarray
+a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
+print("original array:")
+print(a_gpu)
+
+
+# A standard Numba kernel that doubles its input array
+@cuda.jit
+def double(x):
+    i, j = cuda.grid(2)
+
+    if i < x.shape[0] and j < x.shape[1]:
+        x[i, j] *= 2
+
+
+# Call the Numba kernel on the PyCUDA gpuarray, using the CUDA Array Interface
+# transparently
+double[(4, 4), (1, 1)](a_gpu)
+print("doubled with numba:")
+print(a_gpu)
@@ -14,14 +14,27 @@ Broadly speaking, interpreted languages tend to be slow, but are relatively easy
 
 Function libraries can be written in C/C++/Fortran and converted into Python-callable modules.  
 
-### Fortran
+In order to wrap code written in a compiled language, you must have a compiler for the appropriate language installed on your system.  
 
-* If you have Fortran source code you can use f2py
-   * Part of NumPy
-   * Can work for C as well
-   * Extremely easy to use
-   * Can wrap legacy F77 and some newer F90+ (modules are supported)
-   * Must be used from the command line
+#### Windows
+
+If you will not use Fortran, you can install Microsoft Visual Studio. A community edition is available free for personal use and includes C and C++ compilers. If you might use Fortran, a good option is [MinGW-w64](https://www.mingw-w64.org/). This may also provide good compatibility with Anaconda even if you do not expect to use Fortran.  MinGW-w64 provides several options for builds of `gcc` (the GNU Compiler Collection).  The `ucrt64` build is recommended but may be a little rough around the edges, at least for Fortran users.  The older `mingw64` build may be more suitable.  Either or both can be installed on the same system; the `Path` environment variable determines which compiler is used by Python or the IDE.  A tutorial on installing MinGW-w64 and using it with the free [VSCode IDE](https://code.visualstudio.com/) is [here](https://code.visualstudio.com/docs/cpp/config-mingw). You must install VSCode extensions for C/C++ and, if appropriate, Fortran. To install the mingw64 version, use package names beginning with `mingw-w64-x86_64-` rather than `mingw-w64-ucrt-x86_64-` in the `pacman` instructions. For Fortran, after the basic toolchain is installed, run
+```no-highlight
+pacman -S mingw-w64-x86_64-gcc-fortran
+```
+Now go to Settings and edit your system environment variables to add `C:\msys64\mingw64\bin` (the default MSYS2 location; adjust it if you installed elsewhere) to `Path`.  Once that is done, you can use a command line or the Anaconda PowerShell prompt to run f2py as shown below for Linux. After that, move the resulting library to an appropriate location in your PYTHONPATH.
+
+#### Mac OS
+
+Install XCode from the Mac App Store for the C/C++ compilers, then if appropriate install gfortran from the [Wiki](https://gcc.gnu.org/wiki/GFortranBinaries).  MinGW-64 is also an option for Mac OS. Once installed you can run commands in a Terminal shell. In newer Mac OS versions the shell is `zsh` and not `bash`, but the commands shown for Linux should work without modification.
+
+#### Linux
+
+The gcc compiler should be installed by default but you may have to add the corresponding g++ and gfortran compilers. Refer to the documentation for your Linux distribution and package manager.
+
+### Wrapping Fortran
+
+* If you have Fortran source code you can use f2py.  It is included as part of NumPy.  It can work for C as well, but requires some knowledge of Fortran interfaces to do so.  It can wrap nearly all legacy Fortran 77 and some of the newer Fortran 90 constructs, in particular, modules. It must be used from a command line, which is simple on Linux and Mac OS but a little more complicated on Windows. 
 
 http://docs.scipy.org/doc/numpy-dev/f2py/
 
@@ -50,7 +63,7 @@ One significant weakness of f2py is limited support of the Fortran90+ standards,
 
 It is also possible to wrap the Fortran code in C by various means, such as the F2003 ISO C binding features, then to use the Python-C interface packages, such as ctypes and CFFI, for Fortran.  More details are available at [fortran90.org](https://fortran90.org) for interfacing with [C](https://www.fortran90.org/src/best-practices.html#interfacing-with-c) and [Python](https://www.fortran90.org/src/best-practices.html#interfacing-with-python).
 
-### C
+### Wrapping C
 
 The [CFFI](https://cffi.readthedocs.io/en/latest/overview.html) package can be used to wrap C code.  CFFI (C Foreign Function Interface) wraps C _libraries_ into Python code. To use it, prepare a shared (dynamic) library of functions.  This requires a C compiler, and the exact steps vary depending on your operating system.  Windows compilers produce a file called a _DLL_, Unix/Linux shared libraries end in `.so`, and Mac OS shared libraries end in `.dylib`.  
 
@@ -98,7 +111,7 @@ CFFI supports more advanced features.  For example, structs can be wrapped into
 
 CFFI does not support C++ directly.  Any C++ must be "C-like" and contain an `extern "C"` declaration.
 
-### C++
+### Wrapping C++
 
 One of the most popular packages that deals directly with C++ is [PyBind11](https://pybind11.readthedocs.io/en/stable/).  Setting up the bindings is more complex than is the case for ctypes or CFFI, however, and the bindings are written in C++, not Python.  Pybind11 will have to be installed through `conda` or `pip`.
 
 
@@ -27,9 +27,17 @@ conda install -c conda-forge cupy
 You can also use pip to install CuPy.
 Alternatively, use a Docker [container](https://hub.docker.com/r/cupy/cupy/).
 
-You must set the `CUDA_PATH` environment variable for CuPy to be able to accelerate your code properly.
+You must set the `CUDA_PATH` environment variable for CuPy to be able to accelerate your code properly. If you are working with your own computer, CUDA is installed from NVIDIA packages. 
+
+For example, on a local Linux workstation, the NVIDIA installer typically places CUDA in `/usr/local/cuda`, so you should set this as your `CUDA_PATH`.
+```bash
+export CUDA_PATH=/usr/local/cuda
+```
+Refer to NVIDIA's instructions for other operating systems.
+
+On a system such as UVA's HPC environment, the CUDA module will set the CUDA_PATH environment variable.
 ```bash
-export CUDA_PATH=/usr/local/cuda/bin
+module load cuda
 ```
 
 Methods invoked through the CuPy module will be carried out on the GPU.  Corresponding NumPy methods will be processed by the CPU as usual.  Data transfer happens through _streams_.  The null stream is the default.
@@ -51,14 +59,18 @@ Like CuPy, it is available through conda-forge.
 conda install -c conda-forge pycuda
 ```
 
-On Linux the PATH variable must include the location of the `nvcc` compiler.
+On Linux the PATH variable must include the location of the `nvcc` compiler. If you have your own Linux workstation you must first locate `nvcc`. It should be in the `bin` subfolder of the location indicated by the `CUDA_PATH` variable.
 ```bash
-export PATH=/usr/local/cuda/bin:$PATH
+ls $CUDA_PATH/bin
+```
+Then add this location to your path, e.g.
+```bash
+export PATH=$CUDA_PATH/bin:$PATH
 ```
 
 **Example**
 
-This script is copied directly from PyCUDA's examples.
+This script is copied directly from PyCUDA's [examples](https://github.com/berlinguyinca/pycuda/tree/master/examples).
 {{% code-download file="/courses/python-high-performance/codes/pycuda_example.py" lang="python" %}}
 
 Much as we saw when discussing using [compiled code](/courses/python-high-performance/compiled_code), we must define our function in C style.  This block of code to be executed on the device is called a _kernel_.  PyCUDA compiles the kernel, uses its interface with NumPy to allocate memory on the device, copy the Ndarrays, carry out the computation, then copy the result from the device to the `dest` array.
@@ -73,6 +85,9 @@ conda install cudatoolkit
 ```
 If you must use pip, you must also install the [NVIDIA CUDA SDK](https://numba.readthedocs.io/en/stable/user/installing.html).
 
+Numba can be used with PyCUDA, so it may be advisable to add it to the PyCUDA environment, which should already contain cudatoolkit. This example is from the PyCUDA [tutorial](https://github.com/berlinguyinca/pycuda/blob/master/doc/source/tutorial.rst).
+{{% code-download file="/courses/python-high-performance/codes/pycuda_numba.py" lang="python" %}}
+
 ### Numba Vectorization
 
 Numba CUDA can "vectorize" a universal function (ufunc) by compiling it and running it on the GPU.  Vectorization is implemented through a decorator.
@@ -84,8 +99,7 @@ For best performance, the signature of the function arguments must be specified.
 From the Numba documentation:
 {{% code-download file="/courses/python-high-performance/codes/numba_vectorize.py" lang="python" %}}
 
-You may ignore the deprecation warning.
-The run may also emit a warning about underutilization:
+The run may emit a warning about underutilization:
 ```no-highlight
 Grid size (1) < 2 * SM count (40) will likely result in GPU under utilization due to low occupancy.
 ```