CSC456-2306F High-Performance Scientific Computing


Course information for current students:

Material covered in the course. (Section numbers correspond to the course notes; bracketed references are to sections of the various books listed in the course outline handout.)

8-9-2022 (1 hr)
1   Introduction
1.1 Motivation for high-performance and parallel computing
1.2 Parallel architectures
    [roughly from Ortega 1.1, see also Foster 3.7.2, Zhu 1.2, 2.3-5]
  * Vector versus parallel
  * Parallel versus distributed
  * SIMD versus MIMD
13-9-2022 (2 hrs)
  * Shared versus local memory
    Def: contention and communication time
    Def: communication diameter
    Def: valence
  * Interconnection networks
  - Completely connected
  - Bus network, ring network
  - Mesh connection
    + 1-dim (linear array), ring
    + 2-dim, 2D torus
    + 3-dim, 3D torus
  - k-ary tree
  - Hypercube (d-dim cube)
  - Butterfly (switching network), cube connected cycles network
  - Shuffle-exchange
  - Omega network
  - Other: hybrid schemes, hierarchical schemes, clusters, etc.
  * Mappings between interconnection networks
  - equivalence between an n x log n butterfly (for normal algorithms),
    an n-leaf binary tree (for normal algorithms),
    a (log n)-dim cube and an n-processor shuffle-exchange
    [Ullman, pgs 219-221]
  - simulation of a k-dim mesh by a d-dim hypercube
    [Bertsekas, Tsitsiklis, pgs 52-54, Kumar 2.7]
1.3 Some concepts and definitions in parallel computing
    [roughly from Ortega 1.2 (pgs 20-25), see also Zhu 1.3, Foster 3.3,
     Kumar 3.1, 5.1-3]
  * degree of parallelism of a parallel algorithm
  * granularity of a parallel algorithm
  * speedup and efficiency of a parallel algorithm on a parallel machine
  * data ready time
  * load balancing
1.4 Simple examples [roughly from Ortega 1.2 and 1.3]
  * adding two vectors of size n
  * summing up n numbers (directed sum-up to a designated processor npout, or global sum-up)
  * broadcast a number
    [see also Foster 2.3.2, 2.4.1, 2.4.2]
  * inner product of two vectors
15-9-2022 (1 hr)
  * matrix-vector multiplication (by rows and by columns) [Ortega pg. 36-38, Kumar Ex 2.5, Ex 3.1]
  * all-to-all broadcast (total exchange) algorithm [Kumar 6.6]
  * global sum-up of n vectors
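
A minimal sketch of the global sum-up by recursive doubling on p = 2^d processors, written with MPI (introduced in 1.8 below); the code is illustrative, not a course handout. Each process exchanges with the neighbour across one hypercube dimension per step, so after log2(p) steps every process holds the full sum (an all-reduce pattern).

    /* Global sum-up of one value per process by recursive doubling,
       assuming p is a power of two. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double sum = (double) rank;        /* local contribution */
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;     /* neighbour across one cube dimension */
            double recv;
            MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                         &recv, 1, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += recv;
        }
        printf("process %d: global sum = %g\n", rank, sum);
        MPI_Finalize();
        return 0;
    }
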
20-9-2022 (2 hrs)
1.8 MPI
    General
    Example 1 (test0c.c)
    Send and Receive
    Example 2 (test1c.c)
    Collective operations
    Timing in MPI
    Example 3 (test3c.c)
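
test0c.c, test1c.c and test3c.c are course handouts and are not reproduced here; the following hypothetical stand-in shows the same ingredients in one place: initialisation, a collective operation, and timing with MPI_Wtime.

    /* A small stand-in for the course's MPI examples: every process
       contributes a partial value, MPI_Reduce combines them on rank 0,
       and MPI_Wtime brackets the timed region. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double local = 1.0 / p, total = 0.0;

        MPI_Barrier(MPI_COMM_WORLD);        /* synchronise before timing */
        double t0 = MPI_Wtime();
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("sum = %g, time = %g s\n", total, t1 - t0);
        MPI_Finalize();
        return 0;
    }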

1.5 Performance study
    Modelling performance - computation time, communication time
    [Foster 3.3, 3.7, Kumar 2.5]
    Obtaining experimental data [Foster 3.5]
    Fitting data to models [Foster 3.5]
1.6 Measuring and studying speedup and efficiency
    [Ortega 1.2, pgs 25-27, Zhu 1.3, Foster 3.4]
  * speedup based on the sequential time, Amdahl's law
  * speedup based on the parallel time, Gustafson's model
  * scaled (workload) speedup, scaled memory speedup
  * ways to experimentally measure scaled speedup
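
For reference, the two basic speedup models in compact form (a sketch in the usual notation: T_1 the sequential time, T_p the parallel time on p processors):

    % Amdahl's law (fixed problem size): f = sequential fraction of the work.
    S_p = \frac{T_1}{T_p} \le \frac{1}{f + (1-f)/p}
          \longrightarrow \frac{1}{f} \quad (p \to \infty)

    % Gustafson's model (workload scaled with p): f' = sequential
    % fraction of the parallel run.
    S_p = f' + p\,(1 - f') = p - f'\,(p - 1)
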
1.7 Scalability analysis [Foster 3.4, Kumar 4.4]
22-9-2022 (1 hr)
    Scalability with fixed problem size
    Scalability with scaled problem size - isoefficiency function
    Efficiency and scalability: an example and some considerations
1.9 GPU computing
    History
    GPU architecture, CUDA API, limits, CUDA C (start)
27-9-2022 (2 hrs)
    GPU architecture, CUDA API, limits, CUDA C
    Example 1 - Example with shared memory, Dynamic allocation
    Timing in CUDA
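
The CUDA examples above are course handouts; the sketch below is a hypothetical stand-in showing the pieces just named: a kernel launched over a 1D grid and timing with CUDA events (unified memory is used for brevity and is not taken from the course examples).

    /* A minimal CUDA C sketch: a vector-add kernel timed with events. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void vadd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);   /* unified memory for brevity */
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);  /* one thread per element */
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("c[0] = %g, kernel time = %g ms\n", c[0], ms);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }
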

2   Solution of linear systems - Direct methods
2.0 LU factorisation and the Gaussian elimination algorithm [Ortega 2.2]
  * The algorithm and its use for solving linear systems
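
As a baseline for the parallel variants in 2.1, a sketch of the sequential algorithm (in-place, no pivoting); the kij loop ordering shown is one of several equivalent forms, and it is the i (or j) loop that the row (or column) assignments below distribute over processors.

    /* In-place LU factorisation without pivoting (kij form of Gaussian
       elimination): on exit, U occupies the upper triangle of a and
       the multipliers (strict lower triangle of L) sit below it. */
    void lu_nopivot(int n, double a[n][n]) {
        for (int k = 0; k < n - 1; k++)          /* elimination stage k */
            for (int i = k + 1; i < n; i++) {
                a[i][k] /= a[k][k];              /* multiplier l_ik */
                for (int j = k + 1; j < n; j++)
                    a[i][j] -= a[i][k] * a[k][j];
            }
    }
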
2.1 Medium and coarse grain parallel LU factorisation algorithms [Ortega 2.2]
  * simple model, p = n, row assignment
  * simple model, p = n, column assignment
  * block storage, row assignment
  * wrapped interleaved storage, row assignment
  * reflection interleaved storage, row assignment
  * Notes
  - communication
  - column assignment
  - shared memory machines
  - dynamic load balancing, pool of tasks
  - send-ahead technique
  - partial pivoting, row or column assignment
29-9-2022 (1 hr)
2.2 Fine grain LU factorisation - Data Flow algorithm [Ortega 2.2]
2.3 Symmetric and symmetric positive definite matrices [Ortega 2.2]
    The LDL^T and the Cholesky factorisations
4-10-2022 (2 hrs)
2.4 Triangular systems [Ortega 2.2]
  * ways of viewing the sequential algorithm
  * column sweep  algorithm - row    wrapped interleaved storage
  * inner product algorithm - column wrapped interleaved storage
  * send-ahead and compute-ahead
  * symmetric matrices
  * shared memory machines
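
The two sequential views just listed, as a sketch for Lx = b with L lower triangular (x overwrites b; names illustrative): the row-oriented form computes an inner product per unknown, while the column-oriented form sweeps each computed x_j out of the remaining equations. The parallel algorithms above distribute exactly these loops.

    /* Row-oriented ("inner product") form: x_i = (b_i - sum) / l_ii. */
    void trisolve_rows(int n, double l[n][n], double x[n]) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < i; j++)
                x[i] -= l[i][j] * x[j];
            x[i] /= l[i][i];
        }
    }

    /* Column-oriented ("column sweep") form: once x_j is known, its
       contribution is removed from all remaining right-hand sides. */
    void trisolve_cols(int n, double l[n][n], double x[n]) {
        for (int j = 0; j < n; j++) {
            x[j] /= l[j][j];
            for (int i = j + 1; i < n; i++)
                x[i] -= l[i][j] * x[j];
        }
    }
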
2.5 Multiple right-hand-side vectors
2.6 Banded systems, sequential banded LU [Ortega 2.3]
    Banded systems, parallel banded LU - pivoting [Ortega 2.3]
2.7 Triangular banded systems [Ortega 2.3]
2.8 Tridiagonal systems - odd-even and cyclic reduction [Ortega 2.3, pg. 125]
6-10-2022 (1 hr)
2.8 Tridiagonal systems - odd-even and cyclic reduction (end)
    [Ortega 2.3, pg. 125]
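
The elimination step of odd-even (cyclic) reduction, as a worked equation in the usual tridiagonal notation a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i:

    % With \alpha_i = a_i / b_{i-1} and \gamma_i = c_i / b_{i+1},
    % subtracting \alpha_i (row i-1) and \gamma_i (row i+1) from row i
    % eliminates x_{i-1}, x_{i+1} and couples x_i to x_{i-2}, x_{i+2}:
    \begin{aligned}
    a_i' &= -\alpha_i\, a_{i-1}, &
    b_i' &= b_i - \alpha_i c_{i-1} - \gamma_i a_{i+1},\\
    c_i' &= -\gamma_i\, c_{i+1}, &
    d_i' &= d_i - \alpha_i d_{i-1} - \gamma_i d_{i+1}.
    \end{aligned}
    % All even i can be processed in parallel; log2(n) such steps
    % reduce the system to a single equation.
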
2.9 Narrow banded systems - Partitioning methods [Ortega 2.3, pg. 114-120]
  - Partitioning Method I
11-10-2022 (2 hrs)
  - Partitioning Method I (end)
  - Partitioning Method II

II. Boundary Value Problems: a one-dimensional example [Ortega 2.3, pg. 120]
-.- Boundary Value Problems: a two-dimensional example [Ortega 3.1, pg. 134-135]

13-10-2022 (1 hr)
2.10 Domain decomposition - Schur complement methods [Ortega 2.2, pg. 120-125]
     [also Saad 3.1, 3.2, Zhu 2.5.3.3]
  *  Domain decomposition in 1D - ordering - arrowhead matrix
  *  General banded matrix - ordering - arrowhead matrix
  *  Schur complement - capacitance - Gauss transform system
  *  Solving the arrowhead system
  *  A parallel domain decomposition - Schur complement method
  *  Domain decomposition in 2D - ordering - arrowhead matrix
     size of reduced system, bandwidth of blocks
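
The arrowhead system and its Schur complement in block form, for the two-subdomain case above (interior unknowns x_1, x_2 ordered first, separator unknowns x_s last):

    \begin{pmatrix} A_1 & 0 & B_1 \\ 0 & A_2 & B_2 \\ C_1 & C_2 & A_s \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_s \end{pmatrix}
    =
    \begin{pmatrix} f_1 \\ f_2 \\ f_s \end{pmatrix},
    % and block elimination of x_1, x_2 (independent, hence parallel)
    % leaves the Schur complement (capacitance) system for x_s:
    \left( A_s - C_1 A_1^{-1} B_1 - C_2 A_2^{-1} B_2 \right) x_s
      = f_s - C_1 A_1^{-1} f_1 - C_2 A_2^{-1} f_2 .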

18-10-2022 (2 hrs)
  *  Domain decomposition in 2D - ordering - arrowhead matrix
     size of reduced system, bandwidth of blocks (end)
III  Inner products, vector, matrix and function norms
     condition number of a matrix

3    Iterative methods for the solution of linear systems
3.1  Introduction - iterative methods - stopping criteria - splittings
     [Ortega 3.1, pg. 133-134, 138-139, see also Saad 4.1]
20-10-2022 (1 hr)
Discussion on A1
25-10-2022 (2 hrs)
Midterm test
27-10-2022 (1 hr)
Discussion on midterm
3.2  Basic iterative methods: Jacobi, Gauss-Seidel, SOR, SSOR
     [Ortega 3.1, pg. 133-134, 3.2, pg. 156-160, see also Saad 4.1]
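
A sketch of one Jacobi sweep for the 5-point-star discretisation of the 2D model BVP (array layout and names are illustrative): every new value uses only old values, which is what makes the method naturally parallel (see 3.7).

    /* One Jacobi sweep for -u_xx - u_yy = f on a uniform n x n grid:
       u, unew, f are (n+2)^2 arrays with a boundary layer, h the
       mesh width.  All updates are independent. */
    void jacobi_sweep(int n, double h,
                      const double *u, double *unew, const double *f) {
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                int k = i * (n + 2) + j;
                unew[k] = 0.25 * (u[k - 1] + u[k + 1]
                                + u[k - (n + 2)] + u[k + (n + 2)]
                                + h * h * f[k]);
            }
    }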
3.3  Convergence of iterative linear solvers
1-11-2022 (2 hrs)
3.3  Convergence of iterative linear solvers (end)
     [Ortega 3.1, pg. 134, 3.2, pg. 157-158, see also Saad 4.2]
3.4  The Conjugate Gradient method [Ortega 3.3, see also Saad 6.7, Zhu 2.5.1]
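
A hedged sequential sketch of the method (dense storage, no preconditioning, names illustrative); the matrix-vector product, inner products and vector updates visible here are the kernels that 3.10-3.11 parallelise.

    #include <math.h>

    static double dot(int n, const double *x, const double *y) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }

    /* CG for SPD A (n x n, row-major); r, p, q are length-n work vectors. */
    void cg(int n, const double *A, const double *b, double *x,
            double *r, double *p, double *q, double tol, int maxit) {
        for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = p[i] = b[i]; }
        double rho = dot(n, r, r);
        for (int it = 0; it < maxit && sqrt(rho) > tol; it++) {
            for (int i = 0; i < n; i++) {        /* q = A p (matvec) */
                double s = 0.0;
                for (int j = 0; j < n; j++) s += A[i * n + j] * p[j];
                q[i] = s;
            }
            double alpha = rho / dot(n, p, q);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            double rho_new = dot(n, r, r);
            double beta = rho_new / rho;
            rho = rho_new;
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        }
    }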
3.5  Preconditioning [Ortega 3.4, see also Saad 10.1-3]
  *  Incomplete Factorisation preconditioning [Ortega pg. 211-214]
3-11-2022 (1 hr)
  *  Block diagonal preconditioning
  *  SSOR preconditioning
3.6  The Preconditioned Conjugate Gradient method
     [Ortega 3.4, Saad 9.2, Zhu 2.5.3]
3.7  Parallel Jacobi method - application to the 2D BVP
     [Ortega 3.1, Saad 11.4-6, Zhu 2.2.1, 2.2.3.1]
Fall break (reading week)
15-11-2022 (2 hrs)
3.8  Asynchronous iterative methods [Ortega 3.1, pg 138]
3.9  Block iterative methods - Parallel block Jacobi for the 5-pt-star matrix
     [Ortega 3.1, pg. 145-148, see also Saad 12.2]
3.10 Parallel Conjugate Gradient method - application to the 2D BVP
     [Ortega 3.3, Zhu 2.5.2]
3.11 The use of CG in solving the Schur complement system
     [Ortega 3.3, pg. 194-195, see also Saad 13.4, Zhu 2.5.3.3]
3.12 Parallel Gauss-Seidel and related methods - application to the 2D BVP
     [Ortega 3.2]
17-11-2022 (2 hrs)
3.13 The red-black ordering - parallel GS, SOR and SSOR methods for the 5-pt-star matrix
     [Ortega 3.2, Saad 12.4, Zhu 2.2.2, 2.2.3.2]
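
A sketch of the red-black Gauss-Seidel sweep (same illustrative grid layout as the Jacobi sketch in 3.2): grid points are coloured by the parity of i + j, and since same-colour points are never neighbours in the 5-point stencil, each half-sweep updates its colour fully in parallel.

    /* Red-black Gauss-Seidel sweep: colour 0 updates the points with
       (i + j) odd, colour 1 those with (i + j) even; within a colour
       all updates are independent. */
    void rbgs_sweep(int n, double h, double *u, const double *f) {
        for (int colour = 0; colour <= 1; colour++)
            for (int i = 1; i <= n; i++)
                for (int j = 1 + (i + colour) % 2; j <= n; j += 2) {
                    int k = i * (n + 2) + j;
                    u[k] = 0.25 * (u[k - 1] + u[k + 1]
                                 + u[k - (n + 2)] + u[k + (n + 2)]
                                 + h * h * f[k]);
                }
    }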
22-11-2022 (2 hrs)
3.14 Multicolour orderings - parallel GS, SOR and SSOR methods for the 9-pt-star matrix
     [Ortega 3.2, Saad 12.4, Zhu 2.2.3.2]
3.15 The block Gauss-Seidel and related methods for the 5-point-star matrix
     [Ortega 3.2]

IV. Tensor product of matrices
-.- Tensor product form of discrete 2D BVPs arising from FDMs
24-11-2022 (1 hr)
-.- Diagonalization and block-diagonalization of matrices

4.2  Fourier solvers
     The DFT matrix
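
For reference, the DFT matrix in standard notation (a sketch; the notes' scaling conventions may differ):

    % The n x n DFT matrix, with \omega_n = e^{-2\pi i / n}:
    F_n = \left( \omega_n^{jk} \right)_{j,k=0}^{n-1},
    \qquad
    (F_n x)_j = \sum_{k=0}^{n-1} \omega_n^{jk} x_k ,
    % F_n is unitary up to scaling: F_n^{-1} = \frac{1}{n} \overline{F_n}.
    % The FFT evaluates F_n x in O(n log n) operations by splitting x
    % into its even- and odd-indexed parts (n a power of two).
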
29-11-2022 (2 hrs)
     The Discrete Fourier Transform and the Fast Fourier Transform Algorithms
     Solving one-dimensional BVPs using FFTs
     Solving two-dimensional BVPs using tensor products and FFTs
1-12-2022 (1 hr)
     Parallel computation of the FFT
     Parallel computation of FFT solvers for two-dimensional problems
4.4  Parallel data (block matrix) transposition
6-12-2022 (1.5 hrs)
4.4  Parallel data (block matrix) transposition (end)
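
A sketch of the standard approach using MPI_Alltoall, assuming process r holds block row r of a (p*b) x (p*b) matrix partitioned into b x b blocks (names illustrative): the collective delivers to process r the blocks A(s, r) for all s, and a local transpose of each received block leaves process r holding block row r of A^T.

    #include <mpi.h>

    /* myrow: my p send blocks, block s at offset s*b*b, row-major.
       mycol: receives p blocks; after the local per-block transpose
       it is block row r of A^T. */
    void block_transpose(int p, int b, const double *myrow, double *mycol) {
        MPI_Alltoall(myrow, b * b, MPI_DOUBLE,
                     mycol, b * b, MPI_DOUBLE, MPI_COMM_WORLD);
        for (int s = 0; s < p; s++) {      /* transpose each b x b block */
            double *blk = mycol + (size_t) s * b * b;
            for (int i = 0; i < b; i++)
                for (int j = i + 1; j < b; j++) {
                    double t = blk[i * b + j];
                    blk[i * b + j] = blk[j * b + i];
                    blk[j * b + i] = t;
                }
        }
    }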

V.   Gray codes
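
The binary-reflected Gray code in a few lines of C (a sketch; cf. the mesh-in-hypercube embeddings of 1.2): consecutive codewords differ in exactly one bit, so ring or mesh neighbours map to hypercube neighbours.

    /* Binary-reflected Gray code of i, and its inverse by prefix-XOR. */
    unsigned gray(unsigned i)     { return i ^ (i >> 1); }

    unsigned gray_inv(unsigned g) {
        for (unsigned s = 1; s < 8 * sizeof g; s <<= 1)
            g ^= g >> s;
        return g;
    }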

Summary

Notes and handouts: Course information, Course outline

Access to the material below requires that you type in your CDF (teaching lab) username (as login) and the last 5 digits of your student number (as password). This password (for accessing the website) cannot be changed.

Lecture notes

Assignments