CSC456-2306F High-Performance Scientific Computing


Fall 2006

Course information for current students:

Material covered in lectures: (Page numbers refer to the notes booklet; other references are to sections of the books listed in the course outline handout.)

13-09-06 (2)
1   Introduction
1.1 Motivation for high-performance and parallel computing
1.2 Parallel architectures
    [roughly from Ortega 1.1, see also Foster 3.7.2, Zhu 1.2, Kumar 2.1, 2.3-5]
  * Vector versus parallel
  * Parallel versus distributed
  * SIMD versus MIMD
  * Shared versus local memory
    Def: contention and communication time
    Def: communication diameter
    Def: valence
  * Interconnection networks
  - Completely connected
  - Bus network, ring network
  - Mesh connection
    + 1-dim (linear array), ring
    + 2-dim, 2D torus
    + 3-dim, 3D torus
  - k-ary tree
  - Hypercube (d-dim cube)
  - Butterfly (switching network), cube connected cycles network
  - Shuffle-exchange
  - Omega network
  - Other: hybrid schemes, hierarchical schemes, clusters, etc.
  * Mappings between interconnection networks
  - equivalence between an n x log n butterfly (for normal algorithms),
    an n-leaf binary tree (for normal algorithms),
    a (log n)-dim cube and an n-processor shuffle-exchange
    [Ullman, pgs 219-221]
  - simulation of a k-dim mesh by a d-dim hypercube
    [Bertsekas, Tsitsiklis, pgs 52-54]
18-09-06 (1)
1.3 Some concepts and definitions in parallel computing
    [roughly from Ortega 1.2 (pgs 20-25), see also Zhu 1.3, Foster 3.3,
     Kumar 4.1-2]
  * degree of parallelism of a parallel algorithm 
  * granularity of a parallel algorithm
  * speedup and efficiency of a parallel algorithm on a parallel machine
  * data ready time
  * load balancing
1.4 Simple examples [roughly from Ortega 1.2 and 1.3]
  * adding two vectors of size n
  * summing up n numbers (start)
20-09-06 (2)
  * summing up n numbers (directed sum-up to processor npout or global sum-up)
  * broadcast a number
    [see also Foster 2.3.2, 2.4.1, 2.4.2]
  * inner product of two vectors
  * matrix-vector multiplication (by rows and by columns) [pg 36-38, Kumar 5.3.1]
  * all-to-all broadcast (total exchange) algorithm [Kumar 3.3]
  * global sum-up of n vectors
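
The global sum-up of n numbers can be illustrated with a sequential simulation of the fan-in (tree) pattern. This is only a sketch: the name fan_in_sum is hypothetical, each list entry stands in for one processor's partial sum, and each pass of the outer loop stands in for one parallel communication step, so about log2(n) steps are needed.

```python
def fan_in_sum(values):
    """Simulate a tree (fan-in) sum; returns (total, number of parallel steps)."""
    vals = list(values)
    if not vals:
        raise ValueError("need at least one value")
    n = len(vals)
    steps = 0
    stride = 1
    while stride < n:
        # In a real parallel run, all additions of one pass happen concurrently:
        # "processor" i receives the partial sum held by "processor" i + stride.
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        steps += 1
    return vals[0], steps


total, steps = fan_in_sum(range(1, 9))  # 8 values, 3 parallel steps
```

The result ends up on "processor" 0 only (a directed sum-up); a global sum-up would follow it with a broadcast or use a butterfly-style exchange instead.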
1.5 Performance study
    Modelling performance - computation time, communication time
    [Foster 3.3, 3.7, Kumar 2.7]
    Obtaining experimental data [Foster 3.5]
    Fitting data to models [Foster 3.5]
25-09-06 (1)
1.6 Measuring and studying speedup and efficiency
    [Ortega 1.2, pgs 25-27, Zhu 1.3, Foster 3.4]
  * speedup based on the sequential time, Amdahl's law
  * speedup based on the parallel time, Gustafson's model
  * scaled (workload) speedup, scaled memory speedup
  * ways to experimentally measure scaled speedup
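
As a rough illustration of the two speedup models, the sketch below evaluates the textbook formulas. It assumes the usual formulation, which may differ in notation from the lecture notes: for Amdahl's law, f is the serial fraction of the sequential run time; for Gustafson's model, f is the serial fraction of the parallel run time. The function names are hypothetical.

```python
def amdahl_speedup(f, p):
    """Fixed-size speedup: f = serial fraction of the sequential time."""
    return 1.0 / (f + (1.0 - f) / p)


def gustafson_speedup(f, p):
    """Scaled speedup: f = serial fraction of the parallel time."""
    return f + (1.0 - f) * p


for p in (10, 100, 1000):
    print(p, amdahl_speedup(0.1, p), gustafson_speedup(0.1, p))
```

With f = 0.1 the Amdahl speedup stays below 1/f = 10 however large p becomes, while the Gustafson speedup grows linearly in p because the problem size is assumed to grow with the machine.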
1.7 Scalability analysis [Foster 3.4, Kumar 4.4]
    Scalability with fixed problem size
27-09-06 (2)
    Scalability with scaled problem size - isoefficiency function
    Efficiency and scalability: an example and some considerations
1.8 MPI
    General
    Example 1
    Send and Receive
    Example 2
    Collective operations
02-10-06 (1)
    Timing
    Example 3

2   Solution of linear systems - Direct methods
2.0 LU factorisation and the Gauss elimination algorithm [Ortega 2.2]
  * The algorithm and its use for solving linear systems
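
For reference, a minimal sequential sketch of LU factorisation by Gauss elimination follows. It assumes no pivoting is needed (all pivots nonzero) and uses plain lists of lists; the name lu_factor is hypothetical and this is not necessarily the form used in the notes.

```python
def lu_factor(A):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular.

    Assumes every pivot U[k][k] is nonzero (no pivoting is performed).
    """
    n = len(A)
    U = [row[:] for row in A]  # working copy, reduced to upper triangular form
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):             # elimination step k
        for i in range(k + 1, n):      # rows below the pivot row
            L[i][k] = U[i][k] / U[k][k]        # multiplier
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]   # row_i -= multiplier * row_k
    return L, U


L, U = lu_factor([[4.0, 3.0], [6.0, 3.0]])
```

The updates of different rows i within one step k are independent of each other, which is what the row-assignment parallel variants exploit.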
2.1 Medium and coarse grain parallel LU factorisation algorithms [Ortega 2.2]
  * simple model, p = n, row assignment
  * simple model, p = n, column assignment
  * block storage, row assignment
04-10-06 (2)
  * wrapped interleaved storage, row assignment
  * reflection interleaved storage, row assignment
  * Notes
  - communication
  - column assignment
  - shared memory machines
  - dynamic load balancing, pool of tasks
  - send-ahead technique
  - partial pivoting, row or column assignment
2.2 Fine grain LU factorisation - Data Flow algorithm [Ortega 2.2]

11-10-06 (2)
2.3 Symmetric and symmetric positive definite matrices [Ortega 2.2]
    The LDL^T and the Cholesky factorisations
2.4 Triangular systems [Ortega 2.2]
  * ways of viewing the sequential algorithm
  * column sweep  algorithm - row    wrapped interleaved storage
  * inner product algorithm - column wrapped interleaved storage
  * send-ahead and compute-ahead
  * symmetric matrices
  * shared memory machines
2.5 Multiple right side vectors
2.6 Banded systems, sequential banded LU [Ortega 2.3]

16-10-06 (1)
    Banded systems, parallel banded LU - pivoting [Ortega 2.3]
2.7 Triangular banded systems [Ortega 2.3]
2.8 Tridiagonal systems - odd-even and cyclic reduction [Ortega 2.3, pg. 125]

18-10-06 (2)
IV  Gray codes
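
A small sketch of the binary reflected Gray code, one common choice of Gray code, is given below. Consecutive codes differ in exactly one bit, so numbering ring (or linear array) positions by their Gray codes places neighbouring positions on neighbouring hypercube nodes. Whether the notes use exactly this construction is an assumption; the function names are hypothetical.

```python
def gray(i):
    """Binary reflected Gray code of the integer i."""
    return i ^ (i >> 1)


def ring_on_hypercube(d):
    """Hypercube node labels visited by a ring of 2**d positions."""
    return [gray(i) for i in range(1 << d)]


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


nodes = ring_on_hypercube(3)  # ring of 8 positions on a 3-dim cube
```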
2.9 Narrow banded systems - Partitioning methods [Ortega 2.3, pg. 114-120]
  - Partitioning Method I
  - Partitioning Method II

II. Boundary Value Problems: a one-dimensional example [Ortega 2.3, pg. 120]
23-10-06 (1)
-.- Boundary Value Problems: a two-dimensional example [Ortega 3.1, pg. 134-135]
    [also Zhu 2, 2.1]
    discussion on assignment 1

2.10 Domain decomposition - Schur complement methods [Ortega 2.3, pg. 120-125]
     [also Saad 3.1, 3.2, Zhu 2.5.3.3]
  *  Domain decomposition in 1D - ordering - arrowhead matrix (start)
30-10-06 (1)
  *  Domain decomposition in 1D - ordering - arrowhead matrix
  *  General banded matrix - ordering - arrowhead matrix
  *  Schur complement - capacitance - Gauss transform system
  *  Solving the arrowhead system
  *  A parallel domain decomposition - Schur complement method
01-11-06 (2)
  *  Domain decomposition in 2D - ordering - arrowhead matrix

III  Inner products, vector, matrix and function norms

3    Iterative methods for the solution of linear systems
3.1  Introduction - iterative methods - stopping criteria - splittings
     [Ortega 3.1, pg. 133-134, 138-139, see also Saad 4.1]
3.2  Basic iterative methods: Jacobi, Gauss-Seidel, SOR, SSOR
     [Ortega 3.1, pg. 133-134, 3.2, pg. 156-160, see also Saad 4.1]
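
To make the simplest of these methods concrete, here is a minimal sketch of the Jacobi iteration on a small dense system. The stopping criterion (maximum change between successive iterates below tol), the zero initial guess and the name jacobi are assumptions made for the example.

```python
def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Jacobi iteration for A x = b; returns (x, iterations used)."""
    n = len(b)
    x = [0.0] * n
    for k in range(max_iter):
        # Every component uses only the previous iterate, so all n updates
        # are independent and could be computed in parallel.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter


# Strictly diagonally dominant test matrix, so Jacobi converges.
x, its = jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Gauss-Seidel differs only in using the already updated components within the same sweep, which is why it needs a reordering (such as red-black) to parallelise.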
06-11-06 (1)
3.3  Convergence of iterative linear solvers
     [Ortega 3.1, pg. 134, 3.2, pg. 157-158, see also Saad 4.2]
3.4  The Conjugate Gradient method [Ortega 3.3, see also Saad 6.7, Zhu 2.5.1]
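
The following is a minimal, unpreconditioned sketch of the Conjugate Gradient method for a symmetric positive definite matrix stored as a list of lists. The helper names, the zero initial guess and the residual-norm stopping test are assumptions made for the example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs = dot(r, r)
    for _ in range(max_iter or 10 * n):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)                        # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration costs one matrix-vector product, two inner products and three vector updates; in a parallel setting the inner products are the global sum-up operations.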
08-11-06 (2)
3.5  Preconditioning [Ortega 3.4, see also Saad 10.1-3]
  *  Incomplete Factorisation preconditioning [Ortega pg. 211-214]
  *  Block diagonal preconditioning
  *  SSOR preconditioning
3.6  The Preconditioned Conjugate Gradient method
     [Ortega 3.4, Saad 9.2, Zhu 2.5.3]
3.7  Parallel Jacobi method - application to the 2D BVP
     [Ortega 3.1, Saad 11.4-6, Zhu 2.2.1, 2.2.3.1]
3.8  Asynchronous iterative methods [Ortega 3.1, pg 138]
13-11-06 (1)
3.9  Block iterative methods - Parallel block Jacobi for the 5-pt-star matrix
     [Ortega 3.1, pg. 145-148, see also Saad 12.2]
3.10 Parallel Conjugate Gradient method - application to the 2D BVP
     [Ortega 3.3, Zhu 2.5.2]
15-11-06 (2)
3.11 The use of CG in solving the Schur complement system
     [Ortega 3.3, pg. 194-195, see also Saad 13.4, Zhu 2.5.3.3]
3.12 Parallel Gauss-Seidel and related methods - application to the 2D BVP
     [Ortega 3.2]
3.13 The red-black ordering - parallel GS, SOR and SSOR methods for the 5-pt-star matrix
     [Ortega 3.2, Saad 12.4, Zhu 2.2.2, 2.2.3.2]
22-11-06 (2)
3.14 Multicolour orderings - parallel GS, SOR and SSOR methods for the 9-pt-star matrix
     [Ortega 3.2, Saad 12.4, Zhu 2.2.3.2]
3.15 The block Gauss-Seidel and related methods for the 5-point-star matrix
     [Ortega 3.2]

4    Partial Differential Equations and more
4.1  The Multigrid Method
     Two-grid method
29-11-06 (2)
     Multigrid as preconditioning technique
     Extension and restriction operators
     V-cycle, W-cycle and full-multigrid
     Parallel computation of multigrid
     References
4.2  Fourier solvers
     The Discrete Fourier Transform and the Fast Fourier Transform Algorithms
     Solving one-dimensional BVPs using FFTs
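
As an illustration of the FFT idea, here is a minimal recursive radix-2 (Cooley-Tukey) sketch. It assumes the input length is a power of two and uses the sign convention exp(-2*pi*i*k/n) for the forward transform; the notes may use a different variant or convention, and the name fft is only for the example.

```python
import cmath


def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of the even-indexed samples
    odd = fft(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor * odd part
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


spectrum = fft([0, 1, 0, 0])
```

The two half-size transforms are independent of each other, which is the starting point for parallel FFT algorithms.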
06-12-06 (2)
-.-  Tensor products of matrices
-.-  Tensor product form of discrete 2D BVPs arising from FDMs
     Using tensor products to solve matrix problems arising from FDMs
     Solving two-dimensional BVPs using tensor products and FFTs
     Parallel computation of the FFT
     Parallel computation of FFT solvers for two-dimensional problems
     References