Course information for current students:
Material covered in lectures: (Page numbers refer to the notes booklet, while other references are sections of various books, mentioned in the course outline handout.)
13-09-06 (2)
1 Introduction
1.1 Motivation for high-performance and parallel computing
1.2 Parallel architectures
[roughly from Ortega 1.1, see also Foster 3.7.2, Zhu 1.2, Kumar 2.1, 2.3-5]
* Vector versus parallel
* Parallel versus distributed
* SIMD versus MIMD
* Shared versus local memory
Def: contention and communication time
Def: communication diameter
Def: valence
* Interconnection networks
- Completely connected
- Bus network, ring network
- Mesh connection
+ 1-dim (linear array), ring
+ 2-dim, 2D torus
+ 3-dim, 3D torus
- k-ary tree
- Hypercube (d-dim cube)
- Butterfly (switching network), cube connected cycles network
- Shuffle-exchange
- Omega network
- Other: hybrid schemes, hierarchical schemes, clusters, etc.
* Mappings between interconnection networks
- equivalence between an n x log n butterfly (for normal algorithms),
an n-leaf binary tree (for normal algorithms),
a (log n)-dim cube and an n-processor shuffle-exchange
[Ullman, pgs 219-221]
- simulation of a k-dim mesh by a d-dim hypercube
[Bertsekas, Tsitsiklis, pgs 52-54]
18-09-06 (1)
1.3 Some concepts and definitions in parallel computing
[roughly from Ortega 1.2 (pgs 20-25), see also Zhu 1.3, Foster 3.3,
Kumar 4.1-2]
* degree of parallelism of a parallel algorithm
* granularity of a parallel algorithm
* speedup and efficiency of a parallel algorithm on a parallel machine
* data ready time
* load balancing
1.4 Simple examples [roughly from Ortega 1.2 and 1.3]
* adding two vectors of size n
* summing up n numbers (start)
20-09-06 (2)
* summing up n numbers (directed sum-up to processor npout or global sum-up)
* broadcast a number
[see also Foster 2.3.2, 2.4.1, 2.4.2]
* inner product of two vectors
* matrix-vector multiplication (by rows and by columns) [pg 36-38, Kumar 5.3.1]
* all-to-all broadcast (total exchange) algorithm [Kumar 3.3]
* global sum-up of n vectors
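A minimal sketch of the directed sum-up from the examples above, assuming the usual fan-in (binary tree) scheme. It is simulated sequentially in Python rather than run on a parallel machine, and the function name fan_in_sum is illustrative, not taken from the notes. Each list entry stands for one processor's local number, and the input is assumed to be non-empty.

```python
def fan_in_sum(values):
    """Simulate a fan-in (binary tree) sum over p 'processors'.

    Each entry of `values` plays the role of one processor's local number.
    Returns (total, number_of_parallel_steps). Assumes `values` is non-empty.
    """
    partial = list(values)
    p = len(partial)
    steps = 0
    stride = 1
    while stride < p:
        # One parallel step: processor i receives the partial sum held by
        # processor i + stride and adds it to its own. All additions in
        # this inner loop are independent and could run concurrently.
        for i in range(0, p - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        steps += 1
    # The total ends up on "processor" 0 after ceil(log2 p) steps.
    return partial[0], steps


if __name__ == "__main__":
    total, steps = fan_in_sum(range(1, 9))
    print(total, steps)
```

With p = 8 the sketch needs 3 parallel steps, compared with 7 additions in sequence. A global sum-up would follow this with a broadcast of the result, or use an exchange pattern instead.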
1.5 Performance study
Modelling performance - computation time, communication time
[Foster 3.3, 3.7, Kumar 2.7]
Obtaining experimental data [Foster 3.5]
Fitting data to models [Foster 3.5]
25-09-06 (1)
1.6 Measuring and studying speedup and efficiency
[Ortega 1.2, pgs 25-27, Zhu 1.3, Foster 3.4]
* speedup based on the sequential time, Amdahl's law
* speedup based on the parallel time, Gustafson's model
* scaled (workload) speedup, scaled memory speedup
* ways to experimentally measure scaled speedup
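For reference, the standard textbook forms of the quantities listed above are sketched below. The notation is an assumption and may differ from the notes booklet: T_1 is the sequential time, T_p the time on p processors, f the serial fraction of the sequential time, and s the serial fraction of the parallel time.

```latex
% Speedup and efficiency on p processors
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}

% Amdahl's law: f is the serial fraction of the sequential time
S_p = \frac{1}{f + \dfrac{1 - f}{p}} \le \frac{1}{f}

% Gustafson's model: s is the serial fraction of the parallel time
S_p = s + (1 - s)\,p = p - s\,(p - 1)
```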
1.7 Scalability analysis [Foster 3.4, Kumar 4.4]
Scalability with fixed problem size
27-09-06 (2)
Scalability with scaled problem size - isoefficiency function
Efficiency and scalability: an example and some considerations
1.8 MPI
General
Example 1
Send and Receive
Example 2
Collective operations
02-10-06 (1)
Timing
Example 3
2 Solution of linear systems - Direct methods
2.0 LU factorisation and the Gauss elimination algorithm [Ortega 2.2]
* The algorithm and its use for solving linear systems
2.1 Medium and coarse grain parallel LU factorisation algorithms [Ortega 2.2]
* simple model, p = n, row assignment
* simple model, p = n, column assignment
* block storage, row assignment
04-10-06 (2)
* wrapped interleaved storage, row assignment
* reflection interleaved storage, row assignment
* Notes
- communication
- column assignment
- shared memory machines
- dynamic load balancing, pool of tasks
- send-ahead technique
- partial pivoting, row or column assignment
2.2 Fine grain LU factorisation - Data Flow algorithm [Ortega 2.2]
11-10-06 (2)
2.3 Symmetric and symmetric positive definite matrices [Ortega 2.2]
The LDL^T and the Cholesky factorisations
2.4 Triangular systems [Ortega 2.2]
* ways of viewing the sequential algorithm
* column sweep algorithm - row wrapped interleaved storage
* inner product algorithm - column wrapped interleaved storage
* send-ahead and compute-ahead
* symmetric matrices
* shared memory machines
2.5 Multiple right side vectors
2.6 Banded systems, sequential banded LU [Ortega 2.3]
16-10-06 (1)
Banded systems, parallel banded LU - pivoting [Ortega 2.3]
2.7 Triangular banded systems [Ortega 2.3]
2.8 Tridiagonal systems - odd-even and cyclic reduction [Ortega 2.3, pg. 125]
18-10-06 (2)
IV Gray codes
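A minimal sketch, assuming the binary reflected Gray code and its common use for embedding a ring (or linear array) in a d-dim hypercube. The function names are illustrative and not taken from the notes.

```python
def gray(i):
    """Return the i-th binary reflected Gray code word as an integer."""
    return i ^ (i >> 1)


def ring_in_hypercube(d):
    """Map ring positions 0 .. 2**d - 1 to node labels of a d-dim hypercube.

    Consecutive ring positions (including the wrap-around) get labels that
    differ in exactly one bit, so they are neighbours in the hypercube.
    """
    return [gray(i) for i in range(2 ** d)]


if __name__ == "__main__":
    for position, node in enumerate(ring_in_hypercube(3)):
        print(position, format(node, "03b"))
```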
2.9 Narrow banded systems - Partitioning methods [Ortega 2.3, pg. 114-120]
- Partitioning Method I
- Partitioning Method II
II. Boundary Value Problems: a one-dimensional example [Ortega 2.3, pg. 120]
23-10-06 (1)
-.- Boundary Value Problems: a two-dimensional example [Ortega 3.1, pg. 134-135]
[also Zhu 2, 2.1]
discussion on assignment 1
2.10 Domain decomposition - Schur complement methods [Ortega 2.3, pg. 120-125]
[also Saad 3.1, 3.2, Zhu 2.5.3.3]
* Domain decomposition in 1D - ordering - arrowhead matrix (start)
30-10-06 (1)
* Domain decomposition in 1D - ordering - arrowhead matrix
* General banded matrix - ordering - arrowhead matrix
* Schur complement - capacitance - Gauss transform system
* Solving the arrowhead system
* A parallel domain decomposition - Schur complement method
1-11-06 (2)
* Domain decomposition in 2D - ordering - arrowhead matrix
III Inner products, vector, matrix and function norms
3 Iterative methods for the solution of linear systems
3.1 Introduction - iterative methods - stopping criteria - splittings
[Ortega 3.1, pg. 133-134, 138-139, see also Saad 4.1]
3.2 Basic iterative methods: Jacobi, Gauss-Seidel, SOR, SSOR
[Ortega 3.1, pg. 133-134, 3.2, pg. 156-160, see also Saad 4.1]
6-11-06 (1)
3.3 Convergence of iterative linear solvers
[Ortega 3.1, pg. 134, 3.2, pg. 157-158, see also Saad 4.2]
3.4 The Conjugate Gradient method [Ortega 3.3, see also Saad 6.7, Zhu 2.5.1]
8-11-06 (2)
3.5 Preconditioning [Ortega 3.4, see also Saad 10.1-3]
* Incomplete Factorisation preconditioning [Ortega pg. 211-214]
* Block diagonal preconditioning
* SSOR preconditioning
3.6 The Preconditioned Conjugate Gradient method
[Ortega 3.4, Saad 9.2, Zhu 2.5.3]
3.7 Parallel Jacobi method - application to the 2D BVP
[Ortega 3.1, Saad 11.4-6, Zhu 2.2.1, 2.2.3.1]
3.8 Asynchronous iterative methods [Ortega 3.1, pg 138]
13-11-06 (1)
3.9 Block iterative methods - Parallel block Jacobi for the 5-pt-star matrix
[Ortega 3.1, pg. 145-148, see also Saad 12.2]
3.10 Parallel Conjugate Gradient method - application to the 2D BVP
[Ortega 3.3, Zhu 2.5.2]
15-11-06 (2)
3.11 The use of CG in solving the Schur complement system
[Ortega 3.3, pg. 194-195, see also Saad 13.4, Zhu 2.5.3.3]
3.12 Parallel Gauss-Seidel and related methods - application to the 2D BVP
[Ortega 3.2]
3.13 The red-black ordering - // GS, SOR and SSOR methods for the 5-pt-star matrix
[Ortega 3.2, Saad 12.4, Zhu 2.2.2, 2.2.3.2]
22-11-06 (2)
3.14 Multicolour orderings - // GS, SOR and SSOR methods for the 9-pt-star matrix
[Ortega 3.2, Saad 12.4, Zhu 2.2.3.2]
3.15 The block Gauss-Seidel and related methods for the 5-point-star matrix
[Ortega 3.2]
4 Partial Differential Equations and more
4.1 The Multigrid Method
Two-grid method
29-11-06 (2)
Multigrid as preconditioning technique
Extension and restriction operators
V-cycle, W-cycle and full-multigrid
Parallel computation of multigrid
References
4.2 Fourier solvers
The Discrete Fourier Transform and the Fast Fourier Transform Algorithms
Solving one-dimensional BVPs using FFTs
6-12-06 (2)
-.- Tensor products of matrices
-.- Tensor product form of discrete 2D BVPs arising from FDMs
Using tensor products to solve matrix problems arising from FDMs
Solving two-dimensional BVPs using tensor products and FFTs
Parallel computation of the FFT
Parallel computation of FFT solvers for two-dimensional problems
References