CSC456-2306F High-Performance Scientific Computing

Fall 2022 Bulletin board for csc456-2306 fall 2022 -- course outline -- MarkUs

Course information for current students:

2022-12-02 Office hours for A3: Tuesday 6-12-2022, 4-5 PM; Wednesday 7-12-2022, 3-4 PM (note the change: no regular office hour that day, as I need to leave at 4); Thursday 8-12-2022, 2-3 PM.
2022-11-27: A2 is marked, and available on MarkUs. Please come to class, as some details discussed may be relevant to A3 as well.
2022-11-21: Assignment 3 is posted.
2022-11-12: Assignment 2 has been updated. In Q4, m = 2n-1. Also choose the domain of the problem as in Q3. The deadline has also been updated to Wed Nov 16.
Extra office hours Tuesday 15 Nov 2022, 3:30-4:30 PM.
2022-11-10: Assignment 2 has been updated. Please reload/refresh. The main change is the formula for lambda_1 in Q4. (In the correct formula, the sines are raised to the power of 2.) Also, the use of l's instead of lambdas has been corrected in Q4; in the correct version, only lambdas show up. Finally, in Q1, you can assume p divides n.
2022-11-07: Office hours for the reading week: Wednesday 2:30-3:30, Thursday 2-3, Friday 1-2.
2022-10-26 The midterm has been graded and marks are available on the cdf student secure login site. Note: Midterm marks are NOT available on MarkUs.
2022-10-18 The midterm is Tue 25 Oct 2022, 1-3 PM. Please sit sparsely.
The material for the midterm includes the introduction (1=intro) and the first part of direct solvers (2a=direct). You can consider these practice questions. Furthermore, Q1 and Q2 from A1, and Q1 from A2, are appropriate questions for an exam.
Extra office hours: Thursday 3:30-4:30, Friday 11:15-12:15, Monday 5-6 (besides the regular Wednesday 3:30-4:30). If none of these times suits you, send me an e-mail.
2022-10-18 Assignment 2 is posted.
2022-10-05 Clarifications on A1.
2022-10-04 Extra office hours: Friday 7 Oct 2022, 11:15-12:15 (besides the regular hour, Wed 3:30-4:30). If you want other times, send me an e-mail, and we will arrange one.
2022-09-21 When you run an MPI program, give the full path name of the binary file. For example, if the directory containing your files is ~/456/mpi/ and the binary executable is test0c, use
% mpirun -np 2 -machinefile m ~/456/mpi/test0c
and NOT
% mpirun -np 2 -machinefile m test0c
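If you prefer not to type the path by hand, the shell can build it for you. The following is only a sketch, assuming a Bourne-style shell (sh or bash) and that you have already changed to the directory holding test0c and the machinefile m. The helper name full_path and the variable BIN are illustrative and are not part of the course files.

```shell
# Hypothetical helper: turn a file name into a full (absolute) path name.
full_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;            # already a full path, keep it
        *)  printf '%s/%s\n' "$PWD" "$1" ;;  # prefix the current directory
    esac
}

# Assumes the current directory is the one with your files, e.g. ~/456/mpi
BIN=$(full_path test0c)

# Print the command first, to check that the path starts with /
echo "mpirun -np 2 -machinefile m $BIN"

# When the printed command looks right, run it:
# mpirun -np 2 -machinefile m "$BIN"
```

The echo line lets you verify the path before launching anything on the workstations listed in m.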
2022-09-16 Assignment 1 is posted. Q1 can be done right now; the material for Q3 will be presented this Tuesday, and the material for Q2 a little later.
2022-09-16 My regular office hours will be Wednesday 3:45-4:45, BA 4226, but more office hours will be posted depending on assignments and tests. I am also available at other times by prior arrangement, and through Zoom.
2022-09-16 The course outline has been updated, so please re-load. See the new deadline for A1.
The first meeting of the course is Thursday, September 8, 2022, 2-3 PM.
No classes during the week of 7-11 November 2022 (Fall break / Reading week -- A&S calendar).
I expect that the Thursday slots (tutorials), 2-3 PM, will be used for lectures. We may cancel some tutorials at the end of the course if it looks like we have finished the course material.
Bulletin board for CSC456-2306 Fall 2022
Important note on the use of bulletin boards: No parts of or whole answers to the assignment/exam problems should be posted to the boards (or anywhere else), even after the assignment is due. Any violation of this rule will bring trouble to the poster.
Please use judgement before posting.
All assignments and exams are to be done individually by each student. See the course outline about academic integrity and additional information.
Here is a LaTeX example file, with the associated files spyalt.eps, spyblock.eps, trochoid1.m, assign.bib that it needs to compile correctly, and the output 456a1.pdf. You can use it for assignments, but you may also use your own LaTeX template. Please always use font size 12 and linespread 1.1, as shown in the sample file.
CDF/Teaching Labs: Please see http://www.cdf.toronto.edu or http://www.teach.cs.toronto.edu (they are the same), especially the Resources, Intro to new students tab, and the Using Labs, Remote Access Server tab.
To access parts of the course website (e.g. notes, etc), you need to use your *CDF* (teaching lab) username (same as UTorId) and the last 5 digits of your student number. This password (for accessing the website) cannot be changed.
Accounts will be created on CDF. You will need these accounts to run MPI programs on various workstations. See 456accts for CDF server and workstation names.
A brief introduction to MATLAB, Christina C. Christara and Winky Wai
(for those who do not know MATLAB, though MATLAB is not essential)
Fluency in C or C++ or Fortran is essential.

Material covered in the course (The numbering corresponds to the notes; the other references are to sections of various books listed in the course outline handout.)

8-9-2022 (1 hr)
1   Introduction
1.1 Motivation for high-performance and parallel computing
1.2 Parallel architectures
    [roughly from Ortega 1.1, see also Foster 3.7.2, Zhu 1.2, 2.3-5]
  * Vector versus parallel
  * Parallel versus distributed
  * SIMD versus MIMD
13-9-2022 (2 hrs)
  * Shared versus local memory
    Def: contention and communication time
    Def: communication diameter
    Def: valence
  * Interconnection networks
  - Completely connected
  - Bus network, ring network
  - Mesh connection
    + 1-dim (linear array), ring
    + 2-dim, 2D torus
    + 3-dim, 3D torus
  - k-ary tree
  - Hypercube (d-dim cube)
  - Butterfly (switching network), cube connected cycles network
  - Shuffle-exchange
  - Omega network
  - Other: hybrid schemes, hierarchical schemes, clusters, etc
  * Mappings between interconnection networks
  - equivalence between an n x log n butterfly (for normal algorithms),
    an n-leaf binary tree (for normal algorithms),
    a (log n)-dim cube and an n-processor shuffle-exchange
    [Ullman, pgs 219-221]
  - simulation of a k-dim mesh by a d-dim hypercube
    [Bertsekas, Tsitsiklis, pgs 52-54, Kumar 2.7]
1.3 Some concepts and definitions in parallel computing
    [roughly from Ortega 1.2 (pgs 20-25), see also Zhu 1.3, Foster 3.3,
     Kumar 3.1, 5.1-3]
  * degree of parallelism of a parallel algorithm
  * granularity of a parallel algorithm
  * speedup and efficiency of a parallel algorithm on a parallel machine
  * data ready time
  * load balancing
1.4 Simple examples [roughly from Ortega 1.2 and 1.3]
  * adding two vectors of size n
  * summing up n numbers (directed sum-up to processor npout or global sum-up)
  * broadcast a number
    [see also Foster 2.3.2, 2.4.1, 2.4.2]
  * inner product of two vectors
15-9-2022 (1 hr)
  * matrix-vector multiplication (by rows and by columns) [pg 36-38, Kumar Ex 2.5, Ex 3.1]
  * all-to-all broadcast (total exchange) algorithm [Kumar 6.6]
  * global sum-up of n vectors
20-9-2022 (2 hrs)
1.8 MPI
    General
    Example 1 (test0c.c)
    Send and Receive
    Example 2 (test1c.c)
    Collective operations
    Timing in MPI
    Example 3 (test3c.c)

1.5 Performance study
    Modelling performance - computation time, communication time
    [Foster 3.3, 3.7, Kumar 2.5]
    Obtaining experimental data [Foster 3.5]
    Fitting data to models [Foster 3.5]
1.6 Measuring and studying speedup and efficiency
    [Ortega 1.2, pgs 25-27, Zhu 1.3, Foster 3.4]
  * speedup based on the sequential time, Amdahl's law
  * speedup based on the parallel time, Gustafson's model
  * scaled (workload) speedup, scaled memory speedup
  * ways to experimentally measure scaled speedup
1.7 Scalability analysis [Foster 3.4, Kumar 4.4]
22-9-2022 (1 hr)
    Scalability with fixed problem size
    Scalability with scaled problem size - isoefficiency function
    Efficiency and scalability: an example and some considerations
1.9 GPU computing
    History
    GPU architecture, CUDA API, limits, CUDA C (start)
27-9-2022 (2 hrs)
    GPU architecture, CUDA API, limits, CUDA C
    Example 1 - Example with shared memory, Dynamic allocation
    Timing in CUDA

2   Solution of linear systems - Direct methods
2.0 LU factorisation and the Gauss elimination algorithm [Ortega 2.2]
  * The algorithm and its use for solving linear systems
2.1 Medium and coarse grain parallel LU factorisation algorithms [Ortega 2.2]
  * simple model, p = n, row assignment
  * simple model, p = n, column assignment
  * block storage, row assignment
  * wrapped interleaved storage, row assignment
  * reflection interleaved storage, row assignment
  * Notes
  - communication
  - column assignment
  - shared memory machines
  - dynamic load balancing, pool of tasks
  - send-ahead technique
  - partial pivoting, row or column assignment
29-9-2022 (1 hr)
2.2 Fine grain LU factorisation - Data Flow algorithm [Ortega 2.2]
2.3 Symmetric and symmetric positive definite matrices [Ortega 2.2]
    The LDL^T and the Cholesky factorisations
4-10-2022 (2 hrs)
2.4 Triangular systems [Ortega 2.2]
  * ways of viewing the sequential algorithm
  * column sweep  algorithm - row    wrapped interleaved storage
  * inner product algorithm - column wrapped interleaved storage
  * send-ahead and compute-ahead
  * symmetric matrices
  * shared memory machines
2.5 Multiple right side vectors
2.6 Banded systems, sequential banded LU [Ortega 2.3]
    Banded systems, parallel banded LU - pivoting [Ortega 2.3]
2.7 Triangular banded systems [Ortega 2.3]
2.8 Tridiagonal systems - odd-even and cyclic reduction [Ortega 2.3, pg. 125]
6-10-2022 (1 hr)
2.8 Tridiagonal systems - odd-even and cyclic reduction [Ortega 2.3, pg. 125]
    end
2.9 Narrow banded systems - Partitioning methods [Ortega 2.3, pg. 114-120]
  - Partitioning Method I
11-10-2022 (2 hrs)
  - Partitioning Method I (end)
  - Partitioning Method II

II. Boundary Value Problems: a one-dimensional example [Ortega 2.3, pg. 120]
-.- Boundary Value Problems: a two-dimensional example [Ortega 3.1, pg. 134-135]

13-10-2022 (1 hr)
2.10 Domain decomposition - Schur complement methods [Ortega 2.2, pg. 120-125]
     [also Saad 3.1, 3.2, Zhu 2.5.3.3]
  *  Domain decomposition in 1D - ordering - arrowhead matrix
  *  General banded matrix - ordering - arrowhead matrix
  *  Schur complement - capacitance - Gauss transform system
  *  Solving the arrowhead system
  *  A parallel domain decomposition - Schur complement method
  *  Domain decomposition in 2D - ordering - arrowhead matrix
     size of reduced system, bandwidth of blocks

18-10-2022 (2 hrs)
  *  Domain decomposition in 2D - ordering - arrowhead matrix
     size of reduced system, bandwidth of blocks (end)
III  Inner products, vector, matrix and function norms
     condition number of matrix

3    Iterative methods for the solution of linear systems
3.1  Introduction - iterative methods - stopping criteria - splittings
     [Ortega 3.1, pg. 133-134, 138-139, see also Saad 4.1]
20-10-2022 (1 hr)
Discussion on A1
25-10-2022 (2 hrs)
Midterm test
27-10-2022 (1 hr)
Discussion on midterm
3.2  Basic iterative methods: Jacobi, Gauss-Seidel, SOR, SSOR
     [Ortega 3.1, pg. 133-134, 3.2, pg. 156-160, see also Saad 4.1]
3.3  Convergence of iterative linear solvers
1-11-2022 (2 hrs)
3.3  Convergence of iterative linear solvers (end)
     [Ortega 3.1, pg. 134, 3.2, pg. 157-158, see also Saad 4.2]
3.4  The Conjugate Gradient method [Ortega 3.3, see also Saad 6.7, Zhu 2.5.1]
3.5  Preconditioning [Ortega 3.4, see also Saad 10.1-3]
  *  Incomplete Factorisation preconditioning [Ortega pg. 211-214]
3-11-2022 (1 hr)
  *  Block diagonal preconditioning
  *  SSOR preconditioning
3.6  The Preconditioned Conjugate Gradient method
     [Ortega 3.4, Saad 9.2, Zhu 2.5.3]
3.7  Parallel Jacobi method - application to the 2D BVP
     [Ortega 3.1, Saad 11.4-6, Zhu 2.2.1, 2.2.3.1]
Fall break (reading week)
15-11-2022 (2 hrs)
3.8  Asynchronous iterative methods [Ortega 3.1, pg 138]
3.9  Block iterative methods - Parallel block Jacobi for the 5-pt-star matrix
     [Ortega 3.1, pg. 145-148, see also Saad 12.2]
3.10 Parallel Conjugate Gradient method - application to the 2D BVP
     [Ortega 3.3, Zhu 2.5.2]
3.11 The use of CG in solving the Schur complement system
     [Ortega 3.3, pg. 194-195, see also Saad 13.4, Zhu 2.5.3.3]
3.12 Parallel Gauss-Seidel and related methods - application to the 2D BVP
     [Ortega 3.2]
17-11-2022 (2 hrs)
3.13 The red-black ordering - parallel GS, SOR and SSOR methods for the 5-pt-star matrix
     [Ortega 3.2, Saad 12.4, Zhu 2.2.2, 2.2.3.2]
22-11-2022 (2 hrs)
3.14 Multicolour orderings - parallel GS, SOR and SSOR methods for the 9-pt-star matrix
     [Ortega 3.2, Saad 12.4, Zhu 2.2.3.2]
3.15 The block Gauss-Seidel and related methods for the 5-point-star matrix
     [Ortega 3.2]

0V. Tensor product of matrices
-.- Tensor product form of discrete 2D BVPs arising from FDMs
24-11-2022 (1 hr)
-.- Diagonalization and block-diagonalization of matrices

4.2  Fourier solvers
     The DFT matrix
29-11-2022 (2 hrs)
     The Discrete Fourier Transform and the Fast Fourier Transform Algorithms
     Solving one-dimensional BVPs using FFTs
     Solving two-dimensional BVPs using tensor products and FFTs
1-12-2022 (1 hr)
     Parallel computation of the FFT
     Parallel computation of FFT solvers for two-dimensional problems
4.4  Parallel data (block matrix) transposition
6-12-2022 (1.5 hrs)
4.4  Parallel data (block matrix) transposition (end)

0-IV Gray codes

Summary

Notes and handouts: Course information, Outline

Access to the data below requires that you type in your CDF (teaching lab) username (as login) and last 5 digits of your student number (as pass). This password (for accessing the website) cannot be changed.

Lecture notes

Assignments
