CSC456-2306S High-Performance Scientific Computing

Bulletin board for CSC456-2306, Winter 2021 -- course outline

Course information for current students:

Material covered in the course. (The numbering corresponds to the lecture notes; the other references are sections of the books listed in the course outline handout.)

13-1-2021 (2 hrs)
1   Introduction
1.1 Motivation for high-performance and parallel computing
1.2 Parallel architectures
    [roughly from Ortega 1.1, see also Foster 3.7.2, Zhu 1.2, 2.3-5]
  * Vector versus parallel
  * Parallel versus distributed
  * SIMD versus MIMD
  * Shared versus local memory
    Def: contention and communication time
    Def: communication diameter
    Def: valence
  * Interconnection networks
  - Completely connected
  - Bus network, ring network
  - Mesh connection
    + 1-dim (linear array), ring
    + 2-dim, 2D torus
    + 3-dim, 3D torus
  - k-ary tree
  - Hypercube (d-dim cube)
  - Butterfly (switching network), cube-connected-cycles network
  - Shuffle-exchange
  - Omega network
  - Other: hybrid schemes, hierarchical schemes, clusters, etc.
  * Mappings between interconnection networks
  - equivalence between an n x log n butterfly (for normal algorithms),
    an n-leaf binary tree (for normal algorithms),
    a (log n)-dim cube, and an n-processor shuffle-exchange
    [Ullman, pgs 219-221]
  - simulation of a k-dim mesh by a d-dim hypercube
    [Bertsekas, Tsitsiklis, pgs 52-54, Kumar 2.7]
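    A minimal C sketch of the Gray-code idea behind the mesh/ring-to-hypercube
    simulation above (illustrative only; gray() and the dimension d = 3 are
    assumptions, not taken from the notes):

      #include <stdio.h>

      /* binary-reflected Gray code: consecutive integers map to bit
         strings that differ in exactly one bit */
      static unsigned gray(unsigned i) { return i ^ (i >> 1); }

      int main(void) {
          const unsigned d = 3;          /* hypercube dimension (assumed) */
          const unsigned n = 1u << d;    /* ring of n = 2^d processors    */
          for (unsigned i = 0; i < n; i++) {
              unsigned a = gray(i), b = gray((i + 1) % n);
              /* a ^ b has a single bit set, so ring neighbours land on
                 adjacent hypercube nodes (a dilation-1 embedding) */
              printf("ring %u -> cube node %u (next differs in bit mask %u)\n",
                     i, a, a ^ b);
          }
          return 0;
      }

    The 2-dim and 3-dim mesh embeddings use the same idea, with one Gray code
    per mesh dimension concatenated into the hypercube node label.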
1.3 Some concepts and definitions in parallel computing
    [roughly from Ortega 1.2 (pgs 20-25), see also Zhu 1.3, Foster 3.3,
     Kumar 3.1, 5.1-3]
  * degree of parallelism of a parallel algorithm
  * granularity of a parallel algorithm
  * speedup and efficiency of a parallel algorithm on a parallel machine
    (the standard definitions are recalled after this list)
  * data ready time
  * load balancing
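    As a quick reminder of the standard forms of the definitions above
    (textbook statements, with T_1 the sequential time and T_p the time on p
    processors; the numeric example is mine):

      S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p}

    For instance, T_1 = 100 s, p = 8 and T_p = 16 s give S_8 = 6.25 and
    E_8 \approx 0.78.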
1.4 Simple examples [roughly from Ortega 1.2 and 1.3]
  * adding two vectors of size n
  * summing up n numbers (directed sum-up to processor npout or global sum-up;
    an MPI sketch follows this list)
  * broadcast a number
    [see also Foster 2.3.2, 2.4.1, 2.4.2]
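    A minimal MPI sketch of the fan-in (directed) global sum-up above; it is
    illustrative only, not one of the course example programs, and each process
    is assumed to start with one partial value:

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int p, id;
          double local = 1.0, incoming;   /* each process holds one partial sum */
          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &p);
          MPI_Comm_rank(MPI_COMM_WORLD, &id);
          /* fan-in tree: in step s a process whose id has bit s set sends its
             partial sum to the partner id - s and then drops out */
          for (int s = 1; s < p; s <<= 1) {
              if (id & s) {
                  MPI_Send(&local, 1, MPI_DOUBLE, id - s, 0, MPI_COMM_WORLD);
                  break;
              } else if (id + s < p) {
                  MPI_Recv(&incoming, 1, MPI_DOUBLE, id + s, 0,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  local += incoming;
              }
          }
          if (id == 0) printf("global sum = %f\n", local);
          MPI_Finalize();
          return 0;
      }

    Run in reverse (receives becoming sends), the same tree gives the broadcast;
    exchanging in both directions at every step gives the global sum-up on all
    processes.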
14-1-2021 (1 hr)
  * inner product of two vectors
  * matrix-vector multiplication (by rows and by columns; a row-wise sketch
    follows this list) [pgs 36-38, Kumar Ex 2.5, Ex 3.1]
  * all-to-all broadcast (total exchange) algorithm [Kumar 6.6]
  * global sum-up of n vectors
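    A sketch of the row-wise matrix-vector product mentioned above, assuming the
    n x n matrix is distributed by blocks of rows, every process already holds
    the full vector x, and n is divisible by the number of processes (the names
    and data layout are assumptions, not the course's example code):

      #include <mpi.h>
      #include <stdlib.h>

      /* y = A*x with A block-row distributed: each process multiplies its own
         rows locally, then MPI_Allgather assembles the full y on every process */
      void matvec_by_rows(int n, const double *Arows, const double *x, double *y) {
          int p;
          MPI_Comm_size(MPI_COMM_WORLD, &p);
          int rows = n / p;
          double *ylocal = malloc(rows * sizeof(double));
          for (int i = 0; i < rows; i++) {
              double s = 0.0;
              for (int j = 0; j < n; j++)
                  s += Arows[i * n + j] * x[j];
              ylocal[i] = s;
          }
          MPI_Allgather(ylocal, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE,
                        MPI_COMM_WORLD);
          free(ylocal);
      }

    The by-columns variant would instead give each process a block of columns
    and the matching piece of x, and finish with a sum reduction of the partial
    y vectors rather than a gather.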
1.5 Performance study
    Modelling performance - computation time, communication time
    [Foster 3.3, 3.7, Kumar 2.5]
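    The communication cost model used in these analyses, in its usual linear
    form (a standard textbook model, stated here only as a reminder; t_c is the
    time per arithmetic operation, t_s the message startup time and t_w the
    per-word transfer time):

      T_{msg} = t_s + t_w\,m   for a message of m words,

    so, for example, the fan-in sum of n numbers on p processors is modelled as

      T_p \approx (n/p - 1)\,t_c + (t_c + t_s + t_w)\log_2 p .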
20-1-2021 (2 hrs)
    Obtaining experimental data [Foster 3.5]
    Fitting data to models [Foster 3.5]
1.6 Measuring and studying speedup and efficiency
    [Ortega 1.2, pgs 25-27, Zhu 1.3, Foster 3.4, Kumar 5.4]
  * speedup based on the sequential time, Amdahl's law
  * speedup based on the parallel time, Gustafson's model
    (closed forms for both speedup laws follow this list)
  * scaled (workload) speedup, scaled memory speedup
  * experimentally measuring scaled speedup
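    The two speedup laws above in their usual closed forms (standard statements,
    with f the serial fraction of the work; the numbers in the example are mine):

      S_{Amdahl}(p) = \frac{1}{f + (1 - f)/p} \le \frac{1}{f}, \qquad
      S_{Gustafson}(p) = f + (1 - f)\,p

    For f = 0.1 and p = 10 this gives S_{Amdahl} \approx 5.3 for a fixed-size
    problem, but S_{Gustafson} = 9.1 when the workload is scaled with p.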
1.7 Scalability analysis [Foster 3.4, Kumar 5.4-6]
    Scalability with fixed problem size
    Scalability with scaled problem size - isoefficiency function
    Efficiency and scalability: an example and some considerations
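    A worked isoefficiency example in the spirit of this section (the standard
    global-sum case, not taken verbatim from the notes): with T_1 = n t_c and
    T_p \approx (n/p)\,t_c + (t_c + t_s + t_w)\log_2 p, the overhead and
    efficiency are

      T_o = p\,T_p - T_1 \approx (t_c + t_s + t_w)\,p\log_2 p, \qquad
      E = \frac{T_1}{p\,T_p} = \frac{1}{1 + T_o/T_1},

    so keeping E fixed requires the work W = T_1 to grow like p log p; the
    isoefficiency function of the global sum-up is Theta(p log p).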
1.8 MPI
    General
    Example 1 (test0c.c)
    Send and Receive
    Example 2 (test1c.c)
    Collective operations
    Timing in MPI
    Example 3 (test3c.c)
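    A minimal self-contained MPI program in the spirit of the examples above
    (a sketch only, not test0c.c/test1c.c/test3c.c): every process contributes
    one value, the collective MPI_Reduce combines them on process 0, and
    MPI_Wtime brackets the operation for timing.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int id, p;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &id);
          MPI_Comm_size(MPI_COMM_WORLD, &p);

          double t0 = MPI_Wtime();
          int local = id, total = 0;
          /* collective operation: sum the ranks of all processes onto process 0 */
          MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
          double t1 = MPI_Wtime();

          if (id == 0)
              printf("p = %d, sum of ranks = %d, time = %g s\n", p, total, t1 - t0);
          MPI_Finalize();
          return 0;
      }

    Compile with mpicc and run with mpirun (e.g. mpirun -np 4 ./a.out).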
1.9 GPU computing
    History
    GPU architecture, CUDA API, limits, CUDA C
    Example 1 - Example with shared memory - Timing in CUDA
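    A small CUDA C sketch along the lines of the shared-memory and timing items
    above (illustrative only, not the course example): one block stages its data
    in __shared__ memory, reduces it to a single sum, and the kernel is timed
    with CUDA events.

      #include <stdio.h>
      #include <cuda_runtime.h>

      #define N 1024                    /* one block of N threads (assumed) */

      /* each thread loads one element into shared memory, then the block
         reduces the buffer to a single sum with a tree of halving strides */
      __global__ void block_sum(const float *in, float *out) {
          __shared__ float buf[N];
          int t = threadIdx.x;
          buf[t] = in[t];
          __syncthreads();
          for (int s = blockDim.x / 2; s > 0; s >>= 1) {
              if (t < s) buf[t] += buf[t + s];
              __syncthreads();
          }
          if (t == 0) *out = buf[0];
      }

      int main(void) {
          float h_in[N], h_out = 0.0f, *d_in, *d_out;
          for (int i = 0; i < N; i++) h_in[i] = 1.0f;
          cudaMalloc(&d_in, N * sizeof(float));
          cudaMalloc(&d_out, sizeof(float));
          cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

          cudaEvent_t start, stop;              /* timing in CUDA */
          cudaEventCreate(&start);
          cudaEventCreate(&stop);
          cudaEventRecord(start);
          block_sum<<<1, N>>>(d_in, d_out);
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);

          float ms = 0.0f;
          cudaEventElapsedTime(&ms, start, stop);
          cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
          printf("sum = %f, kernel time = %f ms\n", h_out, ms);
          cudaFree(d_in);  cudaFree(d_out);
          return 0;
      }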

Handouts: Outline
To access these notes, use your cdf (teaching labs) username and, as the password, the last 5 digits of your student ID number. This password cannot be reset.

Lecture notes

Recordings