Course information for current students:
Important note on the use of bulletin boards:
Do not post answers to the assignment problems, in whole or in part,
to the boards. Any violation of this rule will have consequences for
the poster. Please use judgement before posting.
Material covered in the course
(Section numbers correspond to the notes; other references are to
sections of the various books mentioned in the course outline handout.)
13-1-2021 (2 hrs)

1 Introduction

1.1 Motivation for high-performance and parallel computing

1.2 Parallel architectures
    [roughly from Ortega 1.1, see also Foster 3.7.2, Zhu 1.2, 2.3-5]
    * Vector versus parallel
    * Parallel versus distributed
    * SIMD versus MIMD
    * Shared versus local memory
      Def: contention and communication time
      Def: communication diameter
      Def: valence
    * Interconnection networks
      - Completely connected
      - Bus network, ring network
      - Mesh connection
        + 1-dim (linear array), ring
        + 2-dim, 2D torus
        + 3-dim, 3D torus
      - k-ary tree
      - Hypercube (d-dim cube)
      - Butterfly (switching network), cube-connected cycles network
      - Shuffle-exchange
      - Omega network
      - Other: hybrid schemes, hierarchical schemes, clusters, etc.
    * Mappings between interconnection networks
      - equivalence between an n x log n butterfly (for normal algorithms),
        an n-leaf binary tree (for normal algorithms), a (log n)-dim cube,
        and an n-processor shuffle-exchange [Ullman, pgs 219-221]
      - simulation of a k-dim mesh by a d-dim hypercube
        [Bertsekas, Tsitsiklis, pgs 52-54, Kumar 2.7]
        (a small Gray-code illustration of this embedding appears at the
        end of this page)

1.3 Some concepts and definitions in parallel computing
    [roughly from Ortega 1.2 (pgs 20-25), see also Zhu 1.3, Foster 3.3,
    Kumar 3.1, 5.1-3]
    * degree of parallelism of a parallel algorithm
    * granularity of a parallel algorithm
    * speedup and efficiency of a parallel algorithm on a parallel machine
    * data ready time
    * load balancing

1.4 Simple examples [roughly from Ortega 1.2 and 1.3]
    * adding two vectors of size n
    * summing up n numbers (directed sum-up to a designated processor,
      or global sum-up)
    * broadcast a number [see also Foster 2.3.2, 2.4.1, 2.4.2]

14-1-2021 (1 hr)

    * inner product of two vectors
    * matrix-vector multiplication (by rows and by columns)
      [pgs 36-38, Kumar Ex 2.5, Ex 3.1]
    * all-to-all broadcast (total exchange) algorithm [Kumar 6.6]
    * global sum-up of n vectors
    (an MPI sketch of the broadcast and global sum-up operations appears
    at the end of this page)

1.5 Performance study
    Modelling performance - computation time, communication time
    [Foster 3.3, 3.7, Kumar 2.5]
    (the communication-time model is written out at the end of this page)

20-1-2021 (2 hrs)

    Obtaining experimental data [Foster 3.5]
    Fitting data to models [Foster 3.5]

1.6 Measuring and studying speedup and efficiency
    [Ortega 1.2, pgs 25-27, Zhu 1.3, Foster 3.4, Kumar 5.4]
    * speedup based on the sequential time, Amdahl's law
    * speedup based on the parallel time, Gustafson's model
    * scaled (workload) speedup, scaled memory speedup
    * experimentally measuring scaled speedup
    (both speedup formulas are written out at the end of this page)

1.7 Scalability analysis [Foster 3.4, Kumar 5.4-6]
    Scalability with fixed problem size
    Scalability with scaled problem size - isoefficiency function
    Efficiency and scalability: an example and some considerations
    (the isoefficiency relation is written out at the end of this page)

1.8 MPI
    General
    Example 1 (test0c.c)
    Send and Receive
    Example 2 (test1c.c)
    Collective operations
    Timing in MPI
    Example 3 (test3c.c)
    (a minimal send/receive and timing sketch appears at the end of this page)

1.9 GPU computing
    History
    GPU architecture, CUDA API, limits, CUDA C
    Example 1
    - Example with shared memory
    - Timing in CUDA
    (a minimal CUDA C sketch appears at the end of this page)

Handouts: Outline
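
The mesh-to-hypercube simulation in Section 1.2 rests on the binary
reflected Gray code. A small C illustration (mine, not from the notes;
function names are illustrative only): position i of a 2^d-node ring is
mapped to hypercube node i XOR (i >> 1), so neighbouring ring positions
land on hypercube nodes whose labels differ in exactly one bit, i.e. on
neighbouring hypercube nodes. A k-dim mesh is embedded by Gray-coding
each dimension and concatenating the bit strings.

#include <stdio.h>

/* binary reflected Gray code of i */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* number of bit positions in which two labels differ */
int hamming(unsigned a, unsigned b) {
    int d = 0;
    for (unsigned x = a ^ b; x; x >>= 1) d += x & 1u;
    return d;
}

int main(void) {
    const unsigned d = 3, n = 1u << d;   /* 8-node ring into a 3-cube */
    for (unsigned i = 0; i < n; i++) {
        unsigned a = gray(i), b = gray((i + 1) % n);
        printf("ring %u -> cube node %u (next differs in %d bit)\n",
               i, a, hamming(a, b));
    }
    return 0;
}

Because n is a power of two, the wraparound edge of the ring also maps
to a single-bit difference, so the whole ring embeds with dilation 1.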
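
A minimal sketch (mine, not a course handout) of two Section 1.4
operations expressed with MPI collectives: broadcasting a number and a
global sum-up. MPI_Bcast and MPI_Allreduce internally use tree- and
recursive-doubling-style patterns of the kind analysed by hand in the
notes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* broadcast a number from process 0 to all processes */
    double x = 3.14;
    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* global sum-up: every process contributes a value,
       and every process receives the total */
    double local = (double)rank, sum;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d: x = %g, sum = %g\n", rank, x, sum);
    MPI_Finalize();
    return 0;
}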
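
For reference, the linear communication model of Section 1.5 in the
notation of Kumar 2.5 (the notes' symbols may differ): sending a
message of m words costs a startup time t_s plus a per-word transfer
time t_w,

    T_{\mathrm{comm}} = t_s + t_w\, m .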
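
The two speedup models of Section 1.6, written out in one common
notation (mine; the notes may use different symbols). For Amdahl's law,
f is the serial fraction of the sequential time; for Gustafson's model,
f is the serial fraction of the parallel time:

    S_{\mathrm{Amdahl}}(p) = \frac{1}{f + (1-f)/p},
    \qquad
    S_{\mathrm{Gustafson}}(p) = f + (1-f)\,p,
    \qquad
    E(p) = \frac{S(p)}{p} .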
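
The isoefficiency relation of Section 1.7 in Kumar's notation, where W
is the problem size (sequential work) and T_o(W,p) is the total
overhead of the parallel algorithm on p processors:

    E = \frac{1}{1 + T_o(W,p)/W},
    \qquad\text{so fixed efficiency requires}\qquad
    W = K\, T_o(W,p), \quad K = \frac{E}{1-E} .

The isoefficiency function is the rate at which W must grow with p to
keep E constant.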
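
A minimal sketch of the Section 1.8 send/receive and timing patterns.
This is not the course's test0c.c/test1c.c/test3c.c; names and values
here are illustrative only. Compile with mpicc and run with
mpirun -np 2 (or more).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();   /* wall-clock timing, as in "Timing in MPI" */

    if (size >= 2) {
        if (rank == 0) {
            int msg = 42;
            /* blocking point-to-point send to process 1 */
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", msg);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("elapsed: %g s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}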
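
A minimal CUDA C sketch in the spirit of Section 1.9's shared-memory
and timing examples (not the course's Example 1; kernel name and sizes
are illustrative): a per-block tree reduction in shared memory, timed
with CUDA events. N must be a multiple of THREADS, and THREADS a power
of two, for the reduction to be correct.

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024
#define THREADS 256

__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[THREADS];        /* per-block shared memory */
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = in[i];
    __syncthreads();
    /* tree reduction within the block */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

int main(void) {
    float h_in[N], h_out[N / THREADS];
    for (int i = 0; i < N; i++) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, (N / THREADS) * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    /* timing in CUDA with events */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    block_sum<<<N / THREADS, THREADS>>>(d_in, d_out);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* finish the sum of the per-block partial results on the host */
    cudaMemcpy(h_out, d_out, (N / THREADS) * sizeof(float),
               cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int i = 0; i < N / THREADS; i++) total += h_out[i];
    printf("sum = %g (expected %d), kernel time %g ms\n", total, N, ms);

    cudaFree(d_in); cudaFree(d_out);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}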