Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores, enabling programmers to write code that scales with the number of cores.
Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory. In addition to __syncthreads(), the Cooperative Groups API provides a rich set of thread-synchronization primitives. For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.
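As an illustrative sketch (the kernel name and block size are assumptions, not from the original text), the kernel below shows threads of a block cooperating through shared memory: each thread stages one element in shared memory, the block synchronizes with __syncthreads(), and each thread then reads an element written by a different thread.

    // Hypothetical example: reverse the elements handled by each block.
    // Without the __syncthreads() barrier, a thread could read its
    // "mirror" element before the owning thread has written it.
    __global__ void reverse_per_block(int *data)
    {
        __shared__ int tile[256];            // assumes blockDim.x == 256
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;

        tile[t] = data[i];                   // stage one element in shared memory
        __syncthreads();                     // barrier: all writes now visible

        data[i] = tile[blockDim.x - 1 - t];  // read an element written by another thread
    }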
nvcc
is the CUDA compiler. It treats your program as two distinct, interleaved pieces: one part executes on the GPU, the other executes on the CPU. nvcc splits the program and compiles the device (CUDA) part with NVIDIA's compiler and the host (CPU) part with the system's native C/C++ compiler.
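A minimal sketch of what nvcc sees (the file name and kernel are made up for illustration): the __global__ function is the device part that nvcc hands to NVIDIA's compiler, while main() is ordinary host code handed to the native compiler.

    // example.cu -- hypothetical file; could be built with: nvcc example.cu -o example
    #include <cstdio>

    // Device part: compiled by NVIDIA's compiler, runs on the GPU.
    __global__ void add_one(int *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] += 1;
    }

    // Host part: compiled by the native C/C++ compiler, runs on the CPU.
    int main()
    {
        const int n = 256;
        int *x;
        cudaMallocManaged(&x, n * sizeof(int));   // unified memory, for brevity
        for (int i = 0; i < n; ++i) x[i] = i;

        add_one<<<1, n>>>(x);                     // kernel launch from host code
        cudaDeviceSynchronize();                  // wait for the GPU to finish

        printf("x[0] = %d, x[255] = %d\n", x[0], x[n - 1]);
        cudaFree(x);
        return 0;
    }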
nvprof
is a command-line profiler for taking a look at the performance of your code: it reports, among other things, how much time is spent in each kernel and in memory transfers between host and device.
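For instance, assuming the program above was compiled to ./example, running nvprof ./example executes it and prints a summary of the time spent in each kernel, in host-device memory copies, and in CUDA API calls.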