CUDA

Physical Architecture (Model of Computation)

Our GPUs: the GTX 1080 Ti or GTX 1080, both built on the Pascal architecture, which is similar to, but an expansion of, the older Fermi architecture.
GP104 SM: Streaming Multiprocessor. Each SM contains four processing blocks, and each block handles a single 32-thread warp at a time, so 4 blocks = 4 warps processed in parallel. A warp runs its piece of the kernel in lock step, but a warp is only a piece of a Thread Block; this is why Thread Blocks must be explicitly synchronized. The GP104 GPU consists of 20 SMs, each with 4 units of 32 cores, for a total of 20*4*32 = 2560 cores.

deviceQuery output for our machine (key lines boxed):

    rosenbl6@mcs0:~$ samples/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery/deviceQuery
    samples/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery/deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 2 CUDA Capable device(s)

    Device 0: "GeForce GTX 1080"
      CUDA Driver Version / Runtime Version          10.0 / 10.0
      CUDA Capability Major/Minor version number:    6.1
      Total amount of global memory:                 8120 MBytes (8513978368 bytes)
    +--------------------------------------------------------------+
    |(20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores|
    +--------------------------------------------------------------+
      GPU Max Clock rate:                            1772 MHz (1.77 GHz)
      Memory Clock rate:                             5005 Mhz
      Memory Bus Width:                              256-bit
      L2 Cache Size:                                 2097152 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
    +--------------------------------------------------------------+
    |Warp size:                                     32             |
    +--------------------------------------------------------------+
      Maximum number of threads per multiprocessor:  2048
    +--------------------------------------------------------------+
    |Maximum number of threads per block:           1024           |
    +--------------------------------------------------------------+
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device supports Compute Preemption:            Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 8 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    Device 1: "GeForce GTX 1080"
      CUDA Driver Version / Runtime Version          10.0 / 10.0
      CUDA Capability Major/Minor version number:    6.1
      Total amount of global memory:                 8114 MBytes (8508604416 bytes)
      (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
      GPU Max Clock rate:                            1772 MHz (1.77 GHz)
      Memory Clock rate:                             5005 Mhz
      Memory Bus Width:                              256-bit
      L2 Cache Size:                                 2097152 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     Yes
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device supports Compute Preemption:            Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 66 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    > Peer access from GeForce GTX 1080 (GPU0) -> GeForce GTX 1080 (GPU1) : Yes
    > Peer access from GeForce GTX 1080 (GPU1) -> GeForce GTX 1080 (GPU0) : Yes

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 2
    Result = PASS
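
The same numbers can be pulled programmatically. A minimal sketch (not the deviceQuery source itself; printf labels are ours) using the CUDA runtime's cudaGetDeviceProperties, printing just the boxed fields above:

    // Query each device's properties via the CUDA runtime API.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: \"%s\"\n", dev, prop.name);
            printf("  SMs:                   %d\n", prop.multiProcessorCount);
            printf("  Warp size:             %d\n", prop.warpSize);
            printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
            printf("  Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        }
        return 0;
    }

Compile with nvcc (e.g. nvcc query.cu -o query).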

Logical Architecture (Model of Computation)


[Figure from The Fermi Whitepaper, pg 6]
Host
The CPU and its memory (host memory)
Device
The CUDA GPU and its memory (device memory)
Thread
A single thread of execution
Thread Block
A Thread Block is a group of threads which execute on the same multiprocessor (SM). Threads within a Thread Block have access to shared memory and can be explicitly synchronized. Currently there is a maximum of 1024 threads per Thread Block.

Thread Blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows Thread Blocks to be scheduled in any order across any number of cores, enabling programmers to write code that scales with the number of cores.
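
A hedged sketch of block-level cooperation, assuming a block size of 256 and an input length that is a multiple of 256 (kernel and variable names are illustrative):

    // Each block reverses its own 256-element chunk of data in place.
    // Shared memory is visible to every thread in the block; __syncthreads()
    // is the explicit block-wide barrier mentioned above.
    __global__ void reverseInBlock(float *data) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[i];   // each thread loads one element
        __syncthreads();               // wait until the whole block has loaded
        data[i] = tile[blockDim.x - 1 - threadIdx.x];
    }
    // Launch: reverseInBlock<<<numBlocks, 256>>>(d_data);

Note that the barrier only synchronizes threads within one block; there is no barrier across blocks, which is exactly the independence requirement above.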

Warp
An implementation detail. The threads in a block are not all executed simultaneously; instead they are broken into Warps (32 threads each), and the threads within a warp execute in parallel, in lock step. From: The Fermi Whitepaper pg 10
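
Because a warp executes in lock step, a branch that splits the threads of one warp forces both paths to run one after the other (warp divergence), while a branch that is uniform within each warp costs nothing extra. A hedged sketch (kernel name is illustrative):

    __global__ void divergenceDemo(int *out) {
        // Threads are grouped into warps by consecutive index:
        // warp 0 holds threads 0-31, warp 1 holds threads 32-63, ...
        int warpId = threadIdx.x / warpSize;  // warpSize is 32 on these GPUs
        if (warpId % 2 == 0)                  // uniform within a warp: no divergence
            out[threadIdx.x] = 1;
        else                                  // (threadIdx.x % 2 instead would split
            out[threadIdx.x] = 2;             //  every warp and serialize both paths)
    }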
Grid
A Grid is the collection of all threads launched for one kernel invocation. Threads in a Grid execute the same Kernel Function and are divided into Thread Blocks.
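
A kernel locates its thread within the Grid through the built-in blockIdx, blockDim, and threadIdx variables. A minimal sketch of the usual indexing pattern (names are illustrative):

    __global__ void scaleKernel(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index in the grid
        if (i < n)                                      // the grid may overshoot n
            x[i] *= a;
    }
    // Host side: round the block count up so the grid covers all n elements.
    // int threads = 256;
    // int blocks  = (n + threads - 1) / threads;
    // scaleKernel<<<blocks, threads>>>(d_x, 2.0f, n);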
Host Memory
Separate from GPU memory. The old CUDA API required an explicit sequence (see the sketch below):
1. malloc in CPU memory
2. malloc in GPU memory
3. copy from CPU to GPU
4. compute on the GPU
5. copy from GPU to CPU
The newer CUDA API supports Unified Memory: automatic memory sharing between CPU and GPU, done through copy-on-read page faults. It can have some unexpected performance implications. See ...
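
A hedged sketch of both flows, using the real cudaMalloc/cudaMemcpy/cudaMallocManaged calls (the kernel and function names are illustrative):

    #include <cuda_runtime.h>

    __global__ void addOne(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    // Old-style explicit flow: allocate on both sides, copy by hand.
    void classicFlow(float *h_x, int n) {
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));                              // malloc on GPU
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);  // CPU -> GPU
        addOne<<<(n + 255) / 256, 256>>>(d_x, n);                         // compute on GPU
        cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);  // GPU -> CPU
        cudaFree(d_x);
    }

    // Unified Memory flow: one pointer, pages migrate on demand.
    void unifiedFlow(int n) {
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));  // visible to both CPU and GPU
        for (int i = 0; i < n; ++i) x[i] = 0.0f;   // CPU touches the pages...
        addOne<<<(n + 255) / 256, 256>>>(x, n);    // ...GPU faults them over as needed
        cudaDeviceSynchronize();                   // wait before the CPU reads results
        cudaFree(x);
    }

The on-demand page migrations in the unified flow are the source of the unexpected performance implications mentioned above.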
GPU Memory

C API
