CUDA Basics for GPU Parallel Programming
Thread Hierarchy
CUDA organizes threads in a hierarchy: threads are grouped into blocks, and blocks into a grid; in hardware, threads are executed in warps of 32. Threads within the same block can communicate via shared memory, while different blocks execute independently of one another.
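A minimal sketch of how a kernel combines the hierarchy's built-in variables into a unique global index (the kernel name and buffer are illustrative, not from this set):

    __global__ void scale(float *data, int n) {
        // One global index per thread: block offset plus thread offset
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }
    // Launch: a grid of blocks, each block a group of threads, e.g.
    // scale<<<(n + 255) / 256, 256>>>(d_data, n);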
Managed Memory
Managed memory in CUDA refers to the Unified Memory feature, which provides a single memory space accessible from both the CPU and GPU, simplifying data management between host and device.
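A minimal sketch of Unified Memory in use, assuming a single-GPU system (kernel and variable names are illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void addOne(int *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1;
    }

    int main() {
        int n = 1024;
        int *x;
        cudaMallocManaged((void**)&x, n * sizeof(int)); // one pointer usable on host and device
        for (int i = 0; i < n; ++i) x[i] = i;           // initialize on the CPU
        addOne<<<(n + 255) / 256, 256>>>(x, n);         // use the same pointer on the GPU
        cudaDeviceSynchronize();                        // wait before touching the data on the CPU again
        printf("%d\n", x[0]);
        cudaFree(x);
        return 0;
    }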
Constant Memory
Constant memory is a small (64 KB) read-only memory space in CUDA that is served through an on-chip cache; access is fastest when all threads in a warp read the same address. It is best used for data that does not change over the course of a kernel execution.
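A minimal sketch of declaring and filling a constant-memory table (names are illustrative):

    __constant__ float coeffs[16];                // lives in constant memory, read-only in kernels

    __global__ void applyCoeffs(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] *= coeffs[i % 16];      // all threads read the same small table
    }

    // Host side: copy data into the constant symbol before launching, e.g.
    // float h_coeffs[16] = { /* ... */ };
    // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));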
Warp
A warp is a group of 32 threads in CUDA that execute the same instruction at the same time (SIMT). Keeping the threads of a warp on the same execution path, avoiding branch divergence, is crucial for achieving full efficiency on GPU architectures.
Shared Memory
Shared memory in CUDA is a small region of fast, on-chip memory shared among the threads of a block, giving those threads a way to communicate and exchange data quickly.
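A minimal sketch of staging data in shared memory (the kernel is illustrative and assumes it is launched with 256 threads per block):

    __global__ void reverseBlock(int *data) {
        __shared__ int tile[256];                 // one copy per block, visible to all its threads
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;
        tile[t] = data[i];                        // stage a block's worth of data on-chip
        __syncthreads();                          // make sure every thread has written its element
        data[i] = tile[blockDim.x - 1 - t];       // read another thread's element from shared memory
    }
    // Launch with blockDim.x == 256 to match the tile size, e.g.
    // reverseBlock<<<numBlocks, 256>>>(d_data);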
CUDA Event API
The CUDA Event API allows fine-grained performance measurement: developers record events on the GPU timeline and measure the elapsed time between them, which facilitates the optimization of CUDA applications.
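A minimal sketch of timing a kernel with events (the placeholder kernel is illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void busyKernel() { }              // placeholder work to time

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);                   // mark a point on the GPU timeline
        busyKernel<<<1, 1>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);               // wait until the stop event has completed

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
        printf("kernel took %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }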
What is CUDA?
Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model created by NVIDIA for general-purpose computing on graphics processing units (GPUs). It lets developers write software in CUDA C/C++ that runs directly on the GPU.
Atomic Operations
Atomic operations in CUDA are used to prevent data races when different threads attempt to read and write to the same memory location simultaneously. They ensure these operations are completed without interference.
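A minimal sketch using atomicAdd so that concurrent updates to the same bin do not race (names are illustrative):

    __global__ void histogram(const unsigned char *in, unsigned int *bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[in[i]], 1u);          // safe even when many threads hit the same bin
    }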
Grids and Blocks
In CUDA, a grid is the collection of blocks launched for a single kernel; each block executes independently of the others. Both grids and blocks can be one-, two-, or three-dimensional, so they can map efficiently onto the problem space.
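A minimal sketch of a 2D execution configuration; width, height, and the kernel name are illustrative:

    // 2D problem (e.g. an image): 2D blocks tiled into a 2D grid
    dim3 block(16, 16);                           // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);  // enough blocks to cover width x height
    // processImage<<<grid, block>>>(d_img, width, height);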
NVCC
NVIDIA's CUDA Compiler (NVCC) separates device code from host code, compiling the device code into PTX (Parallel Thread Execution) and/or GPU machine code for CUDA-enabled GPUs and passing the host code to a standard C++ compiler.
CUDA Kernels
In CUDA, kernels are functions, marked with the __global__ qualifier, that run in parallel on the GPU. They are launched with a specific number of threads, organized into blocks and grids.
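A minimal sketch of a kernel and its launch (names are illustrative):

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];            // each thread handles one element
    }

    // Launched with an execution configuration <<<blocks, threadsPerBlock>>>, e.g.
    // int threads = 256;
    // int blocks  = (n + threads - 1) / threads;
    // vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);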
PTX
Parallel Thread Execution (PTX) is an intermediate representation of CUDA programs that enables forward compatibility with future hardware by delaying the final machine code generation until runtime.
Compiler Optimizations
Compiler optimizations in CUDA involve NVCC transforming the CUDA code to enhance performance, such as loop unrolling, inlining functions, and optimizing memory accesses.
Register Pressure
Register pressure in CUDA programming occurs when a kernel's live variables exceed the registers available per thread, potentially causing spills to local memory and reduced performance.
Texture Memory
Texture memory in CUDA is read-only memory accessed through a dedicated cache optimized for spatial locality; it can reduce memory read transactions when threads access spatially localized data.
Global Memory
Global memory is the largest and slowest form of memory accessible to all threads in a CUDA program. It is off-chip and has high latency, so it should be used carefully to avoid bottlenecks.
Memory Coalescing
Memory coalescing refers to combining the memory accesses of the threads in a warp into as few global-memory transactions as possible, leading to improved performance.
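A minimal sketch contrasting a coalesced access pattern with a strided one (kernel names are illustrative):

    // Coalesced: consecutive threads read consecutive addresses (few transactions)
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses far apart (many transactions)
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }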
CUDA Streams
CUDA streams are sequences of operations that execute in order on the GPU. Different streams can execute operations concurrently, making it possible to overlap computations with memory transfers.
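A minimal sketch of issuing independent work to two streams so copies and kernels can overlap (kernel and buffer names are illustrative):

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h_a, *h_b, *d_a, *d_b;
        cudaMallocHost((void**)&h_a, bytes);      // pinned host memory so copies can be asynchronous
        cudaMallocHost((void**)&h_b, bytes);
        cudaMalloc((void**)&d_a, bytes);
        cudaMalloc((void**)&d_b, bytes);

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Work in different streams may overlap; within a stream it runs in issue order
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
        scaleKernel<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);
        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
        scaleKernel<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);

        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        return 0;
    }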
Dynamic Parallelism
Dynamic parallelism in CUDA allows kernels to be launched from within other kernels running on the GPU, enabling more complex, nested parallel computations without returning to the CPU.
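A minimal sketch of a parent kernel launching a child kernel on the device (names are illustrative):

    __global__ void child(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    __global__ void parent(float *data, int n) {
        // Launch another kernel directly from the GPU, with no round trip to the CPU
        if (threadIdx.x == 0 && blockIdx.x == 0)
            child<<<(n + 255) / 256, 256>>>(data, n);
    }
    // Dynamic parallelism requires relocatable device code when compiling, e.g.
    // nvcc -rdc=true -lcudadevrt app.cu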
Streaming Multiprocessors (SMs)
In CUDA, Streaming Multiprocessors (SMs) are the fundamental computing units of an NVIDIA GPU, where the warps of threads are actually executed.
CUDA Occupancy
CUDA occupancy is the ratio of active warps to the maximum number of warps supported on an SM at any one time, and it affects the utilization and efficiency of GPU resources.
CUDA Math Libraries
CUDA provides a suite of math libraries, such as cuBLAS and cuFFT, which offer optimized implementations of mathematical functions and algorithms for use on GPUs.
Kernel Launch Parameters
Kernel launch parameters in CUDA specify the execution configuration for a kernel: the number of blocks, the number of threads per block, and optionally the amount of dynamic shared memory to allocate and the stream to launch in.
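A minimal sketch of the full execution configuration, with dynamic shared memory declared extern in the kernel (the kernel, stream, and buffers are illustrative):

    __global__ void smooth(float *data, int n) {
        extern __shared__ float buffer[];         // size is set by the launch, not at compile time
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buffer[threadIdx.x] = data[i];
        __syncthreads();
        if (i < n) data[i] = buffer[threadIdx.x];
    }

    // Third launch parameter sizes the dynamic shared memory; the fourth selects the stream
    // dim3 grid(64), block(256);
    // size_t sharedBytes = 256 * sizeof(float);
    // smooth<<<grid, block, sharedBytes, stream>>>(d_data, n);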
Unified Memory Advancements
Unified Memory advancements in recent CUDA versions have improved support for automatic memory migration, coherency, and granularity, making programming for GPUs more similar to conventional programming.
Synchronization Functions
Synchronization functions in CUDA, such as __syncthreads(), allow threads within the same block to coordinate by waiting until all threads have reached the same point in the code.
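A minimal sketch of __syncthreads() coordinating a block-level reduction (illustrative kernel, assuming 256 threads per block):

    __global__ void blockSum(const float *in, float *out) {
        __shared__ float partial[256];            // launch with 256 threads per block
        int t = threadIdx.x;
        partial[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();                          // all loads into shared memory are done

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s) partial[t] += partial[t + s];
            __syncthreads();                      // every thread finished this reduction step
        }
        if (t == 0) out[blockIdx.x] = partial[0]; // one result per block
    }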
Asynchronous Data Transfer
Asynchronous data transfer in CUDA allows copies to and from the device to overlap with kernel execution, provided pinned (page-locked) host memory and separate streams are used, minimizing idle time and improving the overall performance of the application.
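A minimal sketch of asynchronous copies with pinned host memory (buffer names and sizes are illustrative):

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;
        float *h_buf, *d_buf;
        cudaMallocHost((void**)&h_buf, bytes);    // pinned (page-locked) host memory is required
        cudaMalloc((void**)&d_buf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // These calls return immediately; the copies run while the CPU or other streams keep working
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);            // block the host until the stream's work is done

        cudaStreamDestroy(stream);
        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }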
Error Handling
Error handling in CUDA involves checking the return status of API calls and kernel launches, using functions such as cudaGetLastError() and cudaGetErrorString() to diagnose issues within CUDA applications.
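A minimal sketch of a common error-checking pattern (the macro name is illustrative, not a CUDA API):

    #include <cstdio>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess)                                       \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
        } while (0)

    // CUDA_CHECK(cudaMalloc((void**)&d_ptr, bytes)); // check API calls
    // myKernel<<<grid, block>>>(/* args */);
    // CUDA_CHECK(cudaGetLastError());                // check the kernel launch itself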
Launch Bounds
Launch bounds in CUDA (the __launch_bounds__ qualifier) tell the compiler the maximum number of threads per block a kernel will be launched with, and optionally the minimum number of resident blocks per multiprocessor, which helps it manage register usage and improve occupancy.
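A minimal sketch of a kernel annotated with launch bounds (the kernel name is illustrative):

    // At most 256 threads per block, and at least 2 resident blocks per SM;
    // the compiler uses this to cap register usage per thread.
    __global__ void __launch_bounds__(256, 2) tunedKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 0.5f;
    }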
Peer-to-Peer Memory Copy
Peer-to-peer memory copy allows direct memory copies between GPUs without staging through host memory, reducing overhead and improving memory transfer performance between multiple GPUs.
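A minimal sketch assuming a system with two GPUs (device numbers and buffer names are illustrative):

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);        // can device 0 access device 1's memory?

        float *d0, *d1;
        cudaSetDevice(0);
        cudaMalloc((void**)&d0, bytes);
        if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // direct access from device 0 to device 1
        cudaSetDevice(1);
        cudaMalloc((void**)&d1, bytes);

        // Copy straight from device 1 to device 0 without staging through host memory
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);
        return 0;
    }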