
CUDA Basics for GPU Parallel Programming

29 flashcards

Thread Hierarchy

CUDA organizes threads in a hierarchy, including threads, blocks, warps, and grids. Threads within the same block can communicate via shared memory, while different blocks are generally independent.
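
A minimal sketch (not part of the original card) of how a kernel typically turns its position in this hierarchy into a global index; the kernel name scale and the buffer d_data are hypothetical:

__global__ void scale(float *data, float factor, int n) {
    // blockIdx, blockDim, and threadIdx come from the thread hierarchy.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launch: a 1D grid of 1D blocks, 256 threads per block.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);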

Managed Memory

Managed memory in CUDA refers to the Unified Memory feature, which provides a single memory space accessible from both the CPU and GPU, simplifying data management between host and device.
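
An illustrative sketch of allocating and using Unified Memory (the kernel launch is elided and hypothetical):

int n = 1 << 20;
float *x;
cudaMallocManaged(&x, n * sizeof(float));   // one pointer, valid on host and device
for (int i = 0; i < n; ++i) x[i] = 1.0f;    // touch it on the host
// someKernel<<<blocks, threads>>>(x, n);   // pass the same pointer to a kernel
cudaDeviceSynchronize();                    // wait before reading results on the host
cudaFree(x);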

Constant Memory

Constant memory in CUDA is a small region of device memory that is read-only from kernels and cached on-chip, giving low-latency access when all threads in a warp read the same address. It is best used for data that does not change over the course of a kernel execution.
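
A small sketch of declaring and filling constant memory; the symbol name coeffs is hypothetical:

__constant__ float coeffs[16];              // device-side, read-only from kernels, cached

// Host side: copy the table into constant memory before launching kernels that read it.
float h_coeffs[16] = { /* ... */ };
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));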

Warp

A warp is a group of 32 threads in CUDA that execute the same instruction at the same time. Warps are crucial for achieving full efficiency on GPU architectures.

Shared Memory

Shared memory in CUDA refers to a small bank of memory on the GPU that is shared among threads within the same block, providing a method for threads to communicate and share data swiftly.
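
A sketch of block-level cooperation through shared memory, assuming blocks of 256 threads; blockSum, in, and out are hypothetical names:

__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                 // visible to every thread in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                            // make all writes to 'tile' visible
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int k = 0; k < blockDim.x; ++k) s += tile[k];
        out[blockIdx.x] = s;                    // one partial sum per block
    }
}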

CUDA Event API

The CUDA Event API allows for fine-grained performance measurement by letting developers record the time between events on the GPU, facilitating the optimization of CUDA applications.
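
A typical timing pattern with events (the kernel being timed is hypothetical):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// myKernel<<<grid, block>>>(...);          // work to be timed
cudaEventRecord(stop);
cudaEventSynchronize(stop);                 // block until 'stop' has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);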

What is CUDA?

Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model created by NVIDIA for general-purpose computing on graphics processing units (GPUs). It allows developers to use CUDA C/C++ to write software that runs directly on GPUs.

Atomic Operations

Atomic operations in CUDA are used to prevent data races when different threads attempt to read and write to the same memory location simultaneously. They ensure these operations are completed without interference.
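
For illustration, a histogram is a common case where many threads may update the same location; the names below are hypothetical:

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);      // safe even when many threads hit the same bin
}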

Grids and Blocks

In CUDA, a grid is the collection of thread blocks launched for a single kernel; blocks execute independently of one another. Both grids and blocks can be one-, two-, or three-dimensional, making it easier to map threads onto the problem space.

NVCC

NVIDIA's CUDA Compiler (NVCC) is a tool that compiles CUDA code into PTX (Parallel Thread Execution) instructions that can be executed on a CUDA-enabled GPU.

CUDA Kernels

In CUDA, kernels are functions that you define in the code which run in parallel on the GPU. They are launched with a specific number of threads, organized into blocks and grids.

PTX

Parallel Thread Execution (PTX) is an intermediate representation of CUDA programs that enables forward compatibility with future hardware by delaying the final machine code generation until runtime.

Compiler Optimizations

Compiler optimizations in CUDA involve NVCC transforming the CUDA code to enhance performance, such as loop unrolling, inlining functions, and optimizing memory accesses.

Register Pressure

Register pressure in CUDA programming occurs when a kernel has more live variables than available registers per thread, which can lead to register spilling to local memory and reduced performance.

Texture Memory

Texture memory in CUDA is read-only memory accessed through a dedicated cache optimized for spatial locality, which can reduce memory read transactions when threads access spatially localized data.

Global Memory

Global memory is the largest and slowest form of memory accessible to all threads in a CUDA program. It is off-chip and has high latency, so it should be accessed carefully to avoid bottlenecks.
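
A sketch of the usual global-memory workflow (hypothetical buffer names and sizes, error checking omitted):

float *d_buf;
cudaMalloc(&d_buf, bytes);                                // allocate global (device) memory
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // stage input
// myKernel<<<grid, block>>>(d_buf);
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // fetch results
cudaFree(d_buf);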

Memory Coalescing

Memory coalescing refers to the process of combining memory accesses by threads in a warp to reduce the number of transactions with the global memory, leading to improved performance.
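
Inside a kernel, the difference might look like this sketch (the kernel and buffer names are hypothetical):

__global__ void copyPatterns(const float *in, float *coalesced, float *strided, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        coalesced[i] = in[i];          // coalesced: adjacent threads read adjacent addresses,
                                       // so a warp's reads combine into few wide transactions
    if (i * 32 < n)
        strided[i] = in[i * 32];       // strided: adjacent threads read addresses 32 apart,
                                       // forcing many separate transactions per warp
}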

CUDA Streams

CUDA streams are sequences of operations that execute in order on the GPU. Different streams can execute operations concurrently, making it possible to overlap computations with memory transfers.
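
A sketch of two streams overlapping copies and kernels (sizes, buffers, and kernels are hypothetical; the host buffers are pinned so the copies can run asynchronously):

size_t bytes = 1 << 20;
float *h_a, *h_b, *d_a, *d_b;
cudaMallocHost(&h_a, bytes);  cudaMallocHost(&h_b, bytes);   // pinned host buffers
cudaMalloc(&d_a, bytes);      cudaMalloc(&d_b, bytes);

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Operations within one stream run in order; the two streams may overlap.
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1);
// kernelA<<<grid, block, 0, s1>>>(d_a);       // hypothetical kernel in stream s1
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s2);
// kernelB<<<grid, block, 0, s2>>>(d_b);       // hypothetical kernel in stream s2

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);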

Dynamic Parallelism

Dynamic parallelism in CUDA allows kernels to be launched from within other kernels running on the GPU, enabling more complex, nested parallel computations without returning to the CPU.
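
A minimal sketch with hypothetical parent/child kernels (dynamic parallelism requires compiling with relocatable device code, e.g. nvcc -rdc=true):

__global__ void child(float *data) { /* ... */ }

__global__ void parent(float *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0)
        child<<<(n + 255) / 256, 256>>>(data);   // launched from device code, not the CPU
}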

Streaming Multiprocessors (SMs)

In CUDA, Streaming Multiprocessors (SMs) are the fundamental computing units of an NVIDIA GPU, where the warps of threads are actually executed.

CUDA Occupancy

CUDA occupancy is the ratio of active warps to the maximum number of warps supported on an SM at any one time, affecting the utilization and efficiency of GPU resources.
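
The runtime can report how many blocks of a given configuration fit on an SM; a sketch with a hypothetical kernel myKernel and 256 threads per block:

int blocksPerSm = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, myKernel, 256, 0);
// blocksPerSm * 256 is the number of active threads per SM this configuration can sustain,
// to compare against the SM's hardware limit.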

CUDA Math Libraries

CUDA provides a suite of math libraries, such as cuBLAS and cuFFT, which offer optimized implementations of mathematical functions and algorithms for use on GPUs.
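
For example, a SAXPY call through cuBLAS might look like this sketch (d_x and d_y are hypothetical device buffers of length n):

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 2.0f;
cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha * x + y, computed on the device

cublasDestroy(handle);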

Kernel Launch Parameters

Kernel launch parameters in CUDA specify the execution configuration for a kernel, including the number of blocks, the number of threads per block, and the amount of shared memory to be allocated.
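
A sketch of an execution configuration using dim3 and a dynamic shared-memory size (the kernel and buffers are hypothetical):

dim3 grid(64, 64);                             // 64 x 64 blocks
dim3 block(16, 16);                            // 256 threads per block
size_t shmemBytes = 16 * 16 * sizeof(float);   // dynamic shared memory per block

// stencil<<<grid, block, shmemBytes>>>(d_in, d_out);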

Unified Memory Advancements

Unified Memory advancements in recent CUDA versions have improved support for automatic memory migration, coherency, and granularity, making programming for GPUs more similar to conventional programming.

Synchronization Functions

Synchronization functions in CUDA, such as __syncthreads(), allow threads within the same block to coordinate by waiting until all threads have reached the same point in the code.

Asynchronous Data Transfer

Asynchronous data transfer in CUDA allows data transfers to occur concurrently with kernel execution, minimizing idle time and improving the overall performance of the application.
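
True asynchronous copies require pinned (page-locked) host memory and a stream; a minimal sketch with hypothetical buffer names and sizes:

float *h_buf, *d_buf;
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMallocHost(&h_buf, bytes);               // pinned host memory, needed for async copies
cudaMalloc(&d_buf, bytes);

cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
// myKernel<<<grid, block, 0, stream>>>(d_buf);   // ordered after the copy within 'stream'
cudaStreamSynchronize(stream);
cudaFreeHost(h_buf);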

Error Handling

Error handling in CUDA involves checking the status of kernel launches and API calls, using functions such as cudaGetErrorString() to diagnose issues within CUDA applications.
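
A common pattern is a checking macro wrapped around every API call; this is one possible sketch, not a library-provided macro:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
// myKernel<<<grid, block>>>(d_ptr);
// CUDA_CHECK(cudaGetLastError());          // catches invalid launch configurations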

Launch Bounds

Launch bounds in CUDA (the __launch_bounds__ qualifier) declare the maximum number of threads per block and, optionally, a minimum number of resident blocks per SM for a kernel, which helps the compiler budget register usage and can increase occupancy.
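
A sketch of the qualifier on a hypothetical kernel:

// At most 256 threads per block, and aim for at least 2 resident blocks per SM,
// so the compiler can budget registers accordingly.
__global__ void __launch_bounds__(256, 2) heavyKernel(float *data) {
    // ...
}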

Peer-to-Peer Memory Copy

Peer-to-peer memory copy allows direct memory copies between GPUs without involving the CPU, reducing overhead and improving transfer performance between multiple GPUs.
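
A sketch of enabling and using peer-to-peer copies between devices 0 and 1 (buffer names and sizes are hypothetical):

int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);     // can device 0 access device 1's memory?
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);          // flags argument must be 0
    cudaMemcpyPeer(d_dst_on_0, 0, d_src_on_1, 1, bytes);   // GPU-to-GPU, no host staging
}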
