CUDA Basics for GPU Parallel Programming
Thread Hierarchy
CUDA organizes threads in a hierarchy: threads are grouped into blocks, and blocks into a grid; in hardware, threads are executed in warps of 32. Threads within the same block can communicate via shared memory, while different blocks execute independently of one another.
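A minimal sketch of how a kernel combines the hierarchy's built-in variables into a unique global index (the kernel name and buffer are illustrative, not from this set):

    __global__ void scale(float *data, int n) {
        // One global index per thread: block offset plus thread offset
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }
    // Launch: a grid of blocks, each block a group of threads, e.g.
    // scale<<<(n + 255) / 256, 256>>>(d_data, n);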
Managed Memory
Managed memory in CUDA refers to the Unified Memory feature, which provides a single memory space accessible from both the CPU and GPU, simplifying data management between host and device.
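A minimal sketch of Unified Memory in use, assuming a single-GPU system (kernel and variable names are illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void addOne(int *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1;
    }

    int main() {
        int n = 1024;
        int *x;
        cudaMallocManaged((void**)&x, n * sizeof(int)); // one pointer usable on host and device
        for (int i = 0; i < n; ++i) x[i] = i;           // initialize on the CPU
        addOne<<<(n + 255) / 256, 256>>>(x, n);         // use the same pointer on the GPU
        cudaDeviceSynchronize();                        // wait before touching the data on the CPU again
        printf("%d\n", x[0]);
        cudaFree(x);
        return 0;
    }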
Constant Memory
Constant memory is a small (64 KB) read-only memory space in CUDA that is served through an on-chip cache; access is fastest when all threads in a warp read the same address. It is best used for data that does not change over the course of a kernel execution.
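A minimal sketch of declaring and filling a constant-memory table (names are illustrative):

    __constant__ float coeffs[16];                // lives in constant memory, read-only in kernels

    __global__ void applyCoeffs(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] *= coeffs[i % 16];      // all threads read the same small table
    }

    // Host side: copy data into the constant symbol before launching, e.g.
    // float h_coeffs[16] = { /* ... */ };
    // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));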
Warp
A warp is a group of 32 threads in CUDA that execute the same instruction at the same time (SIMT). Keeping the threads of a warp on the same execution path, avoiding branch divergence, is crucial for achieving full efficiency on GPU architectures.
Shared Memory
Shared memory in CUDA is a small region of fast, on-chip memory shared among the threads of a block, giving those threads a way to communicate and exchange data quickly.
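A minimal sketch of staging data in shared memory (the kernel is illustrative and assumes it is launched with 256 threads per block):

    __global__ void reverseBlock(int *data) {
        __shared__ int tile[256];                 // one copy per block, visible to all its threads
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;
        tile[t] = data[i];                        // stage a block's worth of data on-chip
        __syncthreads();                          // make sure every thread has written its element
        data[i] = tile[blockDim.x - 1 - t];       // read another thread's element from shared memory
    }
    // Launch with blockDim.x == 256 to match the tile size, e.g.
    // reverseBlock<<<numBlocks, 256>>>(d_data);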
CUDA Event API
The CUDA Event API allows fine-grained performance measurement: developers record events on the GPU timeline and measure the elapsed time between them, which facilitates the optimization of CUDA applications.
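A minimal sketch of timing a kernel with events (the placeholder kernel is illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void busyKernel() { }              // placeholder work to time

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);                   // mark a point on the GPU timeline
        busyKernel<<<1, 1>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);               // wait until the stop event has completed

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
        printf("kernel took %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }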
What is CUDA?
Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model created by NVIDIA for general-purpose computing on graphics processing units (GPUs). It lets developers write software in CUDA C/C++ that runs directly on the GPU.
Atomic Operations
Atomic operations in CUDA are used to prevent data races when different threads attempt to read and write to the same memory location simultaneously. They ensure these operations are completed without interference.
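A minimal sketch using atomicAdd so that concurrent updates to the same bin do not race (names are illustrative):

    __global__ void histogram(const unsigned char *in, unsigned int *bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[in[i]], 1u);          // safe even when many threads hit the same bin
    }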
Grids and Blocks
In CUDA, a grid is the collection of blocks launched for a single kernel; each block executes independently of the others. Both grids and blocks can be one-, two-, or three-dimensional, so they can map efficiently onto the problem space.
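A minimal sketch of a 2D execution configuration; width, height, and the kernel name are illustrative:

    // 2D problem (e.g. an image): 2D blocks tiled into a 2D grid
    dim3 block(16, 16);                           // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);  // enough blocks to cover width x height
    // processImage<<<grid, block>>>(d_img, width, height);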
NVCC
NVIDIA's CUDA Compiler (NVCC) separates device code from host code, compiling the device code into PTX (Parallel Thread Execution) and/or GPU machine code for CUDA-enabled GPUs and passing the host code to a standard C++ compiler.
CUDA Kernels
In CUDA, kernels are functions, marked with the __global__ qualifier, that run in parallel on the GPU. They are launched with a specific number of threads, organized into blocks and grids.
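A minimal sketch of a kernel and its launch (names are illustrative):

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];            // each thread handles one element
    }

    // Launched with an execution configuration <<<blocks, threadsPerBlock>>>, e.g.
    // int threads = 256;
    // int blocks  = (n + threads - 1) / threads;
    // vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);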
PTX
Parallel Thread Execution (PTX) is an intermediate representation of CUDA programs that enables forward compatibility with future hardware by delaying the final machine code generation until runtime.
Compiler Optimizations
Compiler optimizations in CUDA involve NVCC transforming the CUDA code to enhance performance, such as loop unrolling, inlining functions, and optimizing memory accesses.
Register Pressure
Register pressure in CUDA programming occurs when a kernel's live variables exceed the registers available per thread, potentially causing spills to local memory and reduced performance.
Texture Memory
Texture memory in CUDA is read-only memory accessed through a dedicated cache optimized for spatial locality; it can reduce memory read transactions when threads access spatially localized data.
Global Memory
Global memory is the largest and slowest form of memory accessible to all threads in a CUDA program. It is off-chip and has high latency, so it should be used carefully to avoid bottlenecks.
Memory Coalescing
Memory coalescing refers to combining the memory accesses of the threads in a warp into as few global-memory transactions as possible, leading to improved performance.
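A minimal sketch contrasting a coalesced access pattern with a strided one (kernel names are illustrative):

    // Coalesced: consecutive threads read consecutive addresses (few transactions)
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses far apart (many transactions)
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }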
CUDA Streams
CUDA streams are sequences of operations that execute in order on the GPU. Different streams can execute operations concurrently, making it possible to overlap computations with memory transfers.
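A minimal sketch of issuing independent work to two streams so copies and kernels can overlap (kernel and buffer names are illustrative):

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h_a, *h_b, *d_a, *d_b;
        cudaMallocHost((void**)&h_a, bytes);      // pinned host memory so copies can be asynchronous
        cudaMallocHost((void**)&h_b, bytes);
        cudaMalloc((void**)&d_a, bytes);
        cudaMalloc((void**)&d_b, bytes);

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Work in different streams may overlap; within a stream it runs in issue order
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
        scaleKernel<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);
        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
        scaleKernel<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);

        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        return 0;
    }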
Dynamic Parallelism
Dynamic parallelism in CUDA allows kernels to be launched from within other kernels running on the GPU, enabling more complex, nested parallel computations without returning to the CPU.
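A minimal sketch of a parent kernel launching a child kernel on the device (names are illustrative):

    __global__ void child(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    __global__ void parent(float *data, int n) {
        // Launch another kernel directly from the GPU, with no round trip to the CPU
        if (threadIdx.x == 0 && blockIdx.x == 0)
            child<<<(n + 255) / 256, 256>>>(data, n);
    }
    // Dynamic parallelism requires relocatable device code when compiling, e.g.
    // nvcc -rdc=true -lcudadevrt app.cu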
Streaming Multiprocessors (SMs)
In CUDA, Streaming Multiprocessors (SMs) are the fundamental computing units of an NVIDIA GPU, where the warps of threads are actually executed.
CUDA Occupancy
CUDA occupancy is the ratio of active warps to the maximum number of warps supported on an SM at any one time, and it affects the utilization and efficiency of GPU resources.
CUDA Math Libraries
CUDA provides a suite of math libraries, such as cuBLAS and cuFFT, which offer optimized implementations of mathematical functions and algorithms for use on GPUs.
Kernel Launch Parameters
Kernel launch parameters in CUDA specify the execution configuration for a kernel: the number of blocks, the number of threads per block, and optionally the amount of dynamic shared memory to allocate and the stream to launch in.
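A minimal sketch of the full execution configuration, with dynamic shared memory declared extern in the kernel (the kernel, stream, and buffers are illustrative):

    __global__ void smooth(float *data, int n) {
        extern __shared__ float buffer[];         // size is set by the launch, not at compile time
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buffer[threadIdx.x] = data[i];
        __syncthreads();
        if (i < n) data[i] = buffer[threadIdx.x];
    }

    // Third launch parameter sizes the dynamic shared memory; the fourth selects the stream
    // dim3 grid(64), block(256);
    // size_t sharedBytes = 256 * sizeof(float);
    // smooth<<<grid, block, sharedBytes, stream>>>(d_data, n);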
Unified Memory Advancements
Unified Memory advancements in recent CUDA versions have improved support for automatic memory migration, coherency, and granularity, making programming for GPUs more similar to conventional programming.
Synchronization Functions
Synchronization functions in CUDA, such as __syncthreads(), allow threads within the same block to coordinate by waiting until all threads have reached the same point in the code.
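A minimal sketch of __syncthreads() coordinating a block-level reduction (illustrative kernel, assuming 256 threads per block):

    __global__ void blockSum(const float *in, float *out) {
        __shared__ float partial[256];            // launch with 256 threads per block
        int t = threadIdx.x;
        partial[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();                          // all loads into shared memory are done

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s) partial[t] += partial[t + s];
            __syncthreads();                      // every thread finished this reduction step
        }
        if (t == 0) out[blockIdx.x] = partial[0]; // one result per block
    }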
Asynchronous Data Transfer
Asynchronous data transfer in CUDA allows copies to and from the device to overlap with kernel execution, provided pinned (page-locked) host memory and separate streams are used, minimizing idle time and improving the overall performance of the application.
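A minimal sketch of asynchronous copies with pinned host memory (buffer names and sizes are illustrative):

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;
        float *h_buf, *d_buf;
        cudaMallocHost((void**)&h_buf, bytes);    // pinned (page-locked) host memory is required
        cudaMalloc((void**)&d_buf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // These calls return immediately; the copies run while the CPU or other streams keep working
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);            // block the host until the stream's work is done

        cudaStreamDestroy(stream);
        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }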
Error Handling
Error handling in CUDA involves checking the return status of API calls and kernel launches, using functions such as cudaGetLastError() and cudaGetErrorString() to diagnose issues within CUDA applications.
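A minimal sketch of a common error-checking pattern (the macro name is illustrative, not a CUDA API):

    #include <cstdio>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess)                                       \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
        } while (0)

    // CUDA_CHECK(cudaMalloc((void**)&d_ptr, bytes)); // check API calls
    // myKernel<<<grid, block>>>(/* args */);
    // CUDA_CHECK(cudaGetLastError());                // check the kernel launch itself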
Launch Bounds
Launch bounds in CUDA (the __launch_bounds__ qualifier) tell the compiler the maximum number of threads per block a kernel will be launched with, and optionally the minimum number of resident blocks per multiprocessor, which helps it manage register usage and improve occupancy.
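A minimal sketch of a kernel annotated with launch bounds (the kernel name is illustrative):

    // At most 256 threads per block, and at least 2 resident blocks per SM;
    // the compiler uses this to cap register usage per thread.
    __global__ void __launch_bounds__(256, 2) tunedKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 0.5f;
    }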
Peer-to-Peer Memory Copy
Peer-to-peer memory copy allows direct memory copies between GPUs without staging through host memory, reducing overhead and improving memory transfer performance between multiple GPUs.
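A minimal sketch assuming a system with two GPUs (device numbers and buffer names are illustrative):

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);        // can device 0 access device 1's memory?

        float *d0, *d1;
        cudaSetDevice(0);
        cudaMalloc((void**)&d0, bytes);
        if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // direct access from device 0 to device 1
        cudaSetDevice(1);
        cudaMalloc((void**)&d1, bytes);

        // Copy straight from device 1 to device 0 without staging through host memory
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);
        return 0;
    }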