🏆 Chapter 10: The Ultimate CUDA Interview Quiz!
20 of the HARDEST real-world interview questions from NVIDIA, Google, Meta, Microsoft, Tesla, and top AI startups. This is your final boss: ace this and you're ready for any GPU engineering interview!
📋 Instructions
Answer all 20 questions. This comprehensive quiz covers EVERY chapter. Score 60% to earn your CUDA mastery certificate! 🎓
This covers ALL chapters, so review any weak areas. Focus on memory hierarchy, optimization, and parallel patterns.
⚠️ Try solving it yourself first; you'll learn more!
🧠 Quiz Time
1
An NVIDIA A100 GPU has 108 Streaming Multiprocessors (SMs). Each SM can schedule up to 4 warps simultaneously. If a kernel launches 27,648 threads with a block size of 256, how many blocks are created and what is the maximum number of blocks that can run concurrently if each SM supports up to 16 resident blocks?
A) 108 blocks; all 108 run concurrently
B) 108 blocks; only 64 run concurrently
C) 216 blocks; 108 run concurrently
D) 108 blocks; limited by register usage only
27,648 / 256 = 108 blocks. With 108 SMs and each SM supporting up to 16 resident blocks, all 108 blocks (1 per SM) can run concurrently. This is an ideal launch configuration that maps one block to each SM. In interviews, they test whether you can compute grid dimensions and understand SM-level scheduling limits.
2
You're profiling a CUDA kernel with Nsight Compute and see that L2 cache hit rate is 95% but global memory throughput is only 15% of peak. The kernel is memory-bound. What is the MOST LIKELY root cause?
A) Warp divergence is causing idle threads
B) Uncoalesced memory access patterns causing excessive memory transactions
C) Too many registers per thread reducing occupancy
D) Shared memory bank conflicts
High L2 hit rate with low global memory throughput is the classic signature of uncoalesced memory accesses. Each warp request generates multiple memory transactions instead of one, wasting bandwidth. Even though data is found in L2 cache (high hit rate), the access pattern forces many small transactions rather than few large ones. Warp divergence (A) affects compute, not memory throughput directly. Register pressure (C) would show low occupancy. Bank conflicts (D) affect shared memory, not global memory throughput.
3
In CUDA's Unified Memory (managed memory), what happens when a GPU kernel accesses a page that is currently resident in CPU memory on a system with hardware page migration support (e.g., Pascal or newer)?
A) The kernel crashes with an illegal memory access error
B) A page fault triggers automatic migration of the page from CPU to GPU memory via the CUDA driver
C) The GPU reads directly from CPU memory over PCIe with no migration
D) The CUDA runtime pre-fetches all managed allocations before kernel launch
Starting with Pascal (Compute Capability 6.0+), CUDA supports hardware page faulting. When the GPU accesses a page not resident in GPU memory, a page fault is triggered and the CUDA driver automatically migrates that page from CPU to GPU memory. This is the key innovation of Unified Memory with on-demand page migration. Before Pascal, all managed memory was migrated in bulk before kernel launch. The runtime does NOT pre-fetch automatically (D); you must use cudaMemPrefetchAsync() for optimal performance. Zero-copy (C) is a different mechanism entirely.
4
A candidate writes this kernel for parallel reduction:
__global__ void reduce(float* data, float* result, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? data[i] : 0.0f;
    __syncthreads();
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) result[blockIdx.x] = sdata[0];
}
What is the PRIMARY performance problem with this reduction?
A) It uses too much shared memory
B) The modulo operator causes warp divergence โ active threads are interleaved rather than contiguous, so warps are never fully utilized
C) __syncthreads() is called too many times
D) The kernel has a race condition on sdata[]
This is the classic 'interleaved addressing' reduction, a favorite interview question. The problem is that `tid % (2*s) == 0` selects threads 0, 2, 4, 6... in the first iteration. These threads are spread across warps, so EVERY warp has some active and some inactive threads (divergence). The fix is 'sequential addressing': use `for (int s = blockDim.x/2; s > 0; s >>= 1)` with `if (tid < s) sdata[tid] += sdata[tid + s]`. This keeps active threads contiguous in the first warps, so later warps can exit entirely. There's no race condition (D) because __syncthreads() correctly separates reads and writes.
5
What is Dynamic Parallelism in CUDA, and what is the MAIN hardware requirement?
A) Launching kernels from within device code; requires Compute Capability 3.5 or higher
B) Automatically adjusting the number of threads at runtime; requires any CUDA GPU
C) Running multiple kernels concurrently using streams; requires Compute Capability 2.0+
D) Allocating shared memory dynamically via the third kernel launch parameter; requires any CUDA GPU
Dynamic Parallelism allows device code (GPU kernels) to launch new kernels without returning control to the host CPU. This enables recursive algorithms, adaptive mesh refinement, and workloads where the amount of parallel work is data-dependent. It requires Compute Capability 3.5+ (Kepler GK110 or newer). Option C describes streams/concurrency, not dynamic parallelism. Option D describes dynamic shared memory allocation via the third kernel launch parameter, which is a different feature entirely.
6
You're training a large language model using mixed-precision training with FP16. The training diverges after 10K steps. Which of the following is the MOST LIKELY fix?
A) Switch entirely to FP64 (double precision)
B) Implement loss scaling: multiply the loss by a large factor before backprop, then unscale gradients before the optimizer step
C) Reduce the batch size to 1
D) Disable gradient checkpointing
Mixed-precision training divergence is almost always caused by gradient underflow in FP16. Small gradient values (common in deep networks) fall below FP16's minimum representable value (~5.96e-8) and become zero. Loss scaling multiplies the loss by a large factor (e.g., 1024 or dynamic scaling), which scales up all gradients during backprop, keeping them in FP16's representable range. Gradients are then unscaled before the optimizer step (which uses FP32 master weights). This is the standard technique used in NVIDIA's Apex/AMP and PyTorch's torch.cuda.amp. Switching to FP64 (A) would waste enormous GPU resources and is never the right answer for training.
7
What is the purpose of Tensor Cores on NVIDIA GPUs and which data types do they natively support on Ampere (A100)?
A) Tensor Cores perform matrix multiply-accumulate operations; Ampere supports FP16, BF16, TF32, INT8, INT4, and FP64 on Tensor Cores
B) Tensor Cores are specialized cores for texture sampling; they support only FP32
C) Tensor Cores perform scalar multiply-add; they support FP16 only
D) Tensor Cores handle memory coalescing; they support all standard data types
Tensor Cores are specialized hardware units that perform matrix multiply-accumulate (MMA) operations: specifically D = A × B + C where A, B, C, D are matrix fragments. On Ampere (A100), Tensor Cores support: FP16 (half), BF16 (bfloat16), TF32 (TensorFloat-32, a 19-bit format), INT8, INT4, and even FP64. TF32 is particularly notable: it keeps the FP32 exponent range with a reduced mantissa, and on Ampere it was enabled by default for torch.matmul in earlier PyTorch releases, giving near-FP32 accuracy with up to 8x the FP32 throughput. This is a common NVIDIA interview question.
8
TRICK QUESTION: A kernel has 100% occupancy. Does this guarantee maximum performance?
A) Yes: 100% occupancy means all SMs are fully utilized, so performance is maximized
B) No: high occupancy can actually HURT performance if it reduces per-thread register or shared memory availability, and many kernels achieve peak performance at 50-75% occupancy
C) Yes: NVIDIA always recommends targeting 100% occupancy
D) No, but only because of warp divergence
This is one of the most famous CUDA interview trick questions. High occupancy does NOT guarantee high performance. Pushing for 100% occupancy often forces the compiler to spill registers to local memory (slow) or reduces available shared memory per block. Many kernels achieve peak performance at 50-75% occupancy because each thread has more registers and shared memory, enabling better data reuse and less memory traffic. The Occupancy Calculator is a guide, not a target. Volkov's famous GTC talk 'Better Performance at Lower Occupancy' demonstrated this principle. Always profile with Nsight Compute to find the optimal occupancy for YOUR specific kernel.
9
What is the difference between zero-copy memory (cudaHostAllocMapped) and Unified Memory (cudaMallocManaged)?
A) They are identical โ both terms refer to the same CUDA feature
B) Zero-copy memory is accessed by the GPU directly over PCIe without migration; Unified Memory uses page migration to move data to the accessing processor's physical memory
C) Unified Memory is accessed over PCIe; zero-copy memory uses page migration
D) Zero-copy requires Compute Capability 6.0+; Unified Memory works on all GPUs
Zero-copy memory (pinned host memory mapped into GPU address space) allows the GPU to access host memory directly over the PCIe/NVLink bus WITHOUT migrating the data. Every GPU access generates a PCIe transaction: great for small, infrequent accesses but terrible for bandwidth-intensive kernels. Unified Memory (cudaMallocManaged) physically migrates pages between CPU and GPU memory on demand (page faulting on CC 6.0+) or in bulk (pre-CC 6.0). Once migrated, GPU accesses go to fast HBM. Zero-copy has been available since CC 1.0+; Unified Memory since CC 3.0 (with on-demand migration since CC 6.0).
10
You see this CUDA error: 'an illegal memory access was encountered' (error code 700). The kernel launched successfully but the error appears after cudaDeviceSynchronize(). What is the BEST debugging approach?
A) Recompile with -O0 and hope the error goes away
B) Run with compute-sanitizer (cuda-memcheck) to identify the exact thread and memory address causing the out-of-bounds or misaligned access
C) Add printf statements inside the kernel and recompile
D) Ignore it; error 700 is non-fatal
Error 700 (cudaErrorIllegalAddress) means a thread accessed invalid GPU memory: out-of-bounds, misaligned, or freed memory. Since kernel launches are asynchronous, the error surfaces at the next synchronization point. The BEST first step is `compute-sanitizer --tool memcheck ./your_program` (or the legacy `cuda-memcheck`). This tool instruments every memory access and reports the exact kernel, thread (blockIdx, threadIdx), and memory address causing the violation. Printf debugging (C) is unreliable because the illegal access may crash the kernel before printf output is flushed. Error 700 is absolutely fatal (D): it corrupts the CUDA context.
11
What is NVLink and how does its bandwidth compare to PCIe Gen4 x16 for GPU-to-GPU communication?
A) NVLink is a proprietary high-bandwidth interconnect; NVLink 3.0 (A100) provides 600 GB/s total bidirectional bandwidth vs PCIe Gen4 x16's ~32 GB/s per direction
B) NVLink provides the same bandwidth as PCIe but lower latency
C) NVLink is only used for CPU-GPU communication, not GPU-GPU
NVLink is NVIDIA's high-bandwidth GPU interconnect designed for multi-GPU systems. NVLink 3.0 on A100 provides 600 GB/s total bidirectional bandwidth (12 links × 50 GB/s per link), which is roughly 10x the bandwidth of PCIe Gen4 x16 (~32 GB/s per direction, 64 GB/s bidirectional). NVLink enables efficient multi-GPU training by dramatically reducing the communication bottleneck in allreduce operations. NVLink connects GPU-to-GPU directly (and GPU-to-NVSwitch), bypassing the CPU/PCIe entirely. This is critical knowledge for anyone working on distributed deep learning or HPC.
12
What are CUDA Graphs and what problem do they solve?
A) CUDA Graphs are a visualization tool for profiling kernels
B) CUDA Graphs capture a sequence of operations (kernels, memcpy) into a graph that can be launched with a single API call, eliminating per-launch CPU overhead for workloads with many small kernels
C) CUDA Graphs are data structures for graph algorithms like BFS/DFS
D) CUDA Graphs automatically parallelize sequential CPU code
CUDA Graphs solve the 'kernel launch overhead' problem. When a workload consists of many small kernels (common in deep learning inference), the CPU overhead of launching each kernel (~5-10 µs per launch) can dominate total execution time. CUDA Graphs let you capture an entire sequence of operations (kernel launches, memory copies, dependencies) into a graph object during a 'capture' phase. The graph is then 'instantiated' into an executable, and replayed with a single cudaGraphLaunch() call. This reduces launch overhead to nearly zero and is a key optimization for inference at companies like NVIDIA TensorRT and Meta's AITemplate.
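A minimal sketch of the stream-capture workflow (requires CUDA 10+ and a GPU to run; `myKernel`, `blocks`, `threads`, `data`, and `N` are placeholders):

```cuda
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStream_t stream;
cudaStreamCreate(&stream);

// Capture phase: launches on this stream are recorded, not executed.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
for (int i = 0; i < 500; ++i)
    myKernel<<<blocks, threads, 0, stream>>>(data, N);
cudaStreamEndCapture(stream, &graph);

// Instantiate once, then replay with a single cheap launch per request.
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```

The expensive instantiation happens once during warmup; every subsequent request pays only one cudaGraphLaunch() instead of 500 individual launches.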
13
In the CUDA compilation model, what is PTX and how does it relate to SASS?
A) PTX is the final machine code; SASS is an intermediate representation
B) PTX (Parallel Thread Execution) is a virtual ISA that is forward-compatible; SASS is the actual GPU machine code specific to a GPU architecture. nvcc compiles CUDA C++ → PTX → SASS
C) PTX and SASS are both names for the same GPU machine code
D) PTX is a profiling tool; SASS is the shader assembly language
PTX is NVIDIA's virtual instruction set architecture: an intermediate representation that is forward-compatible across GPU generations. SASS (Streaming ASSembler) is the actual native machine code that runs on a specific GPU architecture. The compilation pipeline is: CUDA C++ → (nvcc frontend) → PTX → (ptxas assembler) → SASS. When you use `-gencode arch=compute_80,code=sm_80`, `compute_80` specifies the PTX virtual architecture and `sm_80` specifies the SASS target. Embedding PTX in your binary (with `code=compute_80`) enables JIT compilation for future architectures; this is CUDA Forward Compatibility. Understanding this pipeline is essential for NVIDIA interviews.
14
You have a kernel that processes a 2D image of size 4096×4096. You launch it with block size (32, 32) = 1024 threads per block. The kernel accesses a 2D array row-major in global memory as: float val = image[row * width + col]; where row = blockIdx.y * blockDim.y + threadIdx.y and col = blockIdx.x * blockDim.x + threadIdx.x. Is this access pattern coalesced?
A) No โ 2D blocks always cause uncoalesced accesses
B) Yes โ consecutive threads in the x-dimension (which form a warp) access consecutive memory addresses because col varies with threadIdx.x, and row*width is constant within a warp row
C) No โ row-major layout is never coalesced on GPUs
D) Yes โ but only if the width is a power of 2
This is perfectly coalesced! The key insight: threads in a warp are numbered by threadIdx.x varying fastest (in a 2D block, warp 0 has threadIdx.y=0, threadIdx.x=0..31). For these 32 threads in a warp, `row` is the same (same threadIdx.y) and `col` is threadIdx.x = 0, 1, 2, ..., 31. So they access image[row*4096 + 0], image[row*4096 + 1], ..., image[row*4096 + 31]: 32 consecutive floats = 128 bytes = one perfectly coalesced transaction. This is WHY we conventionally map x to columns in row-major layouts. Width being a power of 2 can actually cause partition camping (D is wrong for the stated reason).
15
What is NCCL and why is it preferred over MPI for multi-GPU deep learning training?
A) NCCL (NVIDIA Collective Communications Library) provides GPU-optimized collective operations (allreduce, broadcast, etc.) that directly use NVLink/NVSwitch and PCIe topology awareness, avoiding unnecessary GPU→CPU→GPU copies that MPI would require
B) NCCL is a CPU networking library that replaces TCP/IP
C) NCCL is identical to MPI but with a different name
D) NCCL is only for single-GPU machines to communicate between CUDA streams
NCCL is purpose-built for GPU-to-GPU collective communication. Unlike MPI (which was designed for CPU clusters), NCCL operations work directly on GPU buffers using NVLink, NVSwitch, PCIe, and InfiniBand without staging through CPU memory. NCCL is topology-aware: it detects the physical interconnect layout (which GPUs share NVLink, which are on the same PCIe switch) and selects optimal communication algorithms (ring, tree, etc.). This makes it dramatically faster for operations like allreduce during distributed training. PyTorch's DistributedDataParallel with the NCCL backend is the standard for multi-GPU training at scale at companies like Meta, Google, and Microsoft.
16
What is the maximum number of threads per block on current NVIDIA GPUs, and what happens if you exceed it?
A) 512 threads; the extra threads are silently dropped
B) 1024 threads; the kernel launch fails silently and no work is done; you must check cudaGetLastError() to detect this
C) 2048 threads; the GPU automatically splits the block
D) There is no limit โ the hardware handles any block size
Since Compute Capability 2.0, the maximum threads per block is 1024. If you launch a kernel with more than 1024 threads per block (e.g., dim3(32, 33) = 1056), the launch SILENTLY fails. The kernel simply doesn't execute. This is one of CUDA's most dangerous pitfalls: no exception is thrown, no crash occurs. You MUST check `cudaGetLastError()` after every kernel launch in production code. This question tests both knowledge of hardware limits AND CUDA error handling best practices. Many bugs in production code come from unchecked kernel launch failures.
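A common defensive pattern is a checking macro used after every launch (a sketch; `CUDA_CHECK` is a hypothetical helper name, and `kernel`, `grid`, `args` are placeholders):

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a readable message on any CUDA error.
#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err = (call);                             \
        if (err != cudaSuccess) {                             \
            fprintf(stderr, "CUDA error %d: %s\n",            \
                    (int)err, cudaGetErrorString(err));       \
            exit(EXIT_FAILURE);                               \
        }                                                     \
    } while (0)

kernel<<<grid, dim3(32, 33)>>>(args);   // 32*33 = 1056 threads: invalid!
CUDA_CHECK(cudaGetLastError());         // catches the failed launch itself
CUDA_CHECK(cudaDeviceSynchronize());    // catches errors during execution
```

The two checks catch different failure classes: cudaGetLastError() surfaces launch-configuration errors immediately, while the synchronize surfaces errors (like error 700) that occur while the kernel runs.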
17
Cooperative Groups is a CUDA programming model extension. What capability does it provide that __syncthreads() alone cannot?
A) Cooperative Groups allows synchronization across ALL thread blocks in a grid (grid-level sync), not just within a single block
B) Cooperative Groups only provides warp-level synchronization
C) Cooperative Groups replaces CUDA streams
D) Cooperative Groups is a host-side API for synchronizing CPU threads
The fundamental limitation of __syncthreads() is that it only synchronizes threads within a single block. Cooperative Groups extends CUDA with flexible grouping and synchronization primitives including: thread_block (equivalent to __syncthreads()), tiled_partition (sub-block groups, e.g., warp-level), and critically, grid_group for synchronizing ALL blocks in a grid. Grid-level sync (cooperative_groups::this_grid().sync()) enables algorithms that need a global barrier, like certain iterative solvers, without launching separate kernels. It requires cooperative kernel launch via cudaLaunchCooperativeKernel() and that all blocks can be resident simultaneously on the GPU.
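A minimal sketch of a grid-wide barrier inside one kernel (requires launching via cudaLaunchCooperativeKernel() with a grid that fits on the GPU at once; the kernel name and two-phase computation are illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Phase 1 writes `in`; the grid-wide barrier guarantees phase 2 can
// safely read any element of `in` written by ANY block.
__global__ void twoPhase(float* in, float* out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) in[i] *= 2.0f;                        // phase 1
    grid.sync();                                     // ALL blocks reach here
    if (i < n)
        out[i] = in[i] + (i > 0 ? in[i - 1] : 0.0f); // phase 2: cross-block read
}
```

Without the grid-wide sync, this would need two separate kernel launches, since `in[i - 1]` may be written by a different block than the one reading it.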
18
In production CUDA applications, what is a memory pool (cudaMemPool) and why would you use one instead of calling cudaMalloc/cudaFree directly?
A) Memory pools are only used for Unified Memory management
B) Memory pools pre-allocate and cache GPU memory, allowing cudaMallocAsync/cudaFreeAsync to reuse allocations from the pool without expensive OS/driver calls, dramatically reducing allocation overhead in workloads with many dynamic allocations
C) Memory pools allocate CPU memory only
D) Memory pools are a debugging tool that tracks memory leaks
cudaMalloc is expensive (~1ms) because it involves driver calls and potentially OS-level memory mapping. In production workloads like deep learning frameworks (PyTorch's CUDACachingAllocator, TensorFlow's BFC allocator), thousands of temporary buffers are allocated per iteration. Memory pools (introduced in CUDA 11.2) maintain a cache of freed GPU memory blocks. cudaMallocAsync/cudaFreeAsync return memory to the pool instead of the OS, so subsequent allocations can be satisfied from cached blocks with negligible overhead. Pools also support stream-ordered allocation, ensuring correctness in async workflows. PyTorch's memory allocator is essentially a sophisticated memory pool; understanding this is critical for performance debugging at AI companies.
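A sketch of the stream-ordered allocation pattern (requires CUDA 11.2+ and a GPU; `myKernel`, `blocks`, `threads`, `bytes`, and `num_iters` are placeholders):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

for (int iter = 0; iter < num_iters; ++iter) {
    float* tmp;
    // After a warmup iteration, this is served from the cached pool,
    // with no driver/OS allocation on the critical path.
    cudaMallocAsync((void**)&tmp, bytes, stream);
    myKernel<<<blocks, threads, 0, stream>>>(tmp);
    cudaFreeAsync(tmp, stream);   // returns the block to the pool, in order
}
cudaStreamSynchronize(stream);
```

Because both the allocation and the free are ordered on the stream, the kernel is guaranteed to see a valid buffer even though everything is asynchronous from the host's point of view.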
19
A candidate claims: 'Shared memory is always faster than global memory.' Is this statement TRUE or FALSE, and why?
A) TRUE โ shared memory is on-chip SRAM and always faster than off-chip DRAM
B) FALSE โ shared memory with heavy bank conflicts can be slower than coalesced global memory reads that hit in L1/L2 cache
C) TRUE โ shared memory has no latency
D) FALSE โ shared memory is actually slower than global memory on Ampere GPUs
This is a nuanced interview question. While shared memory IS on-chip SRAM with ~20-30 cycle latency vs global memory's ~200-800 cycle latency, bank conflicts can serialize shared memory accesses. A 32-way bank conflict turns a single-cycle access into 32 serial accesses (~640 cycles worst case). Meanwhile, global memory reads that hit in L1 cache (~30 cycles on Ampere) or L2 cache (~200 cycles) with perfect coalescing can be very fast. In practice, poorly-used shared memory (heavy bank conflicts, no data reuse) can be slower than well-accessed global memory via caches. The correct answer is: shared memory is faster WHEN used correctly, with minimal bank conflicts and significant data reuse to amortize the load cost.
20
You're optimizing a production inference pipeline. The model runs 500 small CUDA kernels per inference request. Each kernel takes ~10µs of GPU time, but total wall-clock time is 15ms instead of the expected 5ms. CPU utilization is high. What is the MOST effective optimization?
A) Increase the GPU clock speed
B) Use CUDA Graphs to capture the entire inference pipeline and replay it with a single launch, eliminating per-kernel CPU launch overhead
C) Merge all 500 kernels into a single mega-kernel manually
D) Switch to a CPU-only implementation
The math: 500 kernels × 10µs = 5ms of GPU work, but wall-clock is 15ms; the extra 10ms is CPU overhead from launching 500 kernels (~20µs CPU overhead per launch × 500 = 10ms). This is the classic 'launch overhead bound' scenario common in inference pipelines. CUDA Graphs capture the entire sequence of 500 kernels into an executable graph during a warmup pass, then replay it with a single cudaGraphLaunch() call. This reduces total launch overhead to ~30-50µs for the entire graph. Manually merging kernels (C) is technically possible but extremely labor-intensive and error-prone. CUDA Graphs achieve most of the benefit with minimal code changes; this is exactly what TensorRT and PyTorch's torch.cuda.CUDAGraph do in production.