15 deep questions on CUDA's memory hierarchy, the #1 most tested topic in GPU programming interviews. Master this and you'll ace any memory-related question!
📋 Instructions
Answer all 15 questions about CUDA memory types, their properties, and when to use each.
Key hierarchy: Registers (fastest) → Shared → L1/L2 Cache → Global (slowest). Shared = per-block, Global = per-grid.
⚠️ Try solving it yourself first: you'll learn more!
# This is a quiz exercise - use the MCQ interface above!
🧠 Quiz Time
0 / 15 answered
1
Which of the following correctly orders CUDA memory types from FASTEST to SLOWEST access latency?
A) Shared → Registers → L2 Cache → Global
B) Registers → Shared → L1/L2 Cache → Global
C) L1 Cache → Registers → Shared → Global
D) Registers → Global → Shared → Constant
The CUDA memory hierarchy from fastest to slowest is: Registers (~1 cycle, per-thread) → Shared memory (~1-3 cycles, per-block) → L1/L2 Cache (~30-100 cycles) → Global memory (~400-800 cycles). Registers are the fastest because they are on-chip storage directly accessible by the ALU. Shared memory is also on-chip but requires a few cycles for bank addressing. Global memory resides in off-chip DRAM, making it orders of magnitude slower.
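To make the levels concrete, here is a minimal sketch (names and launch configuration are illustrative, error handling omitted) of a kernel that touches all three programmer-visible levels:

```cuda
// Each memory reference below lands at a different level of the hierarchy.
__global__ void hierarchyDemo(const float* in, float* out, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float r = in[i] * scale;        // 'r' lives in a register (~1 cycle);
                                    // in[i] itself is a global DRAM read
    __shared__ float tile[256];     // on-chip, per-block (~1-3 cycles)
    tile[threadIdx.x] = r;
    __syncthreads();

    out[i] = tile[threadIdx.x];     // write back to global memory
}                                   // (~400-800 cycles)
```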
2
A kernel needs a 2D lookup table (256 × 256 floats) that every thread in every block reads but never modifies. Which memory type is the BEST choice?
A) Global memory with no caching
B) Shared memory declared in each block
C) Constant memory with __constant__ qualifier
D) Texture memory bound to a CUDA array
Texture memory is ideal here. The table is 256 × 256 × 4 bytes = 256 KB, which exceeds the 64 KB constant memory limit, ruling out option C. Shared memory (option B) is limited to 48-164 KB per block and would require every block to redundantly load the data. Texture memory provides read-only caching optimized for 2D spatial locality, making it perfect for 2D lookup tables accessed by all threads across all blocks. The texture cache efficiently serves this spatial access pattern.
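As a sketch of how this might look with the texture object API (error checks omitted; `hostTable` and the launch configuration are assumed to exist):

```cuda
#include <cuda_runtime.h>

// Kernel: reads go through the texture cache, optimized for 2D locality.
__global__ void lookupKernel(cudaTextureObject_t table, float* out, int w) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * w + x] = tex2D<float>(table, x, y);
}

// Host side: copy the table into a CUDA array, then create a texture object.
cudaTextureObject_t makeTable(const float* hostTable) {
    cudaArray_t arr;
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &fmt, 256, 256);
    cudaMemcpy2DToArray(arr, 0, 0, hostTable, 256 * sizeof(float),
                        256 * sizeof(float), 256, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc tex = {};
    tex.readMode = cudaReadModeElementType;

    cudaTextureObject_t table;
    cudaCreateTextureObject(&table, &res, &tex, nullptr);
    return table;
}
```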
3
What is the typical size limit of __constant__ memory in CUDA, and how is it accessed efficiently?
A) 256 KB, cached in L2 only
B) 64 KB, cached in a dedicated constant cache that can broadcast a single value to all threads in a warp simultaneously
C) 48 KB, stored in shared memory banks
D) 1 MB, accessed through the texture cache pipeline
CUDA constant memory is limited to 64 KB per device and is cached through a dedicated constant cache. Its key performance feature is the broadcast mechanism: when all threads in a warp read the SAME address, the value is broadcast to all threads in a single transaction (~1 cycle from cache). However, if threads in a warp access DIFFERENT addresses, the accesses are serialized. This makes constant memory ideal for coefficients, configuration values, or lookup tables where all threads need the same value simultaneously.
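A minimal sketch of the usage pattern described above (filter size and names are illustrative):

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[16];   // lives in the 64 KB constant memory space

__global__ void applyFilter(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    // Every thread in the warp reads the SAME coeffs[k] each iteration,
    // so each read is a single broadcast from the constant cache.
    for (int k = 0; k < 16; ++k)
        acc += coeffs[k] * in[i];
    out[i] = acc;
}

// Host side: constant memory is populated with cudaMemcpyToSymbol, e.g.
//   cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
```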
4
What happens when a CUDA kernel uses more register variables than the GPU's per-thread register limit?
A) The compiler throws a compilation error
B) Excess variables 'spill' into local memory, which physically resides in global memory (DRAM) and is very slow
C) Excess variables are automatically placed in shared memory
D) The kernel's block size is automatically reduced by the runtime
When a thread needs more registers than are available, the compiler performs 'register spilling': excess variables are stored in local memory. Despite the name 'local,' this memory physically resides in off-chip global memory (DRAM) with the same ~400-800 cycle latency. Local memory is per-thread and private, but it's slow because it goes through the global memory path (though it benefits from L1/L2 caching on modern architectures). Register spilling is a major performance concern and can be monitored with the --ptxas-options=-v compiler flag.
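A sketch of a kernel that will almost certainly spill (a 512-byte per-thread array far exceeds a typical per-thread register budget):

```cuda
// The scratch array is private per-thread, but too large for registers,
// so the compiler places it in "local" memory, which is physically DRAM.
__global__ void spills(float* out, int n) {
    float scratch[128];                              // 512 bytes per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 128; ++k) scratch[k] = (float)(i * k);
    float sum = 0.0f;
    for (int k = 0; k < 128; ++k) sum += scratch[k];
    if (i < n) out[i] = sum;
}

// Compile with: nvcc --ptxas-options=-v spill.cu
// ptxas then reports lines like "N bytes spill stores, M bytes spill loads".
```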
5
Shared memory on NVIDIA GPUs is organized into banks. On modern architectures (Volta and later), how many banks does shared memory have, and what causes a bank conflict?
A) 16 banks; a conflict occurs when two threads in different warps access the same bank
B) 32 banks; a conflict occurs when two or more threads in the same warp access different addresses in the same bank
C) 64 banks; a conflict occurs when any two threads access adjacent addresses
NVIDIA GPUs from Fermi onward have 32 shared memory banks. A bank conflict occurs when two or more threads within the SAME warp access different addresses that map to the same bank, forcing serialization. If multiple threads access the EXACT same address in a bank, it triggers a broadcast (no conflict). Consecutive 4-byte words map to consecutive banks, so stride-1 access (threads[i] accessing element[i]) is conflict-free. Stride-32 access causes maximum 32-way bank conflicts.
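The classic place to see (and fix) this is a 32×32 shared tile read column-wise. A sketch assuming a single 32×32 thread block, simplified to one tile:

```cuda
__global__ void transposeTile(const float* in, float* out) {
    // 33 columns, not 32: the padding column shifts each row by one bank,
    // so the column-wise read below touches 32 DIFFERENT banks.
    __shared__ float tile[32][33];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];     // stride-1 write: conflict-free
    __syncthreads();
    out[x * 32 + y] = tile[x][y];    // column read: 32-way conflict with
}                                    // tile[32][32], conflict-free with padding
```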
6
What is the key difference between cudaMalloc() and cudaMallocManaged()?
A) cudaMalloc() is for host memory; cudaMallocManaged() is for device memory
B) cudaMalloc() allocates device-only memory requiring explicit transfers; cudaMallocManaged() allocates Unified Memory accessible from both host and device with automatic page migration
C) cudaMallocManaged() is always faster because it uses pinned memory internally
D) cudaMalloc() allocates global memory; cudaMallocManaged() allocates shared memory
cudaMalloc() allocates memory exclusively on the GPU. Moving data between host and device requires explicit cudaMemcpy() calls. cudaMallocManaged() allocates Unified Memory (a single pointer accessible from both CPU and GPU code) with automatic page migration handled by the CUDA runtime. While Unified Memory simplifies programming, explicit cudaMalloc() + cudaMemcpy() often yields better performance because the programmer has full control over data movement and can overlap transfers with computation using streams.
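The two styles side by side, as a sketch (error checks omitted; `kernel` is an assumed existing kernel):

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float* data, int n);   // assumed to exist

void explicitPath(const float* h_in, float* h_out, int n) {
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));               // device-only memory
    cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    kernel<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}

void managedPath(int n) {
    float* buf;
    cudaMallocManaged(&buf, n * sizeof(float));  // one pointer, both sides
    for (int i = 0; i < n; ++i) buf[i] = (float)i;  // CPU writes directly
    kernel<<<(n + 255) / 256, 256>>>(buf, n);       // pages migrate on demand
    cudaDeviceSynchronize();                        // wait before CPU reads
    cudaFree(buf);
}
```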
7
What is 'memory coalescing' in CUDA, and why is it critical for performance?
A) Combining multiple shared memory allocations into one large block
B) When threads in a warp access consecutive global memory addresses, the hardware merges them into fewer wide memory transactions (e.g., 128 bytes), dramatically increasing effective bandwidth
C) The process of defragmenting GPU global memory at runtime
D) Compressing data in constant memory to fit within the 64 KB limit
Memory coalescing is the GPU hardware's ability to combine individual memory accesses from threads in a warp into a minimal number of wide memory transactions. When 32 threads access 32 consecutive 4-byte addresses (128 bytes total), the hardware issues a single 128-byte transaction. Without coalescing (e.g., strided or random access), each thread's request may need a separate transaction, wasting bandwidth by up to 32×. This is the single most important optimization for global memory performance and is frequently tested in GPU programming interviews.
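The contrast in access patterns, sketched as two kernels over the same data:

```cuda
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 4-byte reads merge into one 128-byte transaction.
__global__ void coalescedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: with stride 32, each thread in a warp touches a different
// 128-byte segment, so the warp can require up to 32 separate transactions.
__global__ void stridedCopy(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```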
8
A senior engineer at NVIDIA asks: 'Your kernel has 256 threads per block, each needing a private array of 64 floats. Where are these arrays stored?'
A) In shared memory, since they're declared inside the kernel
B) In registers if the compiler can optimize, otherwise in local memory (which physically resides in global DRAM)
C) In constant memory, since each thread has its own constant copy
D) In the L1 cache exclusively, bypassing all other memory
Per-thread arrays declared inside a kernel go into either registers or local memory. With 64 floats (256 bytes) per thread and 256 threads, that is 64 KB per block in registers alone, likely exceeding register limits. The compiler will spill these arrays into local memory. Local memory is private per-thread but physically resides in global memory (DRAM) with ~400-800 cycle latency. It benefits from L1/L2 caching, but is still much slower than registers or shared memory. The fix: use shared memory explicitly, or reduce the per-thread array size.
9
What is pinned (page-locked) memory in CUDA, and what function allocates it?
A) cudaMalloc() allocates pinned memory on the GPU that cannot be swapped
B) cudaMallocHost() or cudaHostAlloc() allocates host memory that is page-locked, enabling faster DMA transfers and overlap with kernel execution via streams
C) cudaMallocManaged() pins memory on both host and device simultaneously
D) malloc() with the CUDA_PINNED flag allocates pinned host memory
Pinned (page-locked) memory is host memory that the OS cannot page out to disk. It is allocated with cudaMallocHost() or cudaHostAlloc(). Benefits include: (1) faster DMA transfers since the GPU can access physical addresses directly without CPU staging from pageable buffers, (2) enabling asynchronous transfers with cudaMemcpyAsync for overlap with kernel execution via CUDA streams, and (3) with the cudaHostAllocMapped flag, GPU can directly access pinned host memory (zero-copy). Caution: excessive pinned allocation reduces system memory available for paging.
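A sketch of the allocation and the async-transfer pattern it enables (`kernel` is an assumed existing kernel; error checks omitted):

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float* data, int n);   // assumed to exist

void pipeline(int n) {
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));   // page-locked host memory
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);
    // cudaMemcpyAsync is only truly asynchronous with pinned memory;
    // with pageable memory it falls back to a staged, synchronous copy.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    kernel<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}
```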
10
An A100 GPU has ~2 TB/s HBM bandwidth. If a kernel reads 4 bytes per thread and you launch 1,000,000 threads, what is the minimum theoretical transfer time?
A) ~2 nanoseconds
B) ~2 microseconds
C) ~2 milliseconds
D) ~20 milliseconds
Total data = 1,000,000 × 4 bytes = 4 MB = 4 × 10^6 bytes. Bandwidth = 2 TB/s = 2 × 10^12 bytes/s. Time = Data / Bandwidth = 4 × 10^6 / 2 × 10^12 = 2 × 10^-6 seconds = 2 microseconds. This demonstrates that modern GPUs have enormous bandwidth, but for small transfers the latency overhead can dominate. For just 4 MB, the kernel launch overhead (~5-10 μs) may actually exceed the data transfer time.
11
Which statement about CUDA L1 and L2 caches is TRUE?
A) L1 cache is shared across all SMs; L2 cache is private to each SM
B) L1 cache is private to each SM and shares physical on-chip SRAM with shared memory; L2 cache is shared across all SMs
C) Both L1 and L2 are programmer-managed and must be explicitly loaded via intrinsics
D) L1 cache only caches constant memory; L2 only caches texture memory
On modern CUDA GPUs (Volta+): the L1 cache is private to each Streaming Multiprocessor (SM) and shares the same on-chip SRAM with shared memory (the split can be configured). The L2 cache is a unified cache shared by ALL SMs on the GPU (e.g., 40 MB on A100). Both are hardware-managed (transparent to the programmer), though CUDA offers hints like __ldg() for read-only caching and L2 persistence controls on Ampere and newer architectures.
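For instance, the preferred L1/shared split can be hinted per kernel on Volta and later (a sketch; `myKernel` is a placeholder name):

```cuda
// Request roughly half of the unified on-chip SRAM as shared memory for
// this kernel. The carveout is a percentage and is treated as a hint by
// the driver, not a guarantee.
cudaFuncSetAttribute(myKernel,
                     cudaFuncAttributePreferredSharedMemoryCarveout, 50);

// Read-only global loads can also be routed through the read-only path:
//   float v = __ldg(&ptr[i]);
```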
12
In a CUDA kernel, you declare: __shared__ float tile[32][32]; How does this memory's lifetime and visibility work?
A) It persists for the entire application lifetime and is visible to all kernels
B) It is allocated when the block begins execution, visible to all threads in that block only, and deallocated when the block completes
C) It is allocated per-thread and only visible to the declaring thread
D) It is allocated per-warp and visible to all 32 threads in the warp
Shared memory declared with __shared__ has block-level scope and lifetime. It is allocated on-chip when a thread block is scheduled to an SM and is accessible by ALL threads within that block. Threads in different blocks cannot see each other's shared memory. It is deallocated when all threads in the block finish execution. This per-block isolation means shared memory is perfect for intra-block communication and data reuse, but inter-block communication requires global memory or cooperative groups.
13
A Google interviewer asks: 'You're implementing tiled matrix multiplication for 4096 × 4096 matrices. Why use shared memory for the tiles instead of relying on L1 cache?'
A) L1 cache doesn't exist on NVIDIA GPUs
B) Shared memory is programmer-managed, guaranteeing data stays on-chip; L1 is hardware-managed and may evict your data unpredictably, plus shared memory allows explicit synchronization between threads with __syncthreads()
C) Shared memory is larger than L1 cache on all GPU architectures
D) L1 cache cannot store floating-point data, only integer data
The key advantage of shared memory is deterministic control. Because the programmer explicitly loads data into shared memory, it is guaranteed to stay on-chip as long as needed. The L1 cache is hardware-managed: cached data can be evicted at any time under memory pressure, leading to unpredictable cache misses. Additionally, shared memory supports __syncthreads() for coordinated access between threads in a block, ensuring all threads see consistent tile data. For tiled matrix multiplication, this determinism and synchronization are essential for correctness and peak performance.
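The standard tiled structure, as a sketch (assumption: square N × N matrices with N a multiple of the tile size; launched with 32×32 blocks):

```cuda
#define TILE 32

__global__ void matmulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];   // staged tiles are guaranteed on-chip
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread stages one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // tiles fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done reading before next overwrite
    }
    C[row * N + col] = acc;
}
```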
14
What happens if all 32 threads in a warp read the SAME address from constant memory?
A) 32 separate memory transactions are issued, causing severe slowdown
B) A bank conflict occurs, serializing the accesses to 32 sequential reads
C) The value is fetched once from the constant cache and broadcast to all 32 threads in a single cycle
D) The warp is split into two half-warps, each requiring a separate cache read
This is the ideal use case for constant memory. The constant cache has a broadcast mechanism: when all threads in a warp request the same address, the value is read once from cache and broadcast to all 32 threads simultaneously, costing essentially one cycle from cache. Conversely, if threads access N different constant memory addresses, the requests serialize into N sequential reads, making performance NÃ worse. This is why constant memory is best for uniform values â physical constants, kernel parameters, or filter coefficients where every thread needs the same data.
15
A Meta engineer shows you this kernel and asks what's wrong:
__global__ void kernel(float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float temp[256];
    temp[threadIdx.x] = compute(i);
    out[i] = temp[threadIdx.x] + temp[threadIdx.x + 1];
}
What is the primary bug?
A) The kernel doesn't check if i < N, causing out-of-bounds global memory access
B) There is a missing __syncthreads() between the shared memory write and the subsequent read of a neighbor's value, causing a race condition
C) Shared memory arrays cannot be indexed with threadIdx.x
D) The compute() function cannot be called from device code
The critical bug is a missing __syncthreads() barrier. When a thread reads temp[threadIdx.x + 1], it reads a value written by a DIFFERENT thread (the one with threadIdx.x + 1). Without __syncthreads() between the shared memory write and read, there's no guarantee the other thread has completed its write: a classic race condition producing silently incorrect results. The fix: add __syncthreads() after the write and before the read. Option A (bounds checking) is also a real concern, but the race condition is the primary bug and one of the most commonly asked CUDA interview questions.
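A sketch of the repaired kernel (compute() is assumed to be an existing __device__ function; a guard for the last thread is added too, since temp[threadIdx.x + 1] would otherwise run off the end of the array):

```cuda
__global__ void kernel(float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float temp[256];
    temp[threadIdx.x] = compute(i);
    __syncthreads();   // all writes visible to the block before any reads

    if (i < N && threadIdx.x < blockDim.x - 1)
        out[i] = temp[threadIdx.x] + temp[threadIdx.x + 1];
}
```
Note that __syncthreads() is placed before the if, outside any divergent branch: calling it from a path that only some threads of the block take is itself undefined behavior.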