15 questions on matrix computations: the classic GPU programming interview topic. Matrix multiplication optimization is asked at NVIDIA, Google Brain, and every AI team!
📝 Instructions
Answer all 15 questions on matrix operations and GEMM optimization.
Row-major: idx = row * width + col. Tiling reduces global loads from 2N to 2N/TILE_WIDTH. cuBLAS is column-major!
⚠️ Try solving it yourself first – you'll learn more!
🧠 Quiz Time
1
In row-major order, how do you access element M[row][col] of a matrix with width W stored as a 1D array?
A) M[col * W + row]
B) M[row * W + col]
C) M[row + col * W]
D) M[(row + col) * W]
In row-major order, the elements of each row are stored contiguously, so the linear index is row * W + col, where W is the number of columns (width). Option A swaps the roles of row and col, which is column-major-style indexing. This is the most fundamental formula for GPU matrix work.
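The formula can be checked with a quick host-side sketch (plain Python, with a hypothetical 3×4 example matrix):

```python
# Row-major flattening: element (row, col) of a W-wide matrix lives at row * W + col.
H, W = 3, 4
M2d = [[10 * r + c for c in range(W)] for r in range(H)]   # 2D reference
flat = [x for rowvals in M2d for x in rowvals]             # row-major 1D layout

def idx(row, col, width):
    """Linear index of M[row][col] in row-major storage."""
    return row * width + col

# Every 2D access agrees with the flattened layout.
assert all(flat[idx(r, c, W)] == M2d[r][c] for r in range(H) for c in range(W))
```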
2
For a matrix addition kernel operating on two N×N matrices, what is the computational complexity and why is it considered 'embarrassingly parallel'?
A) O(N³) – each element depends on an entire row and column
B) O(N²) – each output element is independent and requires one addition
C) O(N² log N) – requires a reduction across rows
D) O(N²) – but requires synchronization between threads
Matrix addition is O(N²): each of the N² output elements is simply C[i][j] = A[i][j] + B[i][j]. It is embarrassingly parallel because no output element depends on any other, so all N² additions can proceed simultaneously with zero inter-thread communication.
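A small sketch of the independence property (hypothetical 4×4 inputs): computing the elements in any order gives the same result, because no element reads any other output.

```python
# Each output element of C = A + B depends only on the matching inputs,
# so the N*N additions can run in any order (or all at once on a GPU).
N = 4
A = [[r + c for c in range(N)] for r in range(N)]
B = [[r * c for c in range(N)] for r in range(N)]

def add_element(i, j):
    return A[i][j] + B[i][j]          # no reads of any other C element

C = [[add_element(i, j) for j in range(N)] for i in range(N)]

# Reversed evaluation order: identical result, demonstrating independence.
C_rev = [[0] * N for _ in range(N)]
for i in reversed(range(N)):
    for j in reversed(range(N)):
        C_rev[i][j] = add_element(i, j)
assert C == C_rev
```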
3
When launching a 2D kernel for an MΓN matrix operation, which dim3 configuration correctly maps threads to matrix elements?
A) dim3 block(16, 16); dim3 grid(M/16, N/16); with row = blockIdx.x * blockDim.x + threadIdx.x
B) dim3 block(16, 16); dim3 grid((N+15)/16, (M+15)/16); with col = blockIdx.x * blockDim.x + threadIdx.x, row = blockIdx.y * blockDim.y + threadIdx.y
C) dim3 block(256); dim3 grid(M*N/256); with row = threadIdx.x / N
The standard 2D mapping uses 16×16 thread blocks and computes grid dimensions with ceiling division: grid.x = (N+15)/16, grid.y = (M+15)/16. Then col = blockIdx.x * blockDim.x + threadIdx.x and row = blockIdx.y * blockDim.y + threadIdx.y. The x-dimension maps to columns and the y-dimension to rows, matching the convention where x varies fastest for coalesced memory access in row-major layout.
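The ceiling-division sizing and index mapping can be modeled on the CPU (hypothetical M=100, N=70, chosen so neither is a multiple of 16):

```python
# Ceiling-division grid sizing for an M x N matrix with 16x16 blocks,
# mirroring option B's mapping (x -> col, y -> row).
M, N = 100, 70
BX = BY = 16
grid_x = (N + BX - 1) // BX   # ceil(N / 16)
grid_y = (M + BY - 1) // BY   # ceil(M / 16)

covered = set()
for by in range(grid_y):
    for bx in range(grid_x):
        for ty in range(BY):
            for tx in range(BX):
                row = by * BY + ty
                col = bx * BX + tx
                if row < M and col < N:      # bounds guard, as in the kernel
                    covered.add((row, col))
assert len(covered) == M * N   # every element is covered exactly once
```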
4
In a naive matrix multiplication kernel (C = A × B, all N×N), how many global memory reads does each thread perform to compute one output element?
A) N reads from A and N reads from B = 2N total
B) NΒ² reads total from both matrices
C) 2 reads (one from A, one from B)
D) N reads total (shared between A and B)
Each thread computes C[row][col] = Σ A[row][k] * B[k][col] for k = 0..N-1. This requires reading N elements from row 'row' of A and N elements from column 'col' of B, totaling 2N global memory reads per thread. This massive redundancy is exactly why tiled GEMM with shared memory exists.
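A sequential model of one thread's work makes the 2N count explicit (hypothetical N=8, constant-valued matrices for simplicity):

```python
# Count global reads in a naive matmul: each output element walks a full
# row of A and a full column of B.
N = 8
A = [[1.0] * N for _ in range(N)]
B = [[2.0] * N for _ in range(N)]

def naive_element(row, col):
    reads = 0
    acc = 0.0
    for k in range(N):
        acc += A[row][k] * B[k][col]   # one read from A, one from B
        reads += 2
    return acc, reads

val, reads_per_element = naive_element(3, 5)
assert reads_per_element == 2 * N      # 2N global reads per output element
```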
5
What is the total number of floating-point operations (FLOPs) for multiplying two N×N matrices?
A) N² (one multiply per output element)
B) N³ (N multiply-adds for each of N² elements)
C) 2N³ (N multiplications + N additions for each of N² elements)
D) 2N² (two operations per output element)
For each of the N² output elements, we compute a dot product of length N, which requires N multiplications and N-1 additions (≈ N additions). Total FLOPs = N² × 2N = 2N³. This is the standard FLOP count used to calculate GFLOPS for GEMM benchmarks. For N = 1024, that's about 2.1 billion FLOPs.
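The count is easy to sanity-check, and it is the numerator in any GFLOPS calculation (the 1 ms kernel time below is a made-up example):

```python
# GEMM FLOP count: N multiplies + ~N adds per output element, N^2 elements.
def gemm_flops(n):
    return 2 * n ** 3

assert gemm_flops(1024) == 2_147_483_648       # ~2.1 billion for N = 1024

# Hypothetical: a kernel that finishes in 1 ms would deliver ~2147 GFLOPS.
gflops_at_1ms = gemm_flops(1024) / 1e-3 / 1e9
```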
6
Why does naive matrix multiplication have poor memory access patterns when reading matrix B?
A) Matrix B is too large to fit in L2 cache
B) Threads in the same warp read the same column of B, striding through memory by N elements – causing uncoalesced accesses
C) Matrix B must be read in reverse order
D) B is always stored in column-major format
In row-major storage, consecutive elements of a column are N elements apart in memory. With the usual mapping (threadIdx.x → col), threads in a warp read B[k][col] for consecutive col values at each step k, which is coalesced across the warp; the problem is within each thread: as k advances, its reads of B stride through memory by N elements, so traversing a column of B has no spatial locality and wastes most of each cache line. Combined with every thread reading an entire row of A and column of B from global memory, this produces massive redundant traffic – the motivation for tiling.
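The stride pattern can be made concrete by listing the linear addresses a single thread touches (hypothetical N=16):

```python
# In row-major storage, walking down a column of B jumps N elements per step.
N = 16
def addr(row, col):           # linear address of B[row][col]
    return row * N + col

col = 3
column_addrs = [addr(k, col) for k in range(N)]       # one thread's k-loop in B
strides = [b - a for a, b in zip(column_addrs, column_addrs[1:])]
assert all(s == N for s in strides)    # stride-N: poor spatial locality

row = 3
row_addrs = [addr(row, k) for k in range(N)]          # contrast: a row of A
assert all(b - a == 1 for a, b in zip(row_addrs, row_addrs[1:]))  # unit stride
```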
7
In tiled matrix multiplication with TILE_WIDTH = 16, how many global memory reads per output element are needed for N×N matrices?
A) 2N (same as naive)
B) 2 × (N / TILE_WIDTH) = 2N/16 = N/8
C) TILE_WIDTH² = 256
D) 2 (one from each matrix)
With tiling, the computation is split into N/TILE_WIDTH phases. In each phase, each thread loads one element from A and one from B into shared memory, then all threads in the block reuse these TILE_WIDTH² values. So each thread performs 2 × (N/TILE_WIDTH) = 2N/TILE_WIDTH global memory reads instead of 2N. For TILE_WIDTH = 16, this is a 16× reduction in global memory traffic – a massive speedup!
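A sequential sketch of one thread's view of the tiled loop shows the reduced read count (hypothetical N=8, TILE=4; the neighbors' shared-memory loads are modeled by simply not counting them):

```python
# Sequential model of tiled GEMM: count how many global reads one thread
# issues when tiles are staged through shared memory.
N, TILE = 8, 4
A = [[float(r * N + c) for c in range(N)] for r in range(N)]
B = [[1.0] * N for _ in range(N)]

def tiled_element_reads(row, col):
    reads = 0
    acc = 0.0
    for ph in range(N // TILE):            # one pass per tile phase
        # This thread loads ONE element of each tile; the other TILE*TILE - 1
        # values it consumes were loaded by neighbor threads into shared memory.
        reads += 2
        for k in range(ph * TILE, (ph + 1) * TILE):
            acc += A[row][k] * B[k][col]   # served from shared memory
    return acc, reads

acc, reads = tiled_element_reads(2, 5)
assert reads == 2 * (N // TILE)            # 2N/TILE instead of 2N
assert acc == sum(A[2][k] * B[k][5] for k in range(N))   # result unchanged
```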
8
How much shared memory does a single thread block use in tiled GEMM with TILE_WIDTH = 16 and float elements?
A) 16 × 4 = 64 bytes
B) 16 × 16 × 4 = 1024 bytes (1 KB)
C) 2 × 16 × 16 × 4 = 2048 bytes (2 KB)
D) 16 × 16 × 16 × 4 = 16384 bytes (16 KB)
Each block allocates two shared memory tiles: one for the A sub-matrix (TILE_WIDTH × TILE_WIDTH) and one for the B sub-matrix. With float (4 bytes): 2 × 16 × 16 × 4 = 2048 bytes = 2 KB per block. This matters for occupancy: shared memory per SM is limited (e.g., 48 KB), so 2 KB per block allows up to 24 resident blocks (hardware permitting).
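The budget arithmetic, with the 48 KB per-SM figure as an illustrative assumption:

```python
# Shared-memory budget per block and the resulting occupancy ceiling.
TILE = 16
FLOAT_BYTES = 4
smem_per_block = 2 * TILE * TILE * FLOAT_BYTES   # tiles for A and B
assert smem_per_block == 2048                    # 2 KB

smem_per_sm = 48 * 1024                          # assumed 48 KB per SM
max_blocks_by_smem = smem_per_sm // smem_per_block
assert max_blocks_by_smem == 24   # registers/thread limits may apply first
```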
9
Why is __syncthreads() critical inside the tiled GEMM kernel, and where must it be placed?
A) Only once at the end of the kernel to ensure all output is written
B) Once after loading tiles into shared memory – to ensure all data is loaded before computation begins
C) Twice per tile phase: once after loading the tile into shared memory, and once after computing with the tile (before loading the next tile)
D) It's optional β shared memory is automatically synchronized
Two barriers per phase are needed: (1) after loading – ensures all threads have written their data to shared memory before any thread starts reading it for computation; (2) after computing – ensures all threads are done reading the current tile before any thread overwrites it with data for the next phase. Missing the second barrier lets a fast thread overwrite shared memory while slower threads are still reading the old tile.
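The two-barrier pattern can be modeled on the CPU with `threading.Barrier` standing in for `__syncthreads()` (hypothetical 4×4 matmul, one 2×2 block, one thread per output element):

```python
# CPU model of one thread block: two barriers per phase, as in the tiled kernel.
import threading

N, TILE = 4, 2
A = [[float(r * N + c) for c in range(N)] for r in range(N)]
B = [[float(r + c) for c in range(N)] for r in range(N)]
As = [[0.0] * TILE for _ in range(TILE)]   # "shared memory" tiles
Bs = [[0.0] * TILE for _ in range(TILE)]
C = [[0.0] * TILE for _ in range(TILE)]    # output block (0, 0)
barrier = threading.Barrier(TILE * TILE)   # one thread per output element

def thread_fn(ty, tx):
    row, col = ty, tx                      # block (0, 0): global == local index
    acc = 0.0
    for ph in range(N // TILE):
        As[ty][tx] = A[row][ph * TILE + tx]
        Bs[ty][tx] = B[ph * TILE + ty][col]
        barrier.wait()                     # 1st sync: tile fully loaded
        for k in range(TILE):
            acc += As[ty][k] * Bs[k][tx]
        barrier.wait()                     # 2nd sync: done reading this tile
    C[row][col] = acc

threads = [threading.Thread(target=thread_fn, args=(ty, tx))
           for ty in range(TILE) for tx in range(TILE)]
for t in threads: t.start()
for t in threads: t.join()

ref = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(TILE)]
       for i in range(TILE)]
assert C == ref
```

Without the second `barrier.wait()`, a fast thread could start the next phase's `As[ty][tx] = ...` store while a slow thread is still summing over the old tile.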
10
When the matrix dimension N is not a multiple of TILE_WIDTH, what must the tiled GEMM kernel do?
A) Pad the matrices to the nearest multiple of TILE_WIDTH on the host before launching
B) Use boundary checks: load 0.0 into shared memory for out-of-bounds indices, and guard the output write with a bounds check
C) Reduce TILE_WIDTH to the greatest common divisor of N and TILE_WIDTH
D) Launch extra threads that write to a dummy buffer
The standard approach is to add boundary conditions inside the kernel. When a thread's global index exceeds the matrix dimensions, it loads 0.0 into shared memory (which contributes nothing to the dot product). The final output write is also guarded: if (row < M && col < N). This avoids host-side padding overhead and works for arbitrary matrix sizes while maintaining the tile structure.
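A sequential model of the boundary handling (hypothetical N=5, TILE=2, so N is deliberately not a multiple of the tile width):

```python
# Out-of-range loads become 0.0 and the output store is guarded,
# so any N works with a fixed TILE.
N, TILE = 5, 2
A = [[float(r + 1) for c in range(N)] for r in range(N)]
B = [[float(c + 1) for c in range(N)] for r in range(N)]
phases = (N + TILE - 1) // TILE            # ceil(N / TILE)

def load(Mtx, r, c):
    return Mtx[r][c] if r < N and c < N else 0.0   # 0.0 for out-of-bounds

C = [[0.0] * N for _ in range(N)]
for by in range(phases):
    for bx in range(phases):
        for ty in range(TILE):
            for tx in range(TILE):
                row, col = by * TILE + ty, bx * TILE + tx
                acc = 0.0
                for ph in range(phases):
                    for k in range(TILE):
                        acc += load(A, row, ph * TILE + k) * \
                               load(B, ph * TILE + k, col)
                if row < N and col < N:    # guarded output write
                    C[row][col] = acc

ref = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
       for i in range(N)]
assert C == ref   # zero-padding contributes nothing to the dot products
```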
11
cuBLAS uses column-major storage by default. When calling cublasSgemm to compute C = α·A·B + β·C with row-major matrices, what is the standard trick?
A) Transpose A and B on the host before calling cublasSgemm
B) Exploit the identity: computing B^T · A^T in column-major gives (A·B)^T, which is A·B in row-major – so swap the matrix arguments
C) Set the CUBLAS_OP_T flag on all three matrices
D) Use cublasSetMatrix to convert row-major to column-major
The classic trick relies on the fact that a row-major matrix buffer, read as column-major, is the transpose of that matrix. By swapping the order of the arguments (passing B first, then A) and swapping the M/N dimensions, cuBLAS computes B^T · A^T in its column-major view, which equals (A·B)^T in column-major = A·B in row-major. This avoids any explicit transposition and is standard practice in production CUDA code.
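The identity can be verified with a toy column-major GEMM in pure Python (`cm_gemm` is a hypothetical stand-in for cublasSgemm's storage convention, not its API):

```python
# A flat row-major buffer, read as column-major, is the transpose -- so a
# column-major GEMM over (B, A) yields A @ B laid out row-major in the output.
def cm_get(buf, rows, r, c):        # column-major read: element (r, c)
    return buf[c * rows + r]

def cm_gemm(bufX, bufY, m, k, n):   # column-major C(m x n) = X(m x k) @ Y(k x n)
    out = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            out[j * m + i] = sum(cm_get(bufX, m, i, t) * cm_get(bufY, k, t, j)
                                 for t in range(k))
    return out

M, K, N = 2, 3, 4
A = [[float(i * K + j) for j in range(K)] for i in range(M)]   # row-major
B = [[float(i * N + j) for j in range(N)] for i in range(K)]
flatA = [x for row in A for x in row]
flatB = [x for row in B for x in row]

# The trick: pass B first, then A, with swapped dimensions (N, K, M).
flatC = cm_gemm(flatB, flatA, N, K, M)

ref = [[sum(A[i][t] * B[t][j] for t in range(K)) for j in range(N)]
       for i in range(M)]
assert flatC == [x for row in ref for x in row]    # row-major A @ B, no transpose
```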
12
For multiplying two 4096×4096 float matrices, approximately how much data must be transferred and how many FLOPs are performed?
A) Data: ~192 MB, FLOPs: ~137 billion – compute bound
B) Data: ~192 MB, FLOPs: ~67 million – memory bound
C) Data: ~64 MB, FLOPs: ~137 billion – compute bound
Data: 3 matrices × 4096² × 4 bytes = 3 × 64 MB = 192 MB. FLOPs: 2 × 4096³ ≈ 137.4 billion. The arithmetic intensity is ~137 billion FLOPs / 192 MB ≈ 680 FLOPs/byte, far exceeding the GPU's compute-to-bandwidth ratio (roughly 10–100 FLOPs/byte depending on precision). Large GEMM is therefore solidly compute-bound, which is why GPUs can achieve near-peak TFLOPS on it.
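The roofline arithmetic, spelled out:

```python
# Back-of-envelope roofline numbers for a 4096 x 4096 SGEMM.
N = 4096
bytes_moved = 3 * N * N * 4            # A, B, C as float32
flops = 2 * N ** 3
intensity = flops / bytes_moved        # FLOPs per byte

assert bytes_moved == 3 * 64 * 2**20   # 192 MiB
assert flops == 137_438_953_472        # ~137.4 billion
assert 600 < intensity < 800           # far above typical machine balance
```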
13
What is the optimal strategy for a matrix transpose kernel to maximize memory throughput?
A) Read rows (coalesced) from input, write rows (coalesced) to output directly
B) Read columns from input into shared memory, then write rows from shared memory to output
C) Read rows (coalesced) from input into shared memory, then write the transposed tile with coalesced writes to output
D) Use a single thread to transpose the entire matrix for simplicity
The efficient transpose reads a tile from the input matrix with coalesced row reads into shared memory, then writes the transposed tile to the output, also with coalesced writes: threads read input[row][col] (coalesced across col) into shMem[threadIdx.y][threadIdx.x], then write shMem[threadIdx.x][threadIdx.y] to the transposed output position (with the block indices swapped), where consecutive threadIdx.x values map to consecutive output addresses. Both global reads and writes are coalesced; in practice the shared tile is also padded to TILE_DIM+1 columns to avoid bank conflicts on the transposed read.
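A sequential model of the tile scheme (hypothetical N=8, TILE=4) confirms both the swapped-block-index addressing and the result:

```python
# Tile-based transpose model: both the global read and the global write
# touch consecutive addresses across consecutive tx values.
N, TILE = 8, 4
inp = list(range(N * N))                     # row-major N x N
out = [0] * (N * N)
sh = [[0] * TILE for _ in range(TILE)]       # "shared memory" tile

for by in range(N // TILE):
    for bx in range(N // TILE):
        for ty in range(TILE):               # coalesced read phase
            for tx in range(TILE):
                r, c = by * TILE + ty, bx * TILE + tx
                sh[ty][tx] = inp[r * N + c]  # consecutive tx -> consecutive addrs
        for ty in range(TILE):               # coalesced write phase
            for tx in range(TILE):
                r, c = bx * TILE + ty, by * TILE + tx   # swapped block indices
                out[r * N + c] = sh[tx][ty]  # transposed read from the tile

# out is the transpose of inp.
assert all(out[c * N + r] == inp[r * N + c] for r in range(N) for c in range(N))
```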
14
When increasing TILE_WIDTH from 16 to 32 in tiled GEMM, what trade-off should you consider?
A) No trade-off – larger tiles are always better
B) Shared memory per block increases from 2 KB to 8 KB, potentially reducing occupancy; but global memory accesses are halved
C) Shared memory usage stays the same, but register usage doubles
D) Larger tiles require more __syncthreads() calls, adding overhead
Doubling TILE_WIDTH from 16 to 32 quadruples shared memory usage: 2 × 32 × 32 × 4 = 8 KB per block (vs. 2 KB for 16). This may reduce occupancy if shared memory per SM is limited. However, global memory reads drop from 2N/16 to 2N/32 per element – a 2× reduction. Also, the block becomes 32×32 = 1024 threads, the maximum for most GPUs. The optimal tile size balances memory reuse against occupancy.
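The trade-off in numbers, under an assumed 48 KB shared-memory-per-SM limit and N=4096:

```python
# Comparing TILE_WIDTH = 16 vs 32 (hypothetical 48 KB smem/SM, N = 4096).
def tile_stats(tile, n=4096, smem_per_sm=48 * 1024):
    smem = 2 * tile * tile * 4               # two float tiles per block
    return {
        "smem_per_block": smem,
        "max_blocks_by_smem": smem_per_sm // smem,
        "global_reads_per_element": 2 * n // tile,
        "threads_per_block": tile * tile,
    }

s16, s32 = tile_stats(16), tile_stats(32)
assert s32["smem_per_block"] == 4 * s16["smem_per_block"]   # 2 KB -> 8 KB
assert s32["global_reads_per_element"] * 2 == s16["global_reads_per_element"]
assert s32["threads_per_block"] == 1024   # block-size maximum on most GPUs
```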
15
What is NVIDIA's CUTLASS library, and when would you use it over cuBLAS?
A) CUTLASS is a Python wrapper around cuBLAS for easier matrix multiplication
B) CUTLASS provides templated C++ GEMM kernels that can be customized (fused epilogues, mixed precision, custom layouts) – use it when cuBLAS doesn't support your specific GEMM variant
C) CUTLASS is a debugging tool for matrix operation correctness
D) CUTLASS replaces cuBLAS entirely and is always faster
CUTLASS (CUDA Templates for Linear Algebra Subroutines) is an open-source collection of highly optimized, composable GEMM building blocks in header-only C++. Unlike cuBLAS (a closed-source, pre-compiled library), CUTLASS lets you customize tile sizes, data types, memory access patterns, and fuse operations (like bias add + ReLU) into the GEMM epilogue. Use CUTLASS when you need custom GEMM variants that cuBLAS doesn't cover, such as grouped GEMM, mixed-precision with unusual types, or operator fusion.