15 performance-focused questions that separate junior from senior CUDA developers. Interviewers use these to gauge real optimization experience!
📋 Instructions
Answer all 15 questions on CUDA performance optimization techniques.
Coalescing = sequential access. Divergence = branch in warp. Occupancy = active warps / max warps.
⚠️ Try solving it yourself first — you'll learn more!
🧠 Quiz Time
1
What does 'memory coalescing' mean in CUDA?
A) Combining multiple small kernel launches into one large launch
B) Merging multiple threads' global memory accesses into fewer, wider memory transactions
C) Copying data from host memory to device memory in a single transfer
D) Using shared memory as a cache to reduce global memory latency
Memory coalescing is a hardware optimization where consecutive threads in a warp access consecutive memory addresses, allowing the memory controller to merge (coalesce) those individual requests into a minimum number of wide memory transactions (e.g., 32-byte, 64-byte, or 128-byte segments). This maximizes global memory throughput. It is NOT about kernel launches, host-to-device copies, or shared memory caching — those are entirely different concepts.
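A minimal sketch of the contrast (kernel names are illustrative, not part of the quiz):

```cuda
// Coalesced: thread i reads element i, so consecutive threads touch
// consecutive addresses and the warp's 32 loads merge into a few
// wide memory transactions.
__global__ void coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: a stride of 32 floats (128 bytes) puts every thread in
// the warp on a different cache line, so 32 separate transactions fire.
__global__ void strided(const float* in, float* out, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (i < n) out[i] = in[i];
}
```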
2
Which access pattern achieves PERFECT coalescing for a warp of 32 threads reading float values from global memory?
A) Each thread reads data[threadIdx.x * 32]
B) Each thread reads data[threadIdx.x]
C) Each thread reads data[rand() % N]
D) Each thread reads data[blockDim.x - threadIdx.x]
Perfect coalescing happens when thread i accesses address base + i, meaning consecutive threads read consecutive addresses. data[threadIdx.x] gives exactly this — thread 0 reads data[0], thread 1 reads data[1], etc. Option A is a stride-32 pattern (threads access addresses 32 elements apart), which is terrible — each access hits a different cache line. Option C is random access (worst case). Option D is reversed order, which on modern architectures may still coalesce within a 128-byte segment, but is not guaranteed to be perfectly coalesced.
3
A kernel accesses global memory with a stride of 2: data[2 * threadIdx.x]. What is the impact?
A) No impact — the hardware handles any stride efficiently
B) 50% bandwidth utilization because every other element in each cache line is wasted
C) The kernel will fail to compile due to misaligned access
D) The GPU will automatically rearrange threads to coalesce the access
With stride-2 access, threads read every other element: data[0], data[2], data[4], ... The memory controller still fetches full cache lines (128 bytes), but only half the data in each line is actually used. This wastes 50% of the memory bandwidth. Higher strides waste even more — stride-32 wastes 96.875%! The hardware does NOT rearrange threads or fix strided patterns. The kernel compiles fine; it's a performance problem, not a correctness problem.
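A common real-world source of stride-2 access is an array-of-structs layout with two fields; switching to struct-of-arrays restores coalescing. A hedged sketch (the `Particle` type is invented for illustration):

```cuda
struct Particle { float x, y; };  // AoS: consecutive .x values are 8 bytes apart

// Stride-2 pattern: reading only .x skips every other float slot,
// so half of each fetched cache line is wasted.
__global__ void aos_read(const Particle* p, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;
}

// SoA layout: consecutive threads read consecutive floats -- fully coalesced.
__global__ void soa_read(const float* x, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];
}
```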
4
What is 'warp divergence' in CUDA?
A) When threads in different blocks execute different kernels
B) When threads within the same warp take different paths at a branch (if/else)
C) When warps are scheduled on different SMs
D) When a warp accesses misaligned memory causing multiple transactions
Warp divergence occurs when threads within the same warp (32 threads that execute in lockstep) encounter a conditional branch (if/else, switch) and some threads take one path while others take a different path. Since all threads in a warp share a single program counter (on pre-Volta) or must converge (on Volta+), both paths must be executed serially — threads not on the active path are masked off. This effectively halves (or worse) the warp's throughput at that branch. Divergence is per-warp, not per-block or per-grid.
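A hedged sketch of a divergent branch versus a warp-uniform one (kernel names are mine):

```cuda
// Divergent: threads in the SAME warp split on threadIdx.x % 2,
// so the even-thread path and the odd-thread path run serially.
__global__ void divergent(float* d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) d[i] *= 2.0f;
    else                      d[i] += 1.0f;
}

// Warp-uniform: the condition is identical for all 32 threads of any
// given warp (threadIdx.x / 32 is the warp index), so no divergence.
__global__ void warp_uniform(float* d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0) d[i] *= 2.0f;
    else                             d[i] += 1.0f;
}
```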
5
How does the GPU handle a warp where 16 threads take the 'if' branch and 16 take the 'else' branch?
A) The warp splits into two half-warps that execute in parallel on separate cores
B) Both branches execute serially — first the 'if' path with 16 threads active, then the 'else' path with the other 16 active
C) The GPU picks the branch taken by the majority and ignores the other threads
D) A hardware exception is raised and the kernel aborts
When a warp diverges, the GPU executes BOTH paths serially. First, the 16 threads that took the 'if' branch execute while the other 16 are masked (disabled but still occupy the pipeline). Then the 16 threads for the 'else' branch execute while the first 16 are masked. At the reconvergence point, all 32 threads resume together. The warp does NOT split into independent half-warps (that's not how SIMT works), the GPU does NOT speculate on majority, and there's no exception. Worst case: if each thread takes a unique path in a switch, all paths serialize.
6
What is 'occupancy' in CUDA?
A) The percentage of GPU cores that are actively computing at any moment
B) The ratio of active warps to the maximum number of warps an SM can support
C) The fraction of global memory that is currently allocated
D) The percentage of time the GPU spends executing kernels vs. being idle
Occupancy is defined as the ratio of active warps on an SM to the maximum number of warps that SM can support. For example, if an SM supports up to 64 warps and your kernel configuration results in 32 active warps, occupancy is 50%. Higher occupancy helps the warp scheduler hide memory latency by switching between warps. It is NOT about active cores (that's utilization), memory allocation, or GPU idle time (that's duty cycle). Note: 100% occupancy doesn't guarantee maximum performance — other bottlenecks may dominate.
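The CUDA runtime can compute this ratio for you via the occupancy API. A hedged sketch (trivial kernel and 256-thread block size are assumptions for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of 256 threads can be resident per SM for this kernel,
    // given its register and shared-memory footprint?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);

    int activeWarps = blocksPerSM * 256 / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
}
```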
7
A kernel uses 64 registers per thread. The SM supports a maximum of 65,536 registers. What is the maximum number of threads that can be resident on the SM?
A) 2048
B) 1024
C) 512
D) 256
Maximum resident threads = total registers / registers per thread = 65,536 / 64 = 1,024 threads. This means at most 1,024 / 32 = 32 warps can be active. If the SM supports 64 warps max, occupancy is limited to 50% by register usage alone. This is why reducing register pressure (via compiler flags like --maxrregcount or algorithmic changes) can improve occupancy. Each thread's register demand directly limits how many threads can coexist on the SM.
8
Which nvcc compiler flag enables aggressive mathematical optimizations that may sacrifice IEEE-754 precision?
A) -O3
B) --use_fast_math
C) --ftz=true
D) -arch=sm_80
--use_fast_math is a compound flag that enables multiple aggressive optimizations: it turns on --ftz=true (flush denormals to zero), --prec-div=false (fast but imprecise division), --prec-sqrt=false (fast but imprecise square root), and --fmad=true (fused multiply-add). These trade IEEE-754 compliance for speed. -O3 controls host-side optimization level, not device math precision. --ftz=true alone only handles denormals. -arch specifies the target GPU architecture, not math precision. Use --use_fast_math when slight numerical error is acceptable (e.g., graphics, ML training).
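At the source level, the flag effectively substitutes fast device intrinsics for standard operations; you can also opt in per-expression. A hedged sketch:

```cuda
// With --prec-div=false (implied by --use_fast_math), 'x / y' compiles to
// the fast division path; __fdividef() requests it explicitly regardless
// of flags. The fast path is quicker but not IEEE-754 correctly rounded.
__global__ void fast_vs_precise(float* out, const float* x, const float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float precise = x[i] / y[i];              // IEEE-rounded under default flags
    float fast    = __fdividef(x[i], y[i]);   // fast intrinsic, reduced accuracy
    out[i] = precise - fast;                  // typically tiny but nonzero
}
```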
9
What does '#pragma unroll' do when placed before a loop in CUDA device code?
A) It parallelizes the loop iterations across multiple threads
B) It tells the compiler to fully unroll the loop, replacing it with repeated sequential instructions
C) It moves the loop computation from GPU to CPU for faster execution
D) It converts the loop into a recursive function call
#pragma unroll instructs the nvcc compiler to unroll the loop — replacing the loop with N copies of the loop body, eliminating the loop counter, branch instructions, and loop overhead. For example, a loop iterating 4 times becomes 4 sequential copies of the body. This increases instruction-level parallelism (ILP) and removes branch overhead, but increases code size and register usage. You can also specify partial unrolling with #pragma unroll N. It does NOT parallelize iterations across threads, move code to CPU, or create recursion.
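A hedged sketch of the pragma on a fixed-trip-count loop:

```cuda
// Fully unrolled: the compiler emits 4 copies of the body with no
// loop counter or branch; constant indices also let it keep
// intermediates in registers.
__global__ void dot4(const float* a, const float* b, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    #pragma unroll              // or '#pragma unroll 2' for partial unrolling
    for (int k = 0; k < 4; ++k)
        sum += a[4 * i + k] * b[4 * i + k];
    out[i] = sum;
}
```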
10
What is instruction-level parallelism (ILP) and why does it matter in CUDA kernels?
A) ILP means running multiple kernels concurrently — it improves GPU utilization
B) ILP means a single thread has multiple independent instructions that can execute simultaneously in the pipeline, reducing stalls
C) ILP means distributing instructions across multiple warps to balance load
D) ILP means using SIMD instructions to process 32 elements at once
Instruction-Level Parallelism (ILP) refers to a single thread having multiple independent instructions that can overlap in the GPU's execution pipeline. For example, if a thread computes a = x + y and b = w + z, these are independent and can be pipelined simultaneously. Higher ILP means the hardware can keep functional units busy even within a single thread, reducing pipeline stalls. This is distinct from thread-level parallelism (multiple threads/warps) and SIMT (32 threads in a warp). In practice, loop unrolling and reordering independent computations increase ILP.
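A hedged sketch of boosting ILP in a reduction with two independent accumulators:

```cuda
// Two accumulators form independent dependency chains, so their adds can
// overlap in the pipeline; a single 'sum' variable would serialize every
// add behind the previous one. (Tail handling omitted for brevity.)
__global__ void sum_ilp2(const float* a, float* partial, int n) {
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    float sum0 = 0.0f, sum1 = 0.0f;    // independent chains
    for (int k = i; k + stride < n; k += 2 * stride) {
        sum0 += a[k];                  // no dependency between these
        sum1 += a[k + stride];         // two adds -- they can overlap
    }
    partial[i] = sum0 + sum1;          // combine once at the end
}
```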
11
In a tiling strategy for matrix multiplication, what is the PRIMARY purpose of loading tiles into shared memory?
A) To compress the matrix data and reduce memory usage
B) To exploit data reuse — each element loaded from global memory is used multiple times from fast shared memory
C) To enable dynamic parallelism by launching child kernels per tile
D) To ensure all threads have sequential thread IDs
Tiling loads a small block (tile) of the input matrices from slow global memory into fast shared memory, then all threads in the block repeatedly access those elements from shared memory. In naive matrix multiplication, each input element is read from global memory O(N) times. With tiling, each element is loaded from global memory only once per tile phase — N/tile_size times in total — and reused tile_size times from shared memory, which is roughly 100x faster. This cuts global memory traffic by a factor of the tile size. It has nothing to do with compression, dynamic parallelism, or thread IDs.
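The classic tiled kernel, as a hedged sketch (square N x N matrices, N assumed to be a multiple of TILE for brevity):

```cuda
#define TILE 16

// C = A * B with shared-memory tiling. Launch with dim3(TILE, TILE) blocks.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // One coalesced global load per thread per tile phase...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        // ...then each loaded element is reused TILE times from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // don't overwrite tiles still being read
    }
    C[row * N + col] = acc;
}
```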
12
What is 'register spilling' and why is it harmful to CUDA kernel performance?
A) When the compiler allocates more registers than the hardware supports, excess variables spill to slow local memory (actually global memory)
B) When registers are shared between warps causing data corruption
C) When too many registers cause the GPU to overheat
D) When register values are lost during context switches between warps
Register spilling happens when a kernel needs more registers per thread than the hardware can allocate. The compiler moves excess variables to 'local memory,' which despite the name is actually stored in global memory (with L1/L2 caching). Since global memory is ~100x slower than registers, spilling introduces massive latency penalties. You can detect spilling by checking nvcc output with --ptxas-options=-v which reports register usage and local memory (lmem) usage. Reducing register pressure through algorithmic simplification or --maxrregcount can help, but may reduce ILP. Registers are NOT shared between warps — each warp has its own register set.
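A hedged example of checking for spills at compile time (the file name is illustrative):

```shell
# Ask ptxas to report per-kernel resource usage.
nvcc -O3 --ptxas-options=-v kernel.cu -c
# Look for output lines such as:
#   ptxas info : Used 64 registers, 48 bytes smem
#   ptxas info : 128 bytes stack frame, 128 bytes spill stores, 124 bytes spill loads
# Nonzero "spill stores/loads" (or lmem usage) means the kernel is spilling.
```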
13
What causes a shared memory bank conflict?
A) Two threads in the same warp access the same byte in shared memory
B) Two threads in the same warp access different addresses that map to the same shared memory bank
C) Two different blocks access the same shared memory address
D) The total shared memory requested exceeds the SM's capacity
Shared memory is divided into 32 banks (one per warp thread). A bank conflict occurs when two or more threads in the same warp access DIFFERENT addresses that map to the SAME bank. This serializes those accesses — a 2-way conflict takes 2 cycles instead of 1. Note: if threads access the EXACT same address (same byte), it's a broadcast and there is NO conflict. Different blocks have separate shared memory, so they can't conflict with each other. Exceeding SM capacity prevents block launch but isn't a bank conflict.
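The standard fix is to pad the shared array. A hedged sketch (assumes a dim3(32, 32) block):

```cuda
// Column access of a 32x32 float tile hits one bank 32 times (a 32-way
// conflict) because address % 32 is the same for the whole column.
// Padding each row to 33 floats shifts successive rows by one bank,
// making the column read conflict-free.
__global__ void transpose_tile(const float* in, float* out) {
    __shared__ float tile[32][33];      // 33, not 32: the padding column
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];        // row-wise write: conflict-free
    __syncthreads();
    out[x * 32 + y] = tile[x][y];       // column-wise read: conflict-free
}                                       // only because of the padding
```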
14
Which NVIDIA tool provides detailed kernel-level performance metrics including memory throughput, occupancy, warp stall reasons, and roofline analysis?
A) nvidia-smi
B) cuda-gdb
C) NVIDIA Nsight Compute
D) NVIDIA Visual Profiler (nvvp) on Ampere+ GPUs
NVIDIA Nsight Compute (ncu) is the modern kernel-level profiler that provides deep performance metrics: memory throughput, compute throughput, occupancy, warp stall analysis, roofline charts, source-level correlation, and optimization recommendations. nvidia-smi is a system monitoring tool (utilization, temperature, memory usage) — it doesn't profile individual kernels. cuda-gdb is a debugger, not a profiler. nvvp (Visual Profiler) and nvprof are deprecated and do NOT work on Ampere (sm_80) and newer GPUs — Nsight Compute is their replacement.
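A hedged example of the command-line workflow (the binary and kernel names are illustrative):

```shell
# Collect the full metric set (memory, compute, occupancy, stalls, roofline):
ncu --set full ./my_app
# Restrict profiling to kernels whose name matches:
ncu --kernel-name matmul_tiled ./my_app
```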
15
A kernel achieves 200 GFLOP/s on a GPU with 1 TB/s memory bandwidth. It loads 4 bytes per FLOP. Is this kernel compute-bound or memory-bound according to the roofline model?
A) Compute-bound, because 200 GFLOP/s is a high throughput
B) Memory-bound, because its arithmetic intensity (0.25 FLOP/byte) is below the ridge point
C) Neither — it is perfectly balanced at the ridge point
D) Cannot determine without knowing the GPU's peak compute throughput
Arithmetic intensity = FLOPs / bytes transferred = 1 FLOP / 4 bytes = 0.25 FLOP/byte. The roofline model plots achievable performance vs. arithmetic intensity. The ridge point is where the memory bandwidth ceiling meets the compute ceiling: ridge = peak GFLOP/s / bandwidth (GB/s). For this GPU, even a modest peak of 10 TFLOP/s gives a ridge point of 10,000 / 1,000 = 10 FLOP/byte. Since 0.25 << 10, this kernel is deep in the memory-bound region — its performance is limited by how fast it can move data, NOT by compute. The 200 GFLOP/s is only 200 * 4 = 800 GB/s of memory demand, close to the 1 TB/s bandwidth limit.