CUDA Programming Parallel Patterns
Exercise 35

🔗 Quiz: Parallel Patterns 25 XP Hard


🏆 Chapter 7 Quiz — Parallel Patterns Mastery!

15 questions on the fundamental building blocks of GPU algorithms. Reduction, scan, and histogram are asked in virtually every GPU programming interview!

📋 Instructions
Answer all 15 questions on parallel algorithmic patterns.
Reduction = O(log N) steps. Inclusive scan includes current element. Blelloch = work-efficient scan. Privatized histogram = per-block shared memory histogram to reduce atomic contention.
⚠️ Try solving it yourself first — you'll learn more!
🧠 Quiz Time
1
How many steps (rounds) does a tree-based parallel reduction take to sum N elements?
A) N steps — each element is added one at a time
B) N/2 steps — we process pairs each round
C) log₂(N) steps — each round halves the number of active elements
D) √N steps — a square-root decomposition is used
In tree-based parallel reduction, each step combines pairs of values, halving the number of remaining elements: N → N/2 → N/4 → ... → 1. This takes log₂(N) steps. For N=1024, that's only 10 steps instead of 1023 serial additions. The work complexity is still O(N) (same total additions), but the step complexity drops from O(N) to O(log N), which is the entire point of parallelizing reduction.
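The round-by-round halving can be modeled sequentially in Python (a CPU sketch of the parallel rounds, not CUDA code; the helper name `tree_reduce` is illustrative):

```python
def tree_reduce(data):
    """Model a tree-based parallel reduction: each round pairs up the
    surviving values, so a power-of-two N finishes in log2(N) rounds."""
    values = list(data)
    rounds = 0
    while len(values) > 1:
        # On a GPU, every pair below is combined in the same parallel step.
        values = [values[i] + values[i + 1] for i in range((0), len(values), 2)]
        rounds += 1
    return values[0], rounds
```

For N = 1024 this returns the same sum as a serial loop, after exactly 10 rounds instead of 1023 sequential additions.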
2
In CUDA parallel reduction using shared memory, why is 'sequential addressing' (stride starts large and halves) preferred over 'interleaved addressing' (stride starts at 1 and doubles)?
A) Sequential addressing uses fewer total additions
B) Sequential addressing avoids shared memory bank conflicts because active threads access consecutive addresses
C) Interleaved addressing requires more shared memory
D) Sequential addressing allows more threads to participate in each round
With interleaved addressing, the active thread tid accesses sdata[2*stride*tid], so the addresses spread apart as the stride grows — at stride=16, for example, the active threads hit sdata[0], sdata[32], sdata[64], ..., all of which map to the same shared memory bank, serializing the accesses. With sequential addressing (stride=N/2, N/4, ..., 1), the active threads are contiguous (threads 0, 1, ..., stride-1) and access sdata[tid] and sdata[tid+stride] — consecutive addresses that fall in consecutive banks, so every round is conflict-free. Both approaches perform the same number of additions; only the memory access pattern differs.
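A quick Python check of the bank mapping makes the difference concrete (assuming the 32-bank, 4-byte-word shared memory layout of recent NVIDIA GPUs; `worst_conflict` is an illustrative helper):

```python
from collections import Counter

NUM_BANKS = 32  # shared memory banks (4-byte words) on recent NVIDIA GPUs

def worst_conflict(indices):
    """Max simultaneous accesses to a single bank (1 = conflict-free)."""
    return max(Counter(i % NUM_BANKS for i in indices).values())

# Interleaved addressing at stride=16: active thread tid reads sdata[2*16*tid].
interleaved = [2 * 16 * tid for tid in range(8)]
# Sequential addressing at stride=128: warp 0 reads sdata[tid + 128].
sequential = [tid + 128 for tid in range(32)]
```

Here `worst_conflict(interleaved)` is 8 — every index is a multiple of 32, so all eight active threads hit bank 0 — while `worst_conflict(sequential)` is 1.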
3
After a parallel reduction within a block of 256 threads, which thread holds the final result?
A) Thread 255 (the last thread in the block)
B) Thread 128 (the middle thread)
C) Thread 0 (the first thread in the block)
D) All 256 threads hold a copy of the final result
In the standard tree reduction with sequential addressing, the stride starts at blockDim.x/2 and halves each round. At each step, threads with index < stride perform the addition: sdata[tid] += sdata[tid + stride]. After all log₂(256) = 8 rounds, only thread 0 remains active, and the final sum sits in sdata[0]. Thread 0 is typically responsible for writing this block-level partial sum to global memory. A second kernel or atomic operation then combines all block results.
4
What does the warp shuffle instruction __shfl_down_sync(mask, val, delta) do?
A) Shifts all values in shared memory down by 'delta' positions
B) Each thread receives the value of 'val' from the lane that is 'delta' positions higher within the same warp, without using shared memory
C) It atomically decrements a counter by 'delta'
D) It moves data from global memory to registers for threads offset by 'delta'
__shfl_down_sync(mask, val, delta) is a warp-level intrinsic where each lane receives the value from the lane that is 'delta' positions higher (lane + delta). For example, lane 0 gets the value from lane delta, lane 1 from lane 1+delta, etc. Lanes near the end of the warp (where lane+delta ≥ warpSize) get their own value unchanged. This enables intra-warp reduction WITHOUT shared memory or __syncthreads() — just a sequence of __shfl_down_sync calls with delta = 16, 8, 4, 2, 1. It's faster because register-to-register communication is cheaper than shared memory round-trips.
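The shuffle semantics can be modeled in plain Python (a sequential sketch of one warp; `shfl_down` mirrors the "out-of-range lanes keep their own value" behavior described above):

```python
WARP_SIZE = 32

def shfl_down(vals, delta):
    """Model __shfl_down_sync: lane i receives vals[i + delta];
    lanes where i + delta >= WARP_SIZE keep their own value."""
    return [vals[i + delta] if i + delta < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(vals):
    """Warp-level sum via shuffles: after deltas 16, 8, 4, 2, 1,
    lane 0 holds the total of all 32 lanes."""
    for delta in (16, 8, 4, 2, 1):
        shifted = shfl_down(vals, delta)
        vals = [v + s for v, s in zip(vals, shifted)]
    return vals[0]
```

Note that the upper lanes accumulate garbage values along the way — only lane 0's result is meaningful, which is exactly how the real warp-shuffle reduction is used.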
5
What is the difference between an inclusive scan and an exclusive scan of [3, 1, 4, 1, 5] using addition?
A) Inclusive: [3, 4, 8, 9, 14]; Exclusive: [0, 3, 4, 8, 9]
B) Inclusive: [0, 3, 4, 8, 9]; Exclusive: [3, 4, 8, 9, 14]
C) Inclusive: [3, 4, 8, 9, 14]; Exclusive: [3, 4, 8, 9, 14] (they are identical)
D) Inclusive: [14, 11, 10, 6, 5]; Exclusive: [14, 11, 10, 6, 5]
Inclusive scan includes the current element in the running total: out[i] = data[0] + data[1] + ... + data[i]. So [3, 3+1, 3+1+4, 3+1+4+1, 3+1+4+1+5] = [3, 4, 8, 9, 14]. Exclusive scan excludes the current element: out[i] = data[0] + ... + data[i-1], with out[0] = 0 (the identity element). So [0, 3, 3+1, 3+1+4, 3+1+4+1] = [0, 3, 4, 8, 9]. Exclusive scan is critical for computing write offsets in stream compaction — out[i] tells you WHERE to write element i.
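Both variants are one-liners in Python using `itertools.accumulate` (a sequential sketch, not the parallel implementation):

```python
from itertools import accumulate

def inclusive_scan(xs):
    """out[i] = xs[0] + ... + xs[i]"""
    return list(accumulate(xs))

def exclusive_scan(xs):
    """out[i] = xs[0] + ... + xs[i-1], seeded with the identity 0."""
    return [0] + list(accumulate(xs[:-1]))

inclusive_scan([3, 1, 4, 1, 5])  # [3, 4, 8, 9, 14]
exclusive_scan([3, 1, 4, 1, 5])  # [0, 3, 4, 8, 9]
```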
6
The Blelloch scan algorithm is called 'work-efficient'. What does this mean compared to the Hillis-Steele scan?
A) Blelloch does O(N) total work in O(log N) steps; Hillis-Steele does O(N log N) work in O(log N) steps
B) Blelloch does O(N log N) work; Hillis-Steele does O(N) work
C) Blelloch uses less shared memory than Hillis-Steele
D) Blelloch requires fewer __syncthreads() calls than Hillis-Steele
Blelloch's scan has two phases: an up-sweep (reduce) and a down-sweep, performing O(N) total additions across roughly 2 log₂ N steps. Hillis-Steele is simpler — at each step, every element adds a value from distance 2^step behind — but ALL N elements do work at every step, totaling O(N log N) additions across log₂ N steps. Hillis-Steele is 'step-efficient' (minimum parallel depth) but not 'work-efficient'. On GPUs with enough parallelism, Hillis-Steele's simplicity can sometimes win despite the extra work, but Blelloch scales better for large arrays.
7
In the Blelloch (work-efficient) exclusive scan, what are the two phases?
A) A map phase followed by a gather phase
B) An up-sweep (parallel reduction) phase followed by a down-sweep (distribution) phase
C) A scatter phase followed by a compact phase
D) A sort phase followed by a merge phase
Blelloch's algorithm has two distinct phases: (1) Up-sweep: a standard parallel reduction that builds partial sums in a tree — after this phase, the last element contains the total sum. (2) Down-sweep: the last element is replaced with 0 (identity), then the tree is traversed top-down, distributing partial sums to produce the exclusive scan. Each phase takes O(log N) steps with O(N) total operations across both. This two-phase structure is why it's called a 'reduce-then-scan' pattern.
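Both sweeps can be traced in sequential Python (a model of the data movement, not CUDA code; it assumes a power-of-two length):

```python
def blelloch_exclusive_scan(data):
    """Sequential model of the Blelloch work-efficient exclusive scan."""
    a = list(data)
    n = len(a)
    # Up-sweep (reduce): build partial sums up a binary tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # Down-sweep: clear the root (identity), then walk back down,
    # swapping and adding to distribute prefixes.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a
```

Each `while` iteration corresponds to one parallel step on the GPU; the inner `for` loop is what all active threads would do simultaneously.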
8
Why does a naive global-memory GPU histogram using atomicAdd suffer poor performance?
A) atomicAdd is not supported on global memory
B) Massive contention — thousands of threads across all blocks compete to atomically update the same few histogram bins
C) Global memory does not support integer operations
D) The histogram bins overflow because global memory is 8-bit only
When thousands of threads from many blocks all call atomicAdd on the same global memory bins, the atomic unit serializes access — only one thread at a time can update a given bin. For skewed distributions where most values map to a few bins, this contention is catastrophic. The solution is privatized histograms: each block builds its own histogram in shared memory (only ~256 threads compete), then a single pass merges per-block histograms into global memory. This reduces contention by a factor of (total_threads / block_size).
9
What is a 'privatized histogram' in CUDA and how does it reduce atomic contention?
A) Each thread maintains its own histogram in registers, merged at the end via warp shuffle
B) Each block maintains a private histogram in shared memory; only threads within the same block compete for atomics, then blocks merge to global
C) A single block computes the entire histogram to avoid inter-block contention
D) The histogram bins are replicated across multiple GPUs
Privatized histograms allocate one histogram copy per block in shared memory. Phase 1: each thread in a block uses atomicAdd on the shared-memory histogram — contention is limited to blockDim.x threads (e.g., 256) instead of the entire grid. Phase 2: threads in each block add their shared-memory histogram bins into the global histogram using atomicAdd. Since phase 2 has only (num_blocks × num_bins) atomics instead of (total_elements) atomics, contention drops dramatically. Option A (per-thread in registers) is possible for very small bin counts but impractical for typical histograms.
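The two phases can be modeled sequentially, with each chunk standing in for one block (`privatized_histogram` is an illustrative name, not a library call):

```python
def privatized_histogram(data, num_bins, block_size=256):
    """Sequential model of the two-phase privatized histogram.
    Phase 1: each 'block' fills a private copy (shared memory on a GPU).
    Phase 2: private copies are merged into the global histogram."""
    global_hist = [0] * num_bins
    for start in range(0, len(data), block_size):
        private = [0] * num_bins            # per-block shared-memory copy
        for x in data[start:start + block_size]:
            private[x] += 1                 # atomicAdd within one block only
        for b in range(num_bins):           # one atomicAdd per bin per block
            global_hist[b] += private[b]
    return global_hist
```

The result matches a naive single-pass histogram; only the contention pattern differs.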
10
In a 2D stencil computation (e.g., a 5-point stencil), what are 'halo cells' (ghost zones)?
A) Extra threads launched beyond the data boundary to handle padding
B) Border elements from neighboring tiles loaded into a block's shared memory so boundary threads can access their neighbors
C) Unused elements at the center of each tile reserved for alignment
D) Special hardware registers that cache frequently accessed stencil coefficients
In tiled stencil computation, each block loads a tile of data into shared memory. But threads at the tile boundary need neighbor values that belong to adjacent tiles. Halo cells are the extra border elements loaded from neighboring tiles into shared memory. For a radius-1 stencil (5-point), each tile of size T×T must load (T+2)×(T+2) elements — the extra ring of width 1 around the tile is the halo. Without halo loading, boundary threads would need slow global memory reads for every neighbor access, defeating the purpose of tiling.
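The index arithmetic for staging a tile plus its halo can be sketched in Python (a model of what one block loads, with out-of-grid cells zero-padded; the function name is illustrative):

```python
def tile_with_halo(grid, ty, tx, T, radius=1):
    """Gather the (T+2r) x (T+2r) region a block would stage in shared
    memory for a radius-r stencil on tile (ty, tx) of a 2D grid."""
    H, W = len(grid), len(grid[0])
    y0, x0 = ty * T - radius, tx * T - radius
    return [[grid[y][x] if 0 <= y < H and 0 <= x < W else 0
             for x in range(x0, x0 + T + 2 * radius)]
            for y in range(y0, y0 + T + 2 * radius)]
```

For a 2x2 tile with radius 1, each block stages a 4x4 region; the outer ring is the halo.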
11
What is 'stream compaction' (parallel compact) and which parallel primitive is essential for implementing it?
A) Removing all zero elements from an array; requires parallel sort
B) Keeping only elements that satisfy a predicate and packing them contiguously; requires exclusive prefix scan
C) Compressing an array using run-length encoding; requires parallel reduction
D) Dividing an array into equal-sized chunks for load balancing; requires parallel partition
Stream compaction takes an input array and a predicate (e.g., x > 0), and produces an output array containing only elements that pass the predicate, packed contiguously with no gaps. The algorithm: (1) Evaluate the predicate → produce a flags array of 0s and 1s. (2) Exclusive prefix scan on flags → produces write indices. (3) Scatter: if flags[i]==1, write input[i] to output[scan[i]]. The exclusive scan is essential because scan[i] gives the exact output position for element i. This pattern is fundamental in ray tracing (compacting active rays), particle simulations, and sparse computations.
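The three steps map directly to a few lines of Python (a sequential sketch; on a GPU, steps 1 and 3 are elementwise kernels and step 2 is a parallel scan):

```python
from itertools import accumulate

def compact(data, pred):
    """Stream compaction via flags + exclusive scan + scatter."""
    flags = [1 if pred(x) else 0 for x in data]       # 1) predicate
    offsets = [0] + list(accumulate(flags[:-1]))      # 2) exclusive scan
    out = [None] * sum(flags)
    for i, x in enumerate(data):                      # 3) scatter
        if flags[i]:
            out[offsets[i]] = x
    return out

compact([3, -1, 4, 0, -5, 9], lambda x: x > 0)  # [3, 4, 9]
```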
12
Which correctly describes the 'map', 'scatter', and 'gather' parallel patterns?
A) Map: apply a function to each element independently; Gather: read from irregular source locations into contiguous output; Scatter: write to irregular destination locations from contiguous input
B) Map: sort elements; Gather: collect results from multiple GPUs; Scatter: distribute work to blocks
C) Map: reduce elements; Gather: prefix scan; Scatter: histogram
D) All three are synonyms for the same elementwise operation
These are three fundamental data-parallel patterns. Map: out[i] = f(in[i]) — each element transformed independently, embarrassingly parallel. Gather: out[i] = in[index[i]] — reads from irregular (indexed) locations, writes contiguously. Scatter: out[index[i]] = in[i] — reads contiguously, writes to irregular locations. Scatter can have write conflicts if two elements map to the same index (solved with atomics). On GPUs, gather is generally preferred over scatter: its writes are coalesced by construction, its irregular reads can often be serviced by the cache hierarchy, and it avoids write conflicts entirely.
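In Python, each pattern is a line or two (sequential sketches; `pmap`, `gather`, and `scatter` are illustrative names):

```python
def pmap(f, src):
    return [f(x) for x in src]        # out[i] = f(in[i])

def gather(src, idx):
    return [src[j] for j in idx]      # out[i] = in[index[i]]

def scatter(src, idx, n):
    out = [0] * n
    for i, j in enumerate(idx):
        out[j] = src[i]               # out[index[i]] = in[i]
    return out
```

Note that `scatter` silently overwrites on duplicate indices — the sequential stand-in for the write conflicts that atomics must resolve on a GPU.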
13
In the 'reduce-then-scan' pattern for large arrays that don't fit in a single block, what is the correct sequence of operations?
A) One global scan kernel solves the entire array in a single pass
B) (1) Each block scans its chunk locally, (2) block-level totals are reduced, (3) a scan of block totals is computed, (4) block-level scan results are added back to each block's elements
C) (1) Sort the array, (2) reduce adjacent pairs, (3) output the result
D) (1) Partition into warps, (2) each warp writes directly to output
For arrays larger than one block, a single block-level scan is insufficient. The reduce-then-scan (also called 'scan-then-propagate') approach: (1) Each block performs a local scan of its chunk and writes the block's total sum to an auxiliary array. (2) The auxiliary array of block totals is scanned (recursively if needed). (3) Each block adds the scanned block-total prefix to all its elements. This produces a correct global scan in O(N) work and O(log²N) steps. Libraries like CUB implement this pattern with optimizations like decoupled look-back to achieve single-pass performance.
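The four steps can be checked in sequential Python (a model in which each chunk stands in for one block; `device_scan` is an illustrative name):

```python
from itertools import accumulate

def exclusive_scan(xs):
    return [0] + list(accumulate(xs[:-1])) if xs else []

def device_scan(data, block=4):
    """Model of the multi-block scan: (1) per-block local scans,
    (2) block totals, (3) scan of the totals, (4) add offsets back."""
    chunks = [data[i:i + block] for i in range(0, len(data), block)]
    local = [exclusive_scan(c) for c in chunks]              # (1)
    totals = [sum(c) for c in chunks]                        # (2)
    offsets = exclusive_scan(totals)                         # (3)
    return [v + off for s, off in zip(local, offsets) for v in s]  # (4)
```

The result matches a single exclusive scan over the whole array, which is exactly the correctness condition for the multi-block version.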
14
What is the trade-off between 'work complexity' and 'step complexity' in parallel algorithms, and how does it apply to Hillis-Steele vs. Blelloch scan?
A) Work complexity = total operations performed; Step complexity = number of parallel rounds. Hillis-Steele has fewer steps but more total work; Blelloch has optimal work but more steps
B) Work complexity and step complexity are the same metric, just named differently
C) Work complexity refers to memory usage; step complexity refers to the number of kernel launches
D) Hillis-Steele has both lower work and step complexity, making it strictly superior
Work complexity measures the total number of operations (analogous to sequential time). Step complexity measures the number of parallel rounds (the depth of the computation DAG). Hillis-Steele: O(N log N) work, log₂ N steps — step-efficient (minimum parallel depth) but with redundant work. Blelloch: O(N) work, roughly 2 log₂ N steps — work-efficient (matches the sequential cost) but with about twice the depth. On a GPU with limited parallelism (fewer processors than N), Blelloch is usually better because the extra work translates to real time. With unlimited processors, Hillis-Steele's lower step count wins.
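Counting additions for both scans makes the trade-off concrete (a sequential tally under the standard formulations of each algorithm; power-of-two N assumed):

```python
def hillis_steele_ops(n):
    """Additions and steps for a Hillis-Steele inclusive scan: O(N log N) work."""
    ops, steps, d = 0, 0, 1
    while d < n:
        ops += n - d    # every element with a neighbor d behind adds once
        steps += 1
        d *= 2
    return ops, steps

def blelloch_ops(n):
    """Additions and steps for a Blelloch scan (both sweeps): O(N) work."""
    ops, steps, d = 0, 0, 1
    while d < n:        # up-sweep: n/(2d) tree nodes add at distance d
        ops += n // (2 * d)
        steps += 1
        d *= 2
    d = n // 2
    while d >= 1:       # down-sweep: same node count, one add per node
        ops += n // (2 * d)
        steps += 1
        d //= 2
    return ops, steps
```

For N = 1024, Hillis-Steele performs 9217 additions in 10 steps while Blelloch performs 2046 (just under 2N) in 20 steps.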
15
Which statement about the CUB and Thrust libraries for parallel patterns in CUDA is correct?
A) CUB provides low-level, block-cooperative primitives (BlockReduce, BlockScan) with explicit control; Thrust provides high-level STL-like algorithms (thrust::reduce, thrust::exclusive_scan) that manage launches automatically
B) CUB is a Python library; Thrust is a CUDA C++ library
C) Thrust is lower-level than CUB and requires manual shared memory management
D) Neither CUB nor Thrust supports parallel scan — developers must write custom kernels
CUB (CUDA UnBound) provides reusable, block-level and device-level primitives like cub::BlockReduce, cub::BlockScan, cub::DeviceReduce, and cub::DeviceHistogram. It gives developers fine-grained control over shared memory usage, tile sizes, and algorithm selection. Thrust is a higher-level, STL-like parallel algorithms library — you call thrust::reduce(d_vec.begin(), d_vec.end()) and it handles kernel configuration, memory management, and algorithm selection automatically. Both are shipped with the CUDA Toolkit. CUB is ideal when you need to embed scan/reduce inside your own kernels; Thrust is ideal for rapid prototyping and when performance tuning is less critical.