CUDA Programming Parallel Patterns
💡
Exercise 32

Parallel Prefix Scan 25 XP Hard

Ctrl+Enter Run Ctrl+S Save

🔢 Chapter 7, Part 2: Prefix Scan — Every Result Builds on the Last

💡 Story: Imagine 8 scouts each counting gold coins in their sector. The scout in sector 2 wants to know the TOTAL found by all scouts in sectors 0, 1, and 2 combined. Sector 3 wants sectors 0-3. Every scout wants a running total! That's a prefix scan — and GPUs can compute ALL 8 running totals simultaneously.

// Exclusive prefix scan (scan[i] = sum of all elements BEFORE i) // Input: [1, 2, 3, 4] // Output: [0, 1, 3, 6] ← each output is sum of previous inputs // Inclusive prefix scan (scan[i] = sum INCLUDING element i) // Input: [1, 2, 3, 4] // Output: [1, 3, 6, 10] ← each output includes current element __global__ void inclusiveScan(int* data, int* out, int n) { __shared__ int sdata[256]; int tid = threadIdx.x; sdata[tid] = (tid < n) ? data[tid] : 0; __syncthreads(); // Up-sweep (reduce) phase — build partial sums for (int stride = 1; stride < n; stride <<= 1) { int val = 0; if (tid >= stride) val = sdata[tid - stride]; __syncthreads(); sdata[tid] += val; // Add left neighbor at distance=stride __syncthreads(); } // After this loop, sdata[i] = sum of data[0..i] (inclusive scan) if (tid < n) out[tid] = sdata[tid]; }

Where is prefix scan used?

  • 📦 Stream compaction — Filter array, keep only matching elements (scatter indices)
  • 📊 Histogram normalization — Cumulative distribution function (CDF) for image equalization
  • 🧮 Parallel sorting — Radix sort uses prefix scan for offset computation
  • 🗜️ Data compression — Run-length encoding with parallel output addressing
  • 🤖 Deep learning — CTC (Connectionist Temporal Classification) loss functions
// Visualizing inclusive scan on [1, 2, 3, 4, 5, 6, 7, 8]: // Step 1 (stride=1): each element += left neighbor // [1, 1+2, 2+3, 3+4, 4+5, 5+6, 6+7, 7+8] = [1, 3, 5, 7, 9, 11, 13, 15] // Step 2 (stride=2): each element += element 2 positions left // [1, 3, 1+5, 3+7, 9, 11, 9+13, 11+15] = [1, 3, 6, 10, 9, 11, 22, 26] // Hmm, this needs careful tracking — the Hillis-Steele algorithm! // Full scan of [1,2,3,4,5,6,7,8] → [1,3,6,10,15,21,28,36]
📋 Instructions
Compute both exclusive and inclusive prefix scans on [1, 2, 3, 4, 5]: ``` === Prefix Scan Demo === Input: [1, 2, 3, 4, 5] Inclusive Scan (includes self): out[0] = 1 out[1] = 3 out[2] = 6 out[3] = 10 out[4] = 15 Result: [1, 3, 6, 10, 15] Exclusive Scan (sum of all BEFORE): out[0] = 0 out[1] = 1 out[2] = 3 out[3] = 6 out[4] = 10 Result: [0, 1, 3, 6, 10] Note: exclusive[i] = inclusive[i-1] (with exclusive[0] = 0) ```
Inclusive scan: out[i] = sum of data[0..i]. Serial loop: incl[0]=data[0], incl[i]=incl[i-1]+data[i]. Exclusive: excl[0]=0, excl[i]=incl[i-1]. In GPU, the Hillis-Steele algorithm does this in log₂(n) rounds with all-thread parallelism.
main.py
Hi! I'm Rex 👋
Output
Ready. Press ▶ Run or Ctrl+Enter.