CUDA Programming Parallel Patterns
💡
Exercise 31

Parallel Reduction 25 XP Hard

Ctrl+Enter Run Ctrl+S Save

🌳 Chapter 7, Part 1: Parallel Reduction — Divide to Conquer

💡 Story: Summing 1 million numbers on a CPU takes 1 million steps. But what if you had 1 million soldiers? Round 1: pairs add up → 500,000 results. Round 2: pairs add again → 250,000. After only 20 rounds (log₂ 1,000,000 ≈ 20), you're done! That's parallel reduction — the most fundamental GPU algorithm.

// Parallel Reduction for sum — tree-based approach __global__ void reduce(float* data, float* blockSums, int n) { __shared__ float sdata[256]; int tid = threadIdx.x; int gid = threadIdx.x + blockIdx.x * blockDim.x; // Step 1: Load data into shared memory sdata[tid] = (gid < n) ? data[gid] : 0.0f; __syncthreads(); // Step 2: Tree reduction within the block // Round 1: stride=128 → threads 0-127 add from threads 128-255 // Round 2: stride=64 → threads 0-63 add from threads 64-127 // Round 3: stride=32 → threads 0-31 add from threads 32-63 // ...and so on, halving active threads each round for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) { if (tid < stride) { sdata[tid] += sdata[tid + stride]; // Combine pair } __syncthreads(); // Wait before next round! } // Step 3: Thread 0 holds the block's total if (tid == 0) blockSums[blockIdx.x] = sdata[0]; // Final sum = add all blockSums on CPU (or launch another reduction kernel) }

Why log₂(n) rounds?

  • 🔢 n elements → Round 1 produces n/2 partial sums
  • 📉 Each round halves — n/4, n/8, ..., 1
  • ⏱️ Total rounds = log₂(n) — For 1M elements: ~20 rounds vs 1M serial steps
  • 🚀 Speedup = n / log₂(n) — Enormous for large n!
  • 🧩 Works for any associative op — Sum, max, min, product, AND, OR, XOR
// Visualizing 8-element reduction: // Data: [3, 1, 4, 1, 5, 9, 2, 6] // Round 1 (stride=4): [3+5, 1+9, 4+2, 1+6, _, _, _, _] = [8, 10, 6, 7] // Round 2 (stride=2): [8+6, 10+7, _, _, _, _, _, _] = [14, 17] // Round 3 (stride=1): [14+17, _, _, _, _, _, _, _] = [31] // Answer: 31 ✓ (3+1+4+1+5+9+2+6 = 31)
📋 Instructions
Simulate parallel reduction on [3, 1, 4, 1, 5, 9, 2, 6] and show all intermediate steps: ``` === Parallel Reduction (Sum) === Input: [3, 1, 4, 1, 5, 9, 2, 6] Elements: 8, Rounds needed: 3 Round 1 (stride=4): [8, 10, 6, 7, _, _, _, _] Round 2 (stride=2): [14, 17, _, _, _, _, _, _] Round 3 (stride=1): [31, _, _, _, _, _, _, _] Final Sum: 31 Serial steps (CPU): 7 Parallel rounds (GPU): 3 Speedup factor: 2.33x ```
The loop 'for stride = n/2 down to 1' is the tree reduction. After each round, active threads halve. After log2(8)=3 rounds, arr[0] holds the total. The starter code's print loop needs fixing to show '_' for inactive positions — update the print condition to: `i < n/(n/(stride*2))` or simplify by tracking active length.
main.py
Hi! I'm Rex 👋
Output
Ready. Press ▶ Run or Ctrl+Enter.