Parallel Reduction 25 XP Hard

Ctrl+Enter Run Ctrl+S Save

🌳 Chapter 7, Part 1: Parallel Reduction — Divide to Conquer

💡 Story: Summing 1 million numbers on a CPU takes 1 million steps. But what if you had 1 million soldiers? Round 1: pairs add up → 500,000 results. Round 2: pairs add again → 250,000. After only 20 rounds (log₂ 1,000,000 ≈ 20), you're done! That's parallel reduction — the most fundamental GPU algorithm.

// Parallel Reduction for sum — tree-based approach
__global__ void reduce(float* data, float* blockSums, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    
    // Step 1: Load data into shared memory
    sdata[tid] = (gid < n) ? data[gid] : 0.0f;
    __syncthreads();
    
    // Step 2: Tree reduction within the block
    // Round 1: stride=128 → threads 0-127 add from threads 128-255
    // Round 2: stride=64  → threads 0-63  add from threads 64-127
    // Round 3: stride=32  → threads 0-31  add from threads 32-63
    // ...and so on, halving active threads each round
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride]; // Combine pair
        }
        __syncthreads(); // Wait before next round!
    }
    
    // Step 3: Thread 0 holds the block's total
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
    // Final sum = add all blockSums on CPU (or launch another reduction kernel)
}

Why log₂(n) rounds?

🔢 n elements → Round 1 produces n/2 partial sums
📉 Each round halves — n/4, n/8, ..., 1
⏱️ Total rounds = log₂(n) — For 1M elements: ~20 rounds vs 1M serial steps
🚀 Speedup = n / log₂(n) — Enormous for large n!
🧩 Works for any associative op — Sum, max, min, product, AND, OR, XOR

// Visualizing 8-element reduction:
// Data:     [3, 1, 4, 1, 5, 9, 2, 6]
// Round 1 (stride=4): [3+5, 1+9, 4+2, 1+6, _, _, _, _] = [8, 10, 6, 7]
// Round 2 (stride=2): [8+6, 10+7, _, _, _, _, _, _]     = [14, 17]
// Round 3 (stride=1): [14+17, _, _, _, _, _, _, _]       = [31]
// Answer: 31 ✓ (3+1+4+1+5+9+2+6 = 31)

📋 Instructions

Simulate parallel reduction on [3, 1, 4, 1, 5, 9, 2, 6] and show all intermediate steps: ``` === Parallel Reduction (Sum) === Input: [3, 1, 4, 1, 5, 9, 2, 6] Elements: 8, Rounds needed: 3 Round 1 (stride=4): [8, 10, 6, 7, _, _, _, _] Round 2 (stride=2): [14, 17, _, _, _, _, _, _] Round 3 (stride=1): [31, _, _, _, _, _, _, _] Final Sum: 31 Serial steps (CPU): 7 Parallel rounds (GPU): 3 Speedup factor: 2.33x ```

The loop 'for stride = n/2 down to 1' is the tree reduction. After each round, active threads halve. After log2(8)=3 rounds, arr[0] holds the total. The starter code's print loop needs fixing to show '_' for inactive positions — update the print condition to: `i < n/(n/(stride*2))` or simplify by tracking active length.

← Previous Next Exercise →

main.py

Hi! I'm Rex 👋

Output

Ready. Press ▶ Run or Ctrl+Enter.

›