🌳 Chapter 7, Part 1: Parallel Reduction — Divide to Conquer
💡 Story: Summing 1 million numbers on a CPU takes 1 million steps. But what if you had 1 million soldiers? Round 1: pairs add up → 500,000 results. Round 2: pairs add again → 250,000. After only 20 rounds (log₂ 1,000,000 ≈ 20), you're done! That's parallel reduction — the most fundamental GPU algorithm.
// Parallel Reduction for sum — tree-based approach
__global__ void reduce(float* data, float* blockSums, int n) {
__shared__ float sdata[256];
int tid = threadIdx.x;
int gid = threadIdx.x + blockIdx.x * blockDim.x;
// Step 1: Load data into shared memory
sdata[tid] = (gid < n) ? data[gid] : 0.0f;
__syncthreads();
// Step 2: Tree reduction within the block
// Round 1: stride=128 → threads 0-127 add from threads 128-255
// Round 2: stride=64 → threads 0-63 add from threads 64-127
// Round 3: stride=32 → threads 0-31 add from threads 32-63
// ...and so on, halving active threads each round
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
if (tid < stride) {
sdata[tid] += sdata[tid + stride]; // Combine pair
}
__syncthreads(); // Wait before next round!
}
// Step 3: Thread 0 holds the block's total
if (tid == 0) blockSums[blockIdx.x] = sdata[0];
// Final sum = add all blockSums on CPU (or launch another reduction kernel)
}
// Visualizing 8-element reduction:
// Data: [3, 1, 4, 1, 5, 9, 2, 6]
// Round 1 (stride=4): [3+5, 1+9, 4+2, 1+6, _, _, _, _] = [8, 10, 6, 7]
// Round 2 (stride=2): [8+6, 10+7, _, _, _, _, _, _] = [14, 17]
// Round 3 (stride=1): [14+17, _, _, _, _, _, _, _] = [31]
// Answer: 31 ✓ (3+1+4+1+5+9+2+6 = 31)
📋 Instructions
Simulate parallel reduction on [3, 1, 4, 1, 5, 9, 2, 6] and show all intermediate steps:
```
=== Parallel Reduction (Sum) ===
Input: [3, 1, 4, 1, 5, 9, 2, 6]
Elements: 8, Rounds needed: 3
Round 1 (stride=4): [8, 10, 6, 7, _, _, _, _]
Round 2 (stride=2): [14, 17, _, _, _, _, _, _]
Round 3 (stride=1): [31, _, _, _, _, _, _, _]
Final Sum: 31
Serial steps (CPU): 7
Parallel rounds (GPU): 3
Speedup factor: 2.33x
```
The loop 'for stride = n/2 down to 1' is the tree reduction. After each round, active threads halve. After log2(8)=3 rounds, arr[0] holds the total. The starter code's print loop needs fixing to show '_' for inactive positions — update the print condition to: `i < n/(n/(stride*2))` or simplify by tracking active length.