🔀 Chapter 6, Part 2: Warp Divergence — When Your Army Splits Up
💡 Story: Remember warps — 32 soldiers that ALWAYS execute the same instruction? Now imagine half the warp needs to go left and half needs to go right at a fork in the road. They CAN'T split up! So half goes left while the other half WAITS. Then they swap. Two trips! That's warp divergence.
// ❌ DIVERGENT: Even/odd threads take different paths
__global__ void divergent(float* arr, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i % 2 == 0) { // Even threads do A
arr[i] = arr[i] * 2; // Step 1
arr[i] += 1; // Step 2
} else { // Odd threads do B
arr[i] = arr[i] / 2; // Step 3
arr[i] -= 1; // Step 4
}
// Execution: Warp runs A for evens (odds idle), then runs B for odds (evens idle)
// 2x slower than if all threads took the same path!
}
// ✅ NO DIVERGENCE: All threads in a warp take the same path
__global__ void noDivergence(float* arr, int n, int mode) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (mode == 0) { // ENTIRE kernel uses same branch
if (i < n) arr[i] *= 2;
} else {
if (i < n) arr[i] /= 2;
}
}
How divergence actually works:
// OPTIMIZATION: Align divergence with warp boundaries
// Instead of: if (i % 2 == 0) → diverges within every warp!
// Better: Assign all 'even' work to first half of threads in each block
__global__ void optimized(float* a, float* b, float* out, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
// Warp 0 (threads 0-31) → all process 'a' array
// Warp 1 (threads 32-63) → all process 'b' array
// No divergence within a warp!
if (threadIdx.x < blockDim.x / 2) { // First half of block
out[i] = a[i] * 2.0f;
} else { // Second half of block
out[i] = b[i - blockDim.x/2] * 3.0f;
}
// Threads 0-31 are in same warp, threads 32-63 in same warp
// Each warp takes ONE consistent path → no divergence!
}
📋 Instructions
Simulate warp divergence performance. For a warp of 8 threads (simplified), measure 'execution cycles' when divergent vs non-divergent:
```
=== Warp Divergence Simulation ===
Warp size (simplified): 8 threads
--- Divergent code (if i%2==0) ---
Round 1: Even threads (0,2,4,6) execute branch A [Odds IDLE]
Round 2: Odd threads (1,3,5,7) execute branch B [Evens IDLE]
Execution rounds: 2
Efficiency: 50%
--- Non-divergent code ---
Round 1: ALL threads execute same path
Execution rounds: 1
Efficiency: 100%
Speedup from removing divergence: 2.00x
```
Run the code as-is. Change 100/divergentRounds from integer to give 50% for 2 rounds. The key lesson: splitting a warp into 2 paths doubles execution time. Minimize branches that split threads within the same warp.