CUDA Programming Optimization — Make It Blazing Fast
Exercise 27

Warp Divergence · 20 XP · Medium


🔀 Chapter 6, Part 2: Warp Divergence — When Your Army Splits Up

💡 Story: Remember warps — 32 soldiers that ALWAYS execute the same instruction? Now imagine half the warp needs to go left and half needs to go right at a fork in the road. They CAN'T split up! So half goes left while the other half WAITS. Then they swap. Two trips! That's warp divergence.

```cuda
// ❌ DIVERGENT: Even/odd threads take different paths
__global__ void divergent(float* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;          // Bounds check (missing in the naive version)
    if (i % 2 == 0) {            // Even threads do A
        arr[i] = arr[i] * 2;     // Step 1
        arr[i] += 1;             // Step 2
    } else {                     // Odd threads do B
        arr[i] = arr[i] / 2;     // Step 3
        arr[i] -= 1;             // Step 4
    }
    // Execution: warp runs A for evens (odds idle), then runs B for odds (evens idle)
    // 2x slower than if all threads took the same path!
}

// ✅ NO DIVERGENCE: All threads in a warp take the same path
__global__ void noDivergence(float* arr, int n, int mode) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (mode == 0) {             // 'mode' is uniform → ENTIRE kernel takes one branch
        if (i < n) arr[i] *= 2;
    } else {
        if (i < n) arr[i] /= 2;
    }
}
```

How divergence actually works:

  • 🔀 Threads in a warp diverge — They take different if/else paths
  • 🎭 GPU uses masks — It runs the 'if' path with non-if threads MASKED OUT (inactive)
  • 🔄 Then re-runs — The 'else' path with if-threads masked out
  • ⏱️ Time cost — Worst case: 2× slower (50% of threads idle at all times)
  • 🔧 Fix — Reorganize data so threads within a warp take the same path
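The mask mechanism in the steps above can be sketched in plain Python. This is a conceptual model only, not real hardware behavior; the thread IDs, the `taken` predicate, and the `paths` list are all illustrative:

```python
# Conceptual model of per-warp execution masks (illustrative only).
WARP_SIZE = 32
threads = list(range(WARP_SIZE))

# Predicate from the divergent kernel: i % 2 == 0
taken = [t % 2 == 0 for t in threads]

# The hardware serializes each distinct path, masking out the others.
paths = []
if any(taken):
    paths.append(("if-path", [t for t, p in zip(threads, taken) if p]))
if not all(taken):
    paths.append(("else-path", [t for t, p in zip(threads, taken) if not p]))

for name, active in paths:
    print(f"{name}: {len(active)}/{WARP_SIZE} threads active, "
          f"{WARP_SIZE - len(active)} masked out")
# Two serialized passes over the warp instead of one → the 2x worst case.
```

If every thread takes the same branch, `paths` contains a single entry and the warp finishes in one pass, which is exactly why the fix is to keep branches uniform within a warp.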
```cuda
// OPTIMIZATION: Align divergence with warp boundaries
// Instead of: if (i % 2 == 0) → diverges within every warp!
// Better: assign all 'even' work to the first half of threads in each block
__global__ void optimized(float* a, float* b, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // Warp 0 (threads 0-31)  → all process 'a' array
    // Warp 1 (threads 32-63) → all process 'b' array
    // No divergence within a warp!
    if (threadIdx.x < blockDim.x / 2) {            // First half of block
        out[i] = a[i] * 2.0f;
    } else {                                       // Second half of block
        out[i] = b[i - blockDim.x / 2] * 3.0f;
    }
    // Threads 0-31 are in the same warp, threads 32-63 in the same warp
    // Each warp takes ONE consistent path → no divergence!
}
```
📋 Instructions

Simulate warp divergence performance. For a warp of 8 threads (simplified), measure 'execution cycles' when divergent vs. non-divergent:

```
=== Warp Divergence Simulation ===
Warp size (simplified): 8 threads

--- Divergent code (if i%2==0) ---
Round 1: Even threads (0,2,4,6) execute branch A [Odds IDLE]
Round 2: Odd threads (1,3,5,7) execute branch B [Evens IDLE]
Execution rounds: 2
Efficiency: 50%

--- Non-divergent code ---
Round 1: ALL threads execute same path
Execution rounds: 1
Efficiency: 100%

Speedup from removing divergence: 2.00x
```
Run the code as-is. Make sure `100/divergentRounds` uses floating-point (not integer) division so the efficiency prints as 50% for 2 rounds. The key lesson: splitting a warp into 2 paths doubles execution time. Minimize branches that split threads within the same warp.
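One possible Python sketch that produces the target output shown in the instructions. The variable name `divergentRounds` comes from the hint above; everything else (variable names, formatting helpers) is an assumption:

```python
# Simplified warp divergence simulation (one possible implementation).
WARP_SIZE = 8
threads = list(range(WARP_SIZE))

print("=== Warp Divergence Simulation ===")
print(f"Warp size (simplified): {WARP_SIZE} threads")

print("\n--- Divergent code (if i%2==0) ---")
evens = [t for t in threads if t % 2 == 0]
odds = [t for t in threads if t % 2 != 0]
print(f"Round 1: Even threads ({','.join(map(str, evens))}) "
      f"execute branch A [Odds IDLE]")
print(f"Round 2: Odd threads ({','.join(map(str, odds))}) "
      f"execute branch B [Evens IDLE]")
divergentRounds = 2
print(f"Execution rounds: {divergentRounds}")
print(f"Efficiency: {100 / divergentRounds:.0f}%")  # float division → 50%

print("\n--- Non-divergent code ---")
print("Round 1: ALL threads execute same path")
uniformRounds = 1
print(f"Execution rounds: {uniformRounds}")
print(f"Efficiency: {100 / uniformRounds:.0f}%")

print(f"\nSpeedup from removing divergence: "
      f"{divergentRounds / uniformRounds:.2f}x")
```

The `:.0f` format avoids Python's float division printing `50.0%`, and the speedup is simply the ratio of serialized rounds.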