CUDA Programming Optimization — Make It Blazing Fast
Exercise 27

Warp Divergence · 20 XP · Medium


🔀 Chapter 6, Part 2: Warp Divergence — When Your Army Splits Up

💡 Story: Remember warps — 32 soldiers that ALWAYS execute the same instruction? Now imagine half the warp needs to go left and half needs to go right at a fork in the road. They CAN'T split up! So half goes left while the other half WAITS. Then they swap. Two trips! That's warp divergence.

```cuda
// ❌ DIVERGENT: Even/odd threads take different paths
__global__ void divergent(float* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;          // Bounds check (missing in the naive version)
    if (i % 2 == 0) {            // Even threads do A
        arr[i] = arr[i] * 2;     // Step 1
        arr[i] += 1;             // Step 2
    } else {                     // Odd threads do B
        arr[i] = arr[i] / 2;     // Step 3
        arr[i] -= 1;             // Step 4
    }
    // Execution: warp runs A for evens (odds idle), then runs B for odds (evens idle)
    // 2x slower than if all threads took the same path!
}

// ✅ NO DIVERGENCE: All threads in a warp take the same path
__global__ void noDivergence(float* arr, int n, int mode) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (mode == 0) {             // 'mode' is uniform → ENTIRE kernel takes one branch
        if (i < n) arr[i] *= 2;
    } else {
        if (i < n) arr[i] /= 2;
    }
}
```

How divergence actually works:

  • 🔀 Threads in a warp diverge — They take different if/else paths
  • 🎭 GPU uses masks — It runs the 'if' path with non-if threads MASKED OUT (inactive)
  • 🔄 Then re-runs — The 'else' path with if-threads masked out
  • ⏱️ Time cost — Worst case: 2× slower (50% of threads idle at all times)
  • 🔧 Fix — Reorganize data so threads within a warp take the same path
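The mask mechanism in the steps above can be sketched in plain Python. This is a conceptual model only, not real hardware behavior; the thread IDs, the `taken` predicate, and the `paths` list are all illustrative:

```python
# Conceptual model of per-warp execution masks (illustrative only).
WARP_SIZE = 32
threads = list(range(WARP_SIZE))

# Predicate from the divergent kernel: i % 2 == 0
taken = [t % 2 == 0 for t in threads]

# The hardware serializes each distinct path, masking out the others.
paths = []
if any(taken):
    paths.append(("if-path", [t for t, p in zip(threads, taken) if p]))
if not all(taken):
    paths.append(("else-path", [t for t, p in zip(threads, taken) if not p]))

for name, active in paths:
    print(f"{name}: {len(active)}/{WARP_SIZE} threads active, "
          f"{WARP_SIZE - len(active)} masked out")
# Two serialized passes over the warp instead of one → the 2x worst case.
```

If every thread takes the same branch, `paths` contains a single entry and the warp finishes in one pass, which is exactly why the fix is to keep branches uniform within a warp.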
```cuda
// OPTIMIZATION: Align divergence with warp boundaries
// Instead of: if (i % 2 == 0) → diverges within every warp!
// Better: assign all 'even' work to the first half of threads in each block
__global__ void optimized(float* a, float* b, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // Warp 0 (threads 0-31)  → all process 'a' array
    // Warp 1 (threads 32-63) → all process 'b' array
    // No divergence within a warp!
    if (threadIdx.x < blockDim.x / 2) {            // First half of block
        out[i] = a[i] * 2.0f;
    } else {                                       // Second half of block
        out[i] = b[i - blockDim.x / 2] * 3.0f;
    }
    // Threads 0-31 are in the same warp, threads 32-63 in the same warp
    // Each warp takes ONE consistent path → no divergence!
}
```
📋 Instructions

Simulate warp divergence performance. For a warp of 8 threads (simplified), measure 'execution cycles' when divergent vs. non-divergent:

```
=== Warp Divergence Simulation ===
Warp size (simplified): 8 threads

--- Divergent code (if i%2==0) ---
Round 1: Even threads (0,2,4,6) execute branch A [Odds IDLE]
Round 2: Odd threads (1,3,5,7) execute branch B [Evens IDLE]
Execution rounds: 2
Efficiency: 50%

--- Non-divergent code ---
Round 1: ALL threads execute same path
Execution rounds: 1
Efficiency: 100%

Speedup from removing divergence: 2.00x
```
Run the code as-is. Make sure `100/divergentRounds` uses floating-point (not integer) division so the efficiency prints as 50% for 2 rounds. The key lesson: splitting a warp into 2 paths doubles execution time. Minimize branches that split threads within the same warp.
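One possible Python sketch that produces the target output shown in the instructions. The variable name `divergentRounds` comes from the hint above; everything else (variable names, formatting helpers) is an assumption:

```python
# Simplified warp divergence simulation (one possible implementation).
WARP_SIZE = 8
threads = list(range(WARP_SIZE))

print("=== Warp Divergence Simulation ===")
print(f"Warp size (simplified): {WARP_SIZE} threads")

print("\n--- Divergent code (if i%2==0) ---")
evens = [t for t in threads if t % 2 == 0]
odds = [t for t in threads if t % 2 != 0]
print(f"Round 1: Even threads ({','.join(map(str, evens))}) "
      f"execute branch A [Odds IDLE]")
print(f"Round 2: Odd threads ({','.join(map(str, odds))}) "
      f"execute branch B [Evens IDLE]")
divergentRounds = 2
print(f"Execution rounds: {divergentRounds}")
print(f"Efficiency: {100 / divergentRounds:.0f}%")  # float division → 50%

print("\n--- Non-divergent code ---")
print("Round 1: ALL threads execute same path")
uniformRounds = 1
print(f"Execution rounds: {uniformRounds}")
print(f"Efficiency: {100 / uniformRounds:.0f}%")

print(f"\nSpeedup from removing divergence: "
      f"{divergentRounds / uniformRounds:.2f}x")
```

The `:.0f` format avoids Python's float division printing `50.0%`, and the speedup is simply the ratio of serialized rounds.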