CUDA Programming Optimization — Make It Blazing Fast
Exercise 28

Occupancy & Warps 20 XP Medium


📊 Chapter 6, Part 3: Occupancy — Keeping Your Army Busy

💡 Story: Imagine each SM (Streaming Multiprocessor) on your GPU is a factory floor that can hold 2048 workers. Your workers arrive in squads (warps) of 32. You want the floor as FULL as possible — that's occupancy! An SM sitting half-empty wastes GPU power. Max occupancy = max performance potential.

```
// ---- KEY NUMBERS (data-center Ampere, e.g. A100) ----
// (Note: consumer RTX 30xx SMs are lower: 1536 threads / 48 warps per SM)
// Max threads per SM:    2048
// Max warps per SM:      64   (2048 / 32)
// Max blocks per SM:     32
// Max threads per block: 1024
// Warp size:             32

// If we launch blocks of 256 threads each (8 warps per block):
//   Blocks per SM = min(32, floor(64 / 8)) = min(32, 8) = 8 blocks
//   Active warps  = 8 blocks * 8 warps = 64 warps
//   Occupancy     = 64 / 64 = 100%  ← IDEAL!

// If we launch blocks of 32 threads each (1 warp per block):
//   Blocks per SM = min(32, 64) = 32
//   Active warps  = 32 * 1 = 32
//   Occupancy     = 32 / 64 = 50%  ← POOR

// REGISTERS also limit occupancy:
//   Total registers per SM: 65536
//   If a kernel uses 64 registers/thread:
//   Threads per SM = 65536 / 64 = 1024 → only 50% occupancy!

__global__ void registerHeavy() {
    float a, b, c, d, e, f, g, h;
    // More live variables → more registers per thread
    // → fewer threads fit per SM → lower occupancy
}
```
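The same arithmetic can be sketched in a few lines of Python (the SM limits below are the assumed values quoted in the comments above):

```python
# Occupancy from block size, limited by warps-per-SM and blocks-per-SM.
# Assumed SM limits: 64 warps and 32 blocks per SM (data-center Ampere).
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 32
WARP_SIZE = 32

def occupancy(block_size):
    warps_per_block = block_size // WARP_SIZE
    # Resident blocks = whichever cap bites first: warp slots or block slots
    blocks = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    active_warps = blocks * warps_per_block
    return active_warps / MAX_WARPS_PER_SM

print(occupancy(256))  # 1.0  -> 100%, warp slots fully used
print(occupancy(32))   # 0.5  -> 50%, the 32-block cap bites first
```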

The three limits on occupancy:

  • 🧵 Thread/block count — At most 2048 threads and 32 blocks per SM
  • 📦 Registers — Each thread consumes registers; all threads resident on an SM share a 65536-register file
  • 🏦 Shared memory — If a kernel uses 24KB of shared memory per block and the SM has 48KB, only 2 blocks fit!

Two practical tips:

  • 📐 Block size sweet spot — Usually 128-256 threads per block gives the best occupancy
  • 🔧 Tool — The CUDA Occupancy Calculator (shipped with the CUDA Toolkit) helps find an optimal launch config
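All three limits can be folded into one calculation. Here is a Python sketch (the per-SM capacities are assumptions matching the numbers in this chapter: 64 warps, 32 blocks, 65536 registers, 48KB shared memory):

```python
# Occupancy under all three limits: warp/block slots, registers, shared memory.
MAX_WARPS = 64
MAX_BLOCKS = 32
MAX_REGS = 65536
MAX_SMEM = 48 * 1024  # bytes
WARP_SIZE = 32

def occupancy_all_limits(block_size, regs_per_thread=32, smem_per_block=0):
    warps_per_block = block_size // WARP_SIZE
    by_warps = MAX_WARPS // warps_per_block
    by_regs = MAX_REGS // (regs_per_thread * block_size)
    by_smem = MAX_SMEM // smem_per_block if smem_per_block else MAX_BLOCKS
    blocks = min(MAX_BLOCKS, by_warps, by_regs, by_smem)  # tightest cap wins
    return blocks * warps_per_block / MAX_WARPS

# 64 regs/thread caps us at 4 blocks of 256 -> 1024 threads -> 50%
print(occupancy_all_limits(256, regs_per_thread=64))      # 0.5
# 24KB smem/block on a 48KB SM -> only 2 blocks fit -> 25%
print(occupancy_all_limits(256, smem_per_block=24 * 1024))  # 0.25
```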
```
// Using the CUDA occupancy API at runtime:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) { /* ... */ }

int main() {
    int minGridSize, bestBlockSize;

    // Let CUDA suggest an optimal block size for your kernel!
    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,    // min grid size needed for full occupancy
        &bestBlockSize,  // optimal block size
        myKernel,        // your kernel function
        0,               // dynamic shared memory per block (bytes)
        0);              // max block size (0 = no limit)

    printf("Best block size: %d\n", bestBlockSize);
    printf("Min grid size:   %d\n", minGridSize);

    // Launch with these values for maximum occupancy!
    float *data = nullptr; int n = 0;  // placeholders for real buffers
    myKernel<<<minGridSize, bestBlockSize>>>(data, n);
    cudaDeviceSynchronize();
}
```
📋 Instructions
Calculate occupancy for 3 launch configurations on an SM that supports at most 2048 threads, 64 warps, and 32 blocks:

```
=== GPU Occupancy Calculator ===
SM Capacity: 2048 threads, 64 warps, 32 blocks

[Config 1] Block size: 256 threads (8 warps)
  Blocks fitting (warp limit):  64/8 = 8
  Blocks fitting (block limit): 32
  Active blocks: 8
  Active warps:  64
  Occupancy:     100%

[Config 2] Block size: 32 threads (1 warp)
  Blocks fitting (warp limit):  64/1 = 64 -> capped at 32
  Blocks fitting (block limit): 32
  Active blocks: 32
  Active warps:  32
  Occupancy:     50%

[Config 3] Block size: 1024 threads (32 warps)
  Blocks fitting (warp limit):  64/32 = 2
  Blocks fitting (block limit): 32
  Active blocks: 2
  Active warps:  64
  Occupancy:     100%

Best config: 256 or 1024 threads per block
```
Run the code as-is. Notice that a tiny block size (32) wastes SM capacity: the 32-block cap is hit first, leaving half the warp slots empty. A balanced block size like 256 fills all 64 warp slots while staying well under the block cap. Higher occupancy generally means better latency hiding.
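The loop behind the calculator can be sketched in Python like this (same assumed SM limits as above: 64 warps, 32 blocks):

```python
# Compute occupancy for the three block sizes from the exercise.
WARP_SIZE = 32
MAX_WARPS, MAX_BLOCKS = 64, 32

results = {}
for block_size in (256, 32, 1024):
    warps_per_block = block_size // WARP_SIZE
    active_blocks = min(MAX_BLOCKS, MAX_WARPS // warps_per_block)
    active_warps = active_blocks * warps_per_block
    results[block_size] = 100 * active_warps // MAX_WARPS

print("=== GPU Occupancy Calculator ===")
for block_size, pct in results.items():
    print(f"Block size {block_size:4d}: occupancy {pct}%")
```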