CUDA Programming Optimization — Make It Blazing Fast
Exercise 28

Occupancy & Warps 20 XP Medium


📊 Chapter 6, Part 3: Occupancy — Keeping Your Army Busy

💡 Story: Imagine each SM (Streaming Multiprocessor) on your GPU is a factory floor that can hold 2048 workers. Your workers arrive in squads (warps) of 32. You want the floor as FULL as possible — that's occupancy! An SM sitting half-empty wastes GPU power. Max occupancy = max performance potential.

```
// ---- KEY NUMBERS (data-center Ampere, e.g. A100) ----
// (Note: consumer RTX 30xx SMs are lower: 1536 threads / 48 warps per SM)
// Max threads per SM:    2048
// Max warps per SM:      64   (2048 / 32)
// Max blocks per SM:     32
// Max threads per block: 1024
// Warp size:             32

// If we launch blocks of 256 threads each (8 warps per block):
//   Blocks per SM = min(32, floor(64 / 8)) = min(32, 8) = 8 blocks
//   Active warps  = 8 blocks * 8 warps = 64 warps
//   Occupancy     = 64 / 64 = 100%  ← IDEAL!

// If we launch blocks of 32 threads each (1 warp per block):
//   Blocks per SM = min(32, 64) = 32
//   Active warps  = 32 * 1 = 32
//   Occupancy     = 32 / 64 = 50%  ← POOR

// REGISTERS also limit occupancy:
//   Total registers per SM: 65536
//   If a kernel uses 64 registers/thread:
//   Threads per SM = 65536 / 64 = 1024 → only 50% occupancy!

__global__ void registerHeavy() {
    float a, b, c, d, e, f, g, h;
    // More live variables → more registers per thread
    // → fewer threads fit per SM → lower occupancy
}
```
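The same arithmetic can be sketched in a few lines of Python (the SM limits below are the assumed values quoted in the comments above):

```python
# Occupancy from block size, limited by warps-per-SM and blocks-per-SM.
# Assumed SM limits: 64 warps and 32 blocks per SM (data-center Ampere).
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 32
WARP_SIZE = 32

def occupancy(block_size):
    warps_per_block = block_size // WARP_SIZE
    # Resident blocks = whichever cap bites first: warp slots or block slots
    blocks = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    active_warps = blocks * warps_per_block
    return active_warps / MAX_WARPS_PER_SM

print(occupancy(256))  # 1.0  -> 100%, warp slots fully used
print(occupancy(32))   # 0.5  -> 50%, the 32-block cap bites first
```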

The three limits on occupancy:

  • 🧵 Thread/block count — At most 2048 threads and 32 blocks per SM
  • 📦 Registers — Each thread consumes registers; all threads resident on an SM share a 65536-register file
  • 🏦 Shared memory — If a kernel uses 24KB of shared memory per block and the SM has 48KB, only 2 blocks fit!

Two practical tips:

  • 📐 Block size sweet spot — Usually 128-256 threads per block gives the best occupancy
  • 🔧 Tool — The CUDA Occupancy Calculator (shipped with the CUDA Toolkit) helps find an optimal launch config
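All three limits can be folded into one calculation. Here is a Python sketch (the per-SM capacities are assumptions matching the numbers in this chapter: 64 warps, 32 blocks, 65536 registers, 48KB shared memory):

```python
# Occupancy under all three limits: warp/block slots, registers, shared memory.
MAX_WARPS = 64
MAX_BLOCKS = 32
MAX_REGS = 65536
MAX_SMEM = 48 * 1024  # bytes
WARP_SIZE = 32

def occupancy_all_limits(block_size, regs_per_thread=32, smem_per_block=0):
    warps_per_block = block_size // WARP_SIZE
    by_warps = MAX_WARPS // warps_per_block
    by_regs = MAX_REGS // (regs_per_thread * block_size)
    by_smem = MAX_SMEM // smem_per_block if smem_per_block else MAX_BLOCKS
    blocks = min(MAX_BLOCKS, by_warps, by_regs, by_smem)  # tightest cap wins
    return blocks * warps_per_block / MAX_WARPS

# 64 regs/thread caps us at 4 blocks of 256 -> 1024 threads -> 50%
print(occupancy_all_limits(256, regs_per_thread=64))      # 0.5
# 24KB smem/block on a 48KB SM -> only 2 blocks fit -> 25%
print(occupancy_all_limits(256, smem_per_block=24 * 1024))  # 0.25
```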
```
// Using the CUDA occupancy API at runtime:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) { /* ... */ }

int main() {
    int minGridSize, bestBlockSize;

    // Let CUDA suggest an optimal block size for your kernel!
    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,    // min grid size needed for full occupancy
        &bestBlockSize,  // optimal block size
        myKernel,        // your kernel function
        0,               // dynamic shared memory per block (bytes)
        0);              // max block size (0 = no limit)

    printf("Best block size: %d\n", bestBlockSize);
    printf("Min grid size:   %d\n", minGridSize);

    // Launch with these values for maximum occupancy!
    float *data = nullptr; int n = 0;  // placeholders for real buffers
    myKernel<<<minGridSize, bestBlockSize>>>(data, n);
    cudaDeviceSynchronize();
}
```
📋 Instructions
Calculate occupancy for 3 launch configurations on an SM that supports at most 2048 threads, 64 warps, and 32 blocks:

```
=== GPU Occupancy Calculator ===
SM Capacity: 2048 threads, 64 warps, 32 blocks

[Config 1] Block size: 256 threads (8 warps)
  Blocks fitting (warp limit):  64/8 = 8
  Blocks fitting (block limit): 32
  Active blocks: 8
  Active warps:  64
  Occupancy:     100%

[Config 2] Block size: 32 threads (1 warp)
  Blocks fitting (warp limit):  64/1 = 64 -> capped at 32
  Blocks fitting (block limit): 32
  Active blocks: 32
  Active warps:  32
  Occupancy:     50%

[Config 3] Block size: 1024 threads (32 warps)
  Blocks fitting (warp limit):  64/32 = 2
  Blocks fitting (block limit): 32
  Active blocks: 2
  Active warps:  64
  Occupancy:     100%

Best config: 256 or 1024 threads per block
```
Run the code as-is. Notice that a tiny block size (32) wastes SM capacity: the 32-block cap is hit first, leaving half the warp slots empty. A balanced block size like 256 fills all 64 warp slots while staying well under the block cap. Higher occupancy generally means better latency hiding.
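The loop behind the calculator can be sketched in Python like this (same assumed SM limits as above: 64 warps, 32 blocks):

```python
# Compute occupancy for the three block sizes from the exercise.
WARP_SIZE = 32
MAX_WARPS, MAX_BLOCKS = 64, 32

results = {}
for block_size in (256, 32, 1024):
    warps_per_block = block_size // WARP_SIZE
    active_blocks = min(MAX_BLOCKS, MAX_WARPS // warps_per_block)
    active_warps = active_blocks * warps_per_block
    results[block_size] = 100 * active_warps // MAX_WARPS

print("=== GPU Occupancy Calculator ===")
for block_size, pct in results.items():
    print(f"Block size {block_size:4d}: occupancy {pct}%")
```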