📊 Chapter 6, Part 3: Occupancy — Keeping Your Army Busy
💡 Story: Imagine your GPU's each SM (Streaming Multiprocessor) is a factory floor that can hold 2048 workers. You have squads (warps) of 32. You want the floor as FULL as possible — that's occupancy! An SM sitting half-empty wastes GPU power. Max occupancy = max performance potential.
// ---- KEY NUMBERS (Ampere architecture, RTX 30xx) ----
// Max threads per SM: 2048
// Max warps per SM: 64 (2048 / 32)
// Max blocks per SM: 32
// Max threads per block: 1024
// Warp size: 32
// If we launch blocks of 256 threads each (8 warps per block):
// Blocks per SM = min(32, floor(64 / 8)) = min(32, 8) = 8 blocks
// Active warps = 8 blocks * 8 warps = 64 warps
// Occupancy = 64 / 64 = 100% ← IDEAL!
// If we launch blocks of 32 threads each (1 warp per block):
// Blocks per SM = min(32, 64) = 32
// Active warps = 32 * 1 = 32
// Occupancy = 32 / 64 = 50% ← POOR
// REGISTERS also limit occupancy:
// If kernel uses 64 registers/thread:
// Total registers per SM: 65536
// Threads per SM = 65536 / 64 = 1024 → only 50% occupancy!
__global__ void registerHeavy() {
float a,b,c,d,e,f,g,h; // More registers → fewer threads fit per SM
// ... uses 64+ registers → limits occupancy
}
The three limits on occupancy:
// Using the CUDA occupancy API at runtime:
#include <cuda_runtime.h>
int main() {
int blockSize = 256;
int minGridSize, bestBlockSize;
// Let CUDA suggest optimal block size for your kernel!
cudaOccupancyMaxPotentialBlockSize(
&minGridSize, // min grid size for full occupancy
&bestBlockSize, // optimal block size
myKernel, // your kernel function
0, // dynamic shared memory
0 // max block size (0 = no limit)
);
printf("Best block size: %d\n", bestBlockSize);
printf("Min grid size: %d\n", minGridSize);
// Launch with these values for maximum occupancy!
myKernel<<<minGridSize, bestBlockSize>>>(data, n);
}
📋 Instructions
Calculate occupancy for 3 launch configurations on an SM that supports max 2048 threads, max 64 warps, max 32 blocks:
```
=== GPU Occupancy Calculator ===
SM Capacity: 2048 threads, 64 warps, 32 blocks
[Config 1] Block size: 256 threads (8 warps)
Blocks fitting (warp limit): 64/8 = 8
Blocks fitting (block limit): 32
Active blocks: 8
Active warps: 64
Occupancy: 100%
[Config 2] Block size: 32 threads (1 warp)
Blocks fitting (warp limit): 64/1 = 64 -> capped at 32
Blocks fitting (block limit): 32
Active blocks: 32
Active warps: 32
Occupancy: 50%
[Config 3] Block size: 1024 threads (32 warps)
Blocks fitting (warp limit): 64/32 = 2
Blocks fitting (block limit): 32
Active blocks: 2
Active warps: 64
Occupancy: 100%
Best config: 256 or 1024 threads per block
```
Run the code as-is. Notice that a tiny block size (32) wastes SM capacity because the 32-block cap is hit. A balanced block size like 256 saturates both limits perfectly. Higher occupancy generally means better latency hiding.