🔬 Chapter 10, Part 2: GPU Profiling — See the Truth, Not the Story
💡 Story: A good general never guesses which flank is weak — they send scouts. In GPU programming, the scouts are profilers. NVIDIA ships world-class tools that show exactly where your GPU spends its time, where the bottleneck is, and what to fix. You cannot optimize what you have not measured — profiling is where performance engineering begins.
// Quick profiling commands:
// 1. System timeline view:
nvcc -lineinfo -o app app.cu // -lineinfo enables source-level correlation in Nsight Compute
nsys profile -o report ./app // Writes report.nsys-rep; open it in the Nsight Systems GUI
// 2. Kernel-level deep dive:
ncu --set full ./app // Full kernel metrics
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum ./app // Specific metric
// 3. Command-line quick stats (legacy: nvprof is deprecated and does not support Turing or newer GPUs; use nsys/ncu there):
nvprof --print-gpu-summary ./app
// Key metrics to check (Nsight Compute names):
// sm__throughput.avg.pct_of_peak_sustained_elapsed → compute (SM) utilization
// dram__throughput.avg.pct_of_peak_sustained_elapsed → memory bandwidth utilization
// sm__warps_active.avg.pct_of_peak_sustained_active → achieved occupancy
// smsp__thread_inst_executed_per_inst_executed.ratio → avg active threads per instruction (32 = no divergence)
// (legacy nvprof names you may see elsewhere: l2_read_throughput, achieved_occupancy,
//  warp_execution_efficiency, and inst_fp_32, the FP32 instruction count)
// Step 1: Profile with Nsight Systems → find WHICH kernel is slow
// Step 2: Profile that kernel with Nsight Compute → WHY it's slow:
// - Compute bound? → optimize math, use Tensor Cores, reduce ops
// - Memory bound? → coalesce, use shared memory, reduce reloading
// - Latency bound? → increase occupancy, hide stalls with more warps
// Step 3: Fix bottleneck → re-profile to verify improvement
// Step 4: Repeat until satisfied
//
// The Roofline Model in Nsight Compute shows:
// • Peak theoretical FLOPs (compute roof)
// • Peak memory bandwidth (memory roof)
// • Where your kernel lives on this chart → tells you WHAT limits it
📋 Instructions
Print a profiling guide showing common bottlenecks and fixes:
```
=== GPU Profiling Guide ===
[Profiling Tools]
nsys profile ./app    -> System timeline (start here!)
ncu --set full ./app  -> Kernel deep dive
nvprof ./app          -> Quick stats (legacy GPUs only)
[Common Bottleneck Patterns]
Bottleneck         Symptom           Fix
Memory bandwidth   Low FLOPS util    Coalesce, use shared mem
Compute bound      High FLOPS util   Use Tensor Cores, reduce ops
Latency            Low occupancy     More threads, reduce registers
Warp divergence    Low warp eff      Align branches to warp size
PCIe transfer      GPU sits idle     Async transfers, pinned memory
[Golden Rule]
Measure FIRST, optimize SECOND.
Never guess where the bottleneck is!
```
Run the code to print the profiling cheat sheet. In real GPU projects, always start with Nsight Systems to get the big picture, then drill into specific kernels with Nsight Compute. The Roofline Model is your best friend for knowing whether to optimize compute or memory.