🔬 Chapter 10, Part 2: GPU Profiling — See the Truth, Not the Story
💡 Story: A good general never guesses which flank is weak — they send scouts. In GPU programming, the scouts are profilers. NVIDIA ships world-class tools that show exactly where your GPU spends its time, where the bottleneck is, and what to fix. You cannot optimize what you have not measured — profiling is where performance engineering begins.
// Quick profiling commands:
// 1. System timeline view:
nvcc -lineinfo -o app app.cu // -lineinfo enables source-level correlation in Nsight Compute
nsys profile -o report ./app // Writes report.nsys-rep; open it in the Nsight Systems GUI
// 2. Kernel-level deep dive:
ncu --set full ./app // Full kernel metrics
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum ./app // Specific metric
// 3. Command-line quick stats (legacy: nvprof is deprecated and does not support Turing or newer GPUs; use nsys/ncu there):
nvprof --print-gpu-summary ./app
// Key metrics to check (Nsight Compute names):
// sm__throughput.avg.pct_of_peak_sustained_elapsed → compute (SM) utilization
// dram__throughput.avg.pct_of_peak_sustained_elapsed → memory bandwidth utilization
// sm__warps_active.avg.pct_of_peak_sustained_active → achieved occupancy
// smsp__thread_inst_executed_per_inst_executed.ratio → avg active threads per instruction (32 = no divergence)
// (legacy nvprof names you may see elsewhere: l2_read_throughput, achieved_occupancy,
//  warp_execution_efficiency, and inst_fp_32, the FP32 instruction count)
// Step 1: Profile with Nsight Systems → find WHICH kernel is slow
// Step 2: Profile that kernel with Nsight Compute → WHY it's slow:
// - Compute bound? → optimize math, use Tensor Cores, reduce ops
// - Memory bound? → coalesce, use shared memory, reduce reloading
// - Latency bound? → increase occupancy, hide stalls with more warps
// Step 3: Fix bottleneck → re-profile to verify improvement
// Step 4: Repeat until satisfied
//
// The Roofline Model in Nsight Compute shows:
// • Peak theoretical FLOPs (compute roof)
// • Peak memory bandwidth (memory roof)
// • Where your kernel lives on this chart → tells you WHAT limits it
📋 Instructions
Print a profiling guide showing common bottlenecks and fixes:
```
=== GPU Profiling Guide ===
[Profiling Tools]
nsys profile ./app    -> System timeline (start here!)
ncu --set full ./app  -> Kernel deep dive
nvprof ./app          -> Quick stats (legacy GPUs only)
[Common Bottleneck Patterns]
Bottleneck         Symptom           Fix
Memory bandwidth   Low FLOPS util    Coalesce, use shared mem
Compute bound      High FLOPS util   Use Tensor Cores, reduce ops
Latency            Low occupancy     More threads, reduce registers
Warp divergence    Low warp eff      Align branches to warp size
PCIe transfer      GPU sits idle     Async transfers, pinned memory
[Golden Rule]
Measure FIRST, optimize SECOND.
Never guess where the bottleneck is!
```
Run the code to print the profiling cheat sheet. In real GPU projects, always start with Nsight Systems to get the big picture, then drill into specific kernels with Nsight Compute. The Roofline Model is your best friend for knowing whether to optimize compute or memory.