CUDA Programming: GPU Memory — The Treasure Map
Exercise 18: Registers & Local Memory (15 XP, Easy)


Chapter 4, Part 3: Registers — The Fastest Memory of All

💡 Story: Every soldier carries dog tags — small info they need right NOW (their name, blood type). That's registers! They're the fastest memory, but you can only carry a few. If you carry too many things, some go into a backpack (local memory) which is much slower.

Registers:

  • ⚡ Fastest — ~1-cycle access (no cache miss possible!)
  • 🔒 Private — Each thread has its OWN registers (NOT shared with other threads)
  • 📏 Limited — Typically 255 registers per thread on modern GPUs
  • 🔧 Automatic — Local variables in kernels automatically become registers
  • 💡 Invisible — You don't explicitly declare registers; the compiler handles it
```cuda
__global__ void registerExample(float* in, float* out, int n) {
    // These are REGISTERS — private to each thread, ultra fast!
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // Register
    if (i >= n) return;               // Guard BEFORE touching global memory

    float myVal  = in[i];             // Register — loaded once from slow global mem
    float result = myVal * myVal;     // Register arithmetic — instant!
    float bias   = 0.5f;              // Register

    // Multiple operations on REGISTERS — all ultra fast:
    result = result + bias;
    result = result * 2.0f;
    result = result - 1.0f;

    out[i] = result;                  // ONE write to global memory at the end
    // Rule: Load once, compute many times in registers, store once
}
```

Register Spilling — When registers overflow:

  • 😱 Register spill — If a thread uses more registers than available, excess goes to 'local memory'
  • 💾 Local memory — Physically stored in GLOBAL memory (slow!) but local to one thread
  • 🐌 Impact — Spilling kills performance (like going from pocket → luggage on a plane)
  • 🔧 Detect — Compile with nvcc -Xptxas -v (a.k.a. --ptxas-options=-v) to see per-kernel register usage and spill stores/loads
  • 🛠️ Fix — Reduce live variables per thread, use smaller data types, or cap register usage with --maxrregcount or __launch_bounds__
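As a sketch of the bullets above (the kernel, array size, and file name are illustrative, not part of the exercise), a per-thread array with a runtime-computed index is a classic way to force spills, because the compiler cannot keep dynamically indexed storage in registers:

```cuda
// Illustrative spill-prone kernel: a per-thread array indexed at runtime
// usually cannot live in registers, so it spills to "local" memory
// (physically global memory — slow, but private to this thread).
__global__ void spillProne(float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;

    float scratch[64];                  // Candidate for local memory
    for (int k = 0; k < 64; k++)
        scratch[k] = (float)(i + k);

    float sum = 0.0f;                   // Plain scalar — stays in a register
    for (int k = 0; k < 64; k++)
        sum += scratch[(i + k) % 64];   // Runtime index defeats register allocation

    out[i] = sum;
}

// Compile and inspect (file name is hypothetical):
//   nvcc -Xptxas -v spill.cu
// Look for "N bytes spill stores, M bytes spill loads" in the ptxas output.
```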

📊 CUDA Memory Speed Comparison:

  • 🥇 Registers — ~1 cycle latency — fastest
  • 🥈 Shared Memory — ~5-32 cycles latency
  • 🥉 L1 Cache — ~20-40 cycles latency
  • 4️⃣ L2 Cache — ~100-200 cycles latency
  • 5️⃣ Global Memory — ~400-800 cycles latency — SLOWEST
📋 Instructions
Print the GPU memory hierarchy from fastest to slowest with latency values:
```
=== GPU Memory Hierarchy (Fastest → Slowest) ===
Rank 1: Registers     | Latency: ~1 cycle    | Size: 256KB/SM
Rank 2: Shared Memory | Latency: ~32 cycles  | Size: 48-96KB/SM
Rank 3: L1 Cache      | Latency: ~40 cycles  | Size: 32-128KB/SM
Rank 4: L2 Cache      | Latency: ~200 cycles | Size: 4-80MB
Rank 5: Global Memory | Latency: ~800 cycles | Size: 8-80GB
Rank 6: CPU RAM       | Latency: ~5000 cycles| Size: 32-512GB
```
Each line follows the pattern: printf("Rank N: Name | Latency: ~X cycles | Size: ...\n"); Add one printf for each rank 2-6.