CUDA Programming: GPU Memory — The Treasure Map
Exercise 18: Registers & Local Memory (15 XP, Easy)


Chapter 4, Part 3: Registers — The Fastest Memory of All

💡 Story: Every soldier carries dog tags — small info they need right NOW (their name, blood type). That's registers! They're the fastest memory, but you can only carry a few. If you carry too many things, some go into a backpack (local memory) which is much slower.

Registers:

  • ⚡ Fastest — ~1-cycle access (no cache miss possible!)
  • 🔒 Private — Each thread has its OWN registers (NOT shared with other threads)
  • 📏 Limited — Typically 255 registers per thread on modern GPUs
  • 🔧 Automatic — Local variables in kernels automatically become registers
  • 💡 Invisible — You don't explicitly declare registers; the compiler handles it
```cuda
__global__ void registerExample(float* in, float* out, int n) {
    // These are REGISTERS — private to each thread, ultra fast!
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // Register
    if (i >= n) return;               // Guard BEFORE touching global memory

    float myVal  = in[i];             // Register — loaded once from slow global mem
    float result = myVal * myVal;     // Register arithmetic — instant!
    float bias   = 0.5f;              // Register

    // Multiple operations on REGISTERS — all ultra fast:
    result = result + bias;
    result = result * 2.0f;
    result = result - 1.0f;

    out[i] = result;                  // ONE write to global memory at the end
    // Rule: Load once, compute many times in registers, store once
}
```

Register Spilling — When registers overflow:

  • 😱 Register spill — If a thread uses more registers than available, excess goes to 'local memory'
  • 💾 Local memory — Physically stored in GLOBAL memory (slow!) but local to one thread
  • 🐌 Impact — Spilling kills performance (like going from pocket → luggage on a plane)
  • 🔧 Detect — Compile with nvcc -Xptxas -v (a.k.a. --ptxas-options=-v) to see per-kernel register usage and spill stores/loads
  • 🛠️ Fix — Reduce live variables per thread, use smaller data types, or cap register usage with --maxrregcount or __launch_bounds__
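As a sketch of the bullets above (the kernel, array size, and file name are illustrative, not part of the exercise), a per-thread array with a runtime-computed index is a classic way to force spills, because the compiler cannot keep dynamically indexed storage in registers:

```cuda
// Illustrative spill-prone kernel: a per-thread array indexed at runtime
// usually cannot live in registers, so it spills to "local" memory
// (physically global memory — slow, but private to this thread).
__global__ void spillProne(float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;

    float scratch[64];                  // Candidate for local memory
    for (int k = 0; k < 64; k++)
        scratch[k] = (float)(i + k);

    float sum = 0.0f;                   // Plain scalar — stays in a register
    for (int k = 0; k < 64; k++)
        sum += scratch[(i + k) % 64];   // Runtime index defeats register allocation

    out[i] = sum;
}

// Compile and inspect (file name is hypothetical):
//   nvcc -Xptxas -v spill.cu
// Look for "N bytes spill stores, M bytes spill loads" in the ptxas output.
```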

📊 CUDA Memory Speed Comparison:

  • 🥇 Registers — ~1 cycle latency — fastest
  • 🥈 Shared Memory — ~5-32 cycles latency
  • 🥉 L1 Cache — ~20-40 cycles latency
  • 4️⃣ L2 Cache — ~100-200 cycles latency
  • 5️⃣ Global Memory — ~400-800 cycles latency — SLOWEST
📋 Instructions
Print the GPU memory hierarchy from fastest to slowest with latency values:
```
=== GPU Memory Hierarchy (Fastest → Slowest) ===
Rank 1: Registers     | Latency: ~1 cycle    | Size: 256KB/SM
Rank 2: Shared Memory | Latency: ~32 cycles  | Size: 48-96KB/SM
Rank 3: L1 Cache      | Latency: ~40 cycles  | Size: 32-128KB/SM
Rank 4: L2 Cache      | Latency: ~200 cycles | Size: 4-80MB
Rank 5: Global Memory | Latency: ~800 cycles | Size: 8-80GB
Rank 6: CPU RAM       | Latency: ~5000 cycles| Size: 32-512GB
```
Each line follows the pattern: printf("Rank N: Name | Latency: ~X cycles | Size: ...\n"); Add one printf for each rank 2-6.