⚡ Chapter 4, Part 3: Registers — The Fastest Memory of All
💡 Story: Every soldier carries dog tags: small pieces of info they need right NOW (their name, blood type). That's registers! They're the fastest memory, but each thread can only carry a few. If you try to carry too many things, some go into a backpack (local memory), which is much slower because it actually lives in off-chip device memory.
```cuda
__global__ void registerExample(const float* in, float* out, int n) {
    // Scalar locals like these normally live in REGISTERS: private to each thread, ultra fast!
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // Register
    if (i >= n) return;  // Guard BEFORE touching in[i], so no thread reads out of bounds

    float myVal  = in[i];          // Register: loaded once from slow global memory
    float result = myVal * myVal;  // Register arithmetic: no memory traffic
    float bias   = 0.5f;           // Register (or folded into the instruction as a constant)

    // Multiple operations on REGISTERS, all ultra fast:
    result = result + bias;
    result = result * 2.0f;
    result = result - 1.0f;

    out[i] = result;  // ONE write to global memory at the end
    // Rule: load once, compute many times in registers, store once
}
```
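To try the kernel, the host has to allocate device buffers, copy the input over, launch, and copy the result back. The following is a minimal sketch of that flow, not the lesson's official harness: the helper name `runRegisterExample` and the block size of 256 threads are assumptions made for illustration, and the kernel is repeated (with a bounds guard before the load) so the block compiles on its own.

```cuda
#include <vector>
#include <cuda_runtime.h>

// The lesson's kernel, repeated so this sketch is self-contained.
__global__ void registerExample(const float* in, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;            // guard before reading in[i]
    float myVal  = in[i];          // one global load
    float result = myVal * myVal;  // everything below stays in registers
    result = result + 0.5f;
    result = result * 2.0f;
    result = result - 1.0f;
    out[i] = result;               // one global store
}

// Hypothetical helper (not part of the lesson): runs the kernel on `input`
// and returns the results. Returns an empty vector if a CUDA call fails.
std::vector<float> runRegisterExample(const std::vector<float>& input) {
    const int n = static_cast<int>(input.size());
    std::vector<float> output(n);
    if (n == 0) return output;
    const size_t bytes = n * sizeof(float);

    float* dIn = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return {}; }

    cudaError_t err = cudaMemcpy(dIn, input.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threads = 256;                          // assumed block size
        const int blocks  = (n + threads - 1) / threads;  // round up to cover all n
        registerExample<<<blocks, threads>>>(dIn, dOut, n);
        err = cudaGetLastError();                         // catches launch failures
    }
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(output.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return {};
    return output;
}
```

For an input x the kernel computes ((x * x + 0.5) * 2) - 1, which simplifies to 2 * x * x, so calling `runRegisterExample({3.0f})` from `main` should give back 18.0f.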
Register Spilling — When registers overflow:
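When a kernel needs more registers per thread than the compiler can hand out, the extra values are "spilled" to local memory (the backpack from the story). The sketch below is one way to experiment with that, under stated assumptions: the kernel `manyLiveValues`, the constant `kLive`, and the helper `runManyLiveValues` are hypothetical names invented for this example. Compiling with `nvcc -Xptxas -v` makes the compiler report registers used and spill bytes per kernel, and adding `-maxrregcount=16` caps the registers per thread, which makes spilling more likely. Whether this particular kernel spills depends on the GPU architecture and compiler version, since the compiler is free to recompute values instead of storing them.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kLive = 64;  // hypothetical: how many per-thread values we try to keep alive

// __launch_bounds__(256) tells the compiler that blocks have at most 256
// threads, which it can use when deciding how many registers each thread gets.
__global__ void __launch_bounds__(256)
manyLiveValues(const float* in, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;

    float x = in[i];   // one global load
    float acc[kLive];  // indices are constant after unrolling, so these can
                       // stay in registers if enough are available

    float total = 0.0f;
    #pragma unroll
    for (int k = 0; k < kLive; ++k) {
        acc[k] = x + static_cast<float>(k);
        total += acc[k];
    }

    // Every acc[k] is needed again after `total` is known, so all kLive
    // values are live at once (unless the compiler chooses to recompute them).
    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < kLive; ++k) sum += acc[k] * total;

    out[i] = sum;  // one global store
}

// Hypothetical helper: runs the kernel and returns the results,
// or an empty vector if a CUDA call fails.
std::vector<float> runManyLiveValues(const std::vector<float>& input) {
    const int n = static_cast<int>(input.size());
    std::vector<float> output(n);
    if (n == 0) return output;
    const size_t bytes = n * sizeof(float);

    float* dIn = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return {}; }

    cudaError_t err = cudaMemcpy(dIn, input.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;  // must not exceed the __launch_bounds__ value
        const int blocks  = (n + threads - 1) / threads;
        manyLiveValues<<<blocks, threads>>>(dIn, dOut, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(output.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return {};
    return output;
}
```

A reasonable experiment is to compile the file twice, once with plain `nvcc -Xptxas -v` and once with `-maxrregcount=16` added, and compare the reported register counts and spill store/load bytes. The results should be identical either way; only the speed and the memory traffic change.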
📊 CUDA Memory Speed Comparison:
📋 Instructions
Print the GPU memory hierarchy from fastest to slowest with latency values:
```
=== GPU Memory Hierarchy (Fastest → Slowest) ===
Rank 1: Registers | Latency: ~1 cycle | Size: 256KB/SM
Rank 2: Shared Memory | Latency: ~32 cycles | Size: 48-96KB/SM
Rank 3: L1 Cache | Latency: ~40 cycles | Size: 32-128KB/SM
Rank 4: L2 Cache | Latency: ~200 cycles | Size: 4-80MB
Rank 5: Global Memory | Latency: ~800 cycles | Size: 8-80GB
Rank 6: CPU RAM | Latency: ~5000 cycles| Size: 32-512GB
```
Each line follows the pattern `printf("Rank N: Name | Latency: ~X cycles | Size: ...\n");`. Add one `printf` for each of ranks 2 through 6.
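The sizes in the expected output are typical ranges, and the real numbers depend on the GPU. As a rough sketch, the sizes for a specific card can be read with `cudaGetDeviceProperties`. The helper names `registerFileBytesPerSM` and `printDeviceSizes` are hypothetical, made up for this example. Each register is 32 bits (4 bytes) wide, so a GPU with 65,536 registers per SM has the 256KB/SM register file shown in the Rank 1 line.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: register file size per SM in bytes.
// Each register holds 32 bits, i.e. 4 bytes.
size_t registerFileBytesPerSM(const cudaDeviceProp& prop) {
    return static_cast<size_t>(prop.regsPerMultiprocessor) * 4;
}

// Hypothetical helper: prints memory sizes for one device.
// Returns false if the device properties could not be read.
bool printDeviceSizes(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    printf("Device: %s\n", prop.name);
    printf("Registers     : %zu KB/SM\n", registerFileBytesPerSM(prop) / 1024);
    printf("Shared Memory : %zu KB/SM\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("L2 Cache      : %d KB\n", prop.l2CacheSize / 1024);
    printf("Global Memory : %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
    return true;
}
```

Calling `printDeviceSizes(0)` from `main` prints the values for the first GPU in the system. Latencies are not exposed through this API, so the cycle counts in the exercise remain approximate figures.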