🚀 Chapter 6: Optimization — Make It Blazing Fast!
💡 Story: Your GPU's memory system works like a school bus. The bus picks up 32 passengers (1 warp of threads) at once. If those 32 passengers are at 32 consecutive houses on the same street, the bus makes 1 efficient trip. But if they're scattered all over town — the bus makes 32 separate trips! That's the difference between coalesced and non-coalesced access.
Memory Coalescing — The #1 GPU Optimization:
// ✅ COALESCED (fast!) — Threads access consecutive addresses
__global__ void coalesced(float* arr, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n) arr[i] *= 2.0f; // Thread 0→arr[0], Thread 1→arr[1], ... ✓
}
// Memory access pattern: [0][1][2][3]...[31] ← ONE memory transaction!
// ❌ NON-COALESCED (slow!) — Threads access strided/scattered addresses
__global__ void nonCoalesced(float* arr, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i * 2 < n) arr[i * 2] *= 2.0f; // Thread 0→arr[0], Thread 1→arr[2], ...
}
// Memory access: [0][2][4][6]...[62] ← 32 separate transactions!!
// SLOW: needs 32 transactions vs 1 for coalesced = 32× slower memory!
🏗️ The Rule of Coalescing:
// Classic coalescing example: Matrix operations
//
// Row-major matrix (C style): mat[row][col] = mat[row*width + col]
// ✅ COALESCED: Threads in same warp process same ROW, different COLUMNS
__global__ void rowWise(float* mat, int width) {
int row = blockIdx.x;
int col = threadIdx.x; // Thread 0→col0, Thread 1→col1 (consecutive!)
mat[row * width + col] *= 2.0f; // ✓ Coalesced!
}
// ❌ NON-COALESCED: Threads in same warp process same COLUMN, different ROWS
__global__ void colWise(float* mat, int height, int width) {
int col = blockIdx.x;
int row = threadIdx.x; // Thread 0→row0, Thread 1→row1 (STRIDE of width!)
mat[row * width + col] *= 2.0f; // ✗ Non-coalesced! Each access is width apart
}
📋 Instructions
Calculate and compare memory access efficiency. For a warp of 32 threads with coalesced vs strided access:
```
=== Memory Coalescing Analysis ===
Warp size: 32 threads
Cache line: 128 bytes (32 floats)
Coalesced access (stride=1):
Thread 0 -> addr 0, Thread 1 -> addr 4, ..., Thread 31 -> addr 124
All 32 addresses fit in 1 cache line!
Transactions needed: 1
Efficiency: 100%
Strided access (stride=2):
Thread 0 -> addr 0, Thread 1 -> addr 8, ..., Thread 31 -> addr 248
Spans 2 cache lines
Transactions needed: 2
Efficiency: 50%
Strided access (stride=32):
Thread 0 -> addr 0, Thread 1 -> addr 128, ..., Thread 31 -> addr 3968
Spans 32 cache lines
Transactions needed: 32
Efficiency: 3%
```
Run the provided code. The analyzeAccess function calculates how many cache lines are touched and the memory efficiency percentage. The key insight: stride=1 (coalesced) = 100% efficient; each doubling of stride halves efficiency.