CUDA Programming Optimization — Make It Blazing Fast
Exercise 26: Coalesced Memory Access (20 XP, Medium)


🚀 Chapter 6: Optimization — Make It Blazing Fast!

💡 Story: Your GPU's memory system works like a school bus. The bus picks up 32 passengers (1 warp of threads) at once. If those 32 passengers are at 32 consecutive houses on the same street, the bus makes 1 efficient trip. But if they're scattered all over town — the bus makes 32 separate trips! That's the difference between coalesced and non-coalesced access.

Memory Coalescing — The #1 GPU Optimization:

```
// ✅ COALESCED (fast!) — Threads access consecutive addresses
__global__ void coalesced(float* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) arr[i] *= 2.0f;   // Thread 0→arr[0], Thread 1→arr[1], ... ✓
}
// Memory access pattern: [0][1][2][3]...[31] ← ONE memory transaction!

// ❌ NON-COALESCED (slower!) — Threads access strided addresses
__global__ void nonCoalesced(float* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i * 2 < n) arr[i * 2] *= 2.0f;  // Thread 0→arr[0], Thread 1→arr[2], ...
}
// Memory access: [0][2][4][6]...[62] ← spans TWO 128-byte cache lines,
// so 2 transactions instead of 1 (50% efficiency).
// Worst case (stride ≥ 32): every thread hits its own cache line →
// 32 transactions = 32× more memory traffic than coalesced access!
```

🏗️ The Rule of Coalescing:

  • Coalesced — Thread i accesses element i (or i+offset). Sequential threads → sequential addresses
  • Stride access — Thread i accesses element i*stride → every-other, every-third, etc.
  • Random access — Each thread accesses a random address → worst case
  • 📏 Hardware unit — Memory is fetched in 128-byte cache lines. The GPU fetches entire cache lines at once
```
// Classic coalescing example: Matrix operations
//
// Row-major matrix (C style): mat[row][col] = mat[row*width + col]

// ✅ COALESCED: Threads in same warp process same ROW, different COLUMNS
__global__ void rowWise(float* mat, int width) {
    int row = blockIdx.x;
    int col = threadIdx.x;           // Thread 0→col0, Thread 1→col1 (consecutive!)
    mat[row * width + col] *= 2.0f;  // ✓ Coalesced!
}

// ❌ NON-COALESCED: Threads in same warp process same COLUMN, different ROWS
__global__ void colWise(float* mat, int height, int width) {
    int col = blockIdx.x;
    int row = threadIdx.x;           // Thread 0→row0, Thread 1→row1 (STRIDE of width!)
    mat[row * width + col] *= 2.0f;  // ✗ Non-coalesced! Consecutive threads are width floats apart
}
```
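To see why the column-wise kernel strides, you can work out the byte addresses one warp touches under row-major indexing. This is a small illustrative helper (not part of the exercise code; the function name and setup are my own):

```python
WARP_SIZE = 32
FLOAT_BYTES = 4

def warp_addresses(width, row_wise=True):
    """Byte addresses touched by threads 0..31 of one warp,
    for a row-major float matrix: addr = (row*width + col) * 4."""
    addrs = []
    for tid in range(WARP_SIZE):
        if row_wise:
            row, col = 0, tid      # rowWise: consecutive columns
        else:
            row, col = tid, 0      # colWise: consecutive rows
        addrs.append((row * width + col) * FLOAT_BYTES)
    return addrs

width = 1024
print(warp_addresses(width, row_wise=True)[:4])   # consecutive: [0, 4, 8, 12]
print(warp_addresses(width, row_wise=False)[:4])  # strided by width: [0, 4096, 8192, 12288]
```

The row-wise pattern packs all 32 accesses into 128 consecutive bytes; the column-wise pattern spreads them `width * 4` bytes apart, one cache line per thread.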
📋 Instructions
Calculate and compare memory access efficiency. For a warp of 32 threads with coalesced vs strided access, the expected output is:
```
=== Memory Coalescing Analysis ===
Warp size: 32 threads
Cache line: 128 bytes (32 floats)

Coalesced access (stride=1):
  Thread 0 -> addr 0, Thread 1 -> addr 4, ..., Thread 31 -> addr 124
  All 32 addresses fit in 1 cache line!
  Transactions needed: 1
  Efficiency: 100%

Strided access (stride=2):
  Thread 0 -> addr 0, Thread 1 -> addr 8, ..., Thread 31 -> addr 248
  Spans 2 cache lines
  Transactions needed: 2
  Efficiency: 50%

Strided access (stride=32):
  Thread 0 -> addr 0, Thread 1 -> addr 128, ..., Thread 31 -> addr 3968
  Spans 32 cache lines
  Transactions needed: 32
  Efficiency: 3%
```
Run the provided code. The analyzeAccess function calculates how many cache lines are touched and the memory efficiency percentage. The key insight: stride=1 (coalesced) is 100% efficient, and each doubling of the stride halves efficiency until stride 32, where every thread touches its own cache line and efficiency bottoms out at about 3% (1/32).
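The calculation itself can be sketched in a few lines of Python. (The exercise's actual analyzeAccess isn't shown here, so the name and structure below are assumptions; it counts distinct 128-byte cache lines touched by one warp.)

```python
WARP_SIZE = 32
CACHE_LINE = 128   # bytes
FLOAT_BYTES = 4

def analyze_access(stride):
    """Transactions and efficiency for a warp reading floats at a given
    element stride: thread i accesses byte address i * stride * 4."""
    addrs = [tid * stride * FLOAT_BYTES for tid in range(WARP_SIZE)]
    lines = {addr // CACHE_LINE for addr in addrs}  # distinct cache lines touched
    transactions = len(lines)
    efficiency = 100.0 / transactions               # 1 transaction is the ideal
    return transactions, efficiency

for stride in (1, 2, 32):
    t, e = analyze_access(stride)
    print(f"stride={stride}: {t} transactions, {e:.0f}% efficiency")
```

This reproduces the table above: stride 1 → 1 transaction (100%), stride 2 → 2 (50%), stride 32 → 32 (~3%).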