Coalesced Memory Access 20 XP Medium

Ctrl+Enter Run Ctrl+S Save

🚀 Chapter 6: Optimization — Make It Blazing Fast!

💡 Story: Your GPU's memory system works like a school bus. The bus picks up 32 passengers (1 warp of threads) at once. If those 32 passengers are at 32 consecutive houses on the same street, the bus makes 1 efficient trip. But if they're scattered all over town — the bus makes 32 separate trips! That's the difference between coalesced and non-coalesced access.

Memory Coalescing — The #1 GPU Optimization:

// ✅ COALESCED (fast!) — Threads access consecutive addresses
__global__ void coalesced(float* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) arr[i] *= 2.0f;  // Thread 0→arr[0], Thread 1→arr[1], ... ✓
}
// Memory access pattern: [0][1][2][3]...[31] ← ONE memory transaction!

// ❌ NON-COALESCED (slow!) — Threads access strided/scattered addresses  
__global__ void nonCoalesced(float* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i * 2 < n) arr[i * 2] *= 2.0f;  // Thread 0→arr[0], Thread 1→arr[2], ...
}
// Memory access: [0][2][4][6]...[62] ← 32 separate transactions!!
// SLOW: needs 32 transactions vs 1 for coalesced = 32× slower memory!

🏗️ The Rule of Coalescing:

✅ Coalesced — Thread i accesses element i (or i+offset). Sequential threads → sequential addresses
❌ Stride access — Thread i accesses element i*stride → every-other, every-third, etc.
❌ Random access — Each thread accesses a random address → worst case
📏 Hardware unit — Memory is fetched in 128-byte cache lines. The GPU fetches entire cache lines at once

// Classic coalescing example: Matrix operations
//
// Row-major matrix (C style): mat[row][col] = mat[row*width + col]

// ✅ COALESCED: Threads in same warp process same ROW, different COLUMNS
__global__ void rowWise(float* mat, int width) {
    int row = blockIdx.x;
    int col = threadIdx.x;  // Thread 0→col0, Thread 1→col1 (consecutive!)
    mat[row * width + col] *= 2.0f;  // ✓ Coalesced!
}

// ❌ NON-COALESCED: Threads in same warp process same COLUMN, different ROWS
__global__ void colWise(float* mat, int height, int width) {
    int col = blockIdx.x;
    int row = threadIdx.x;  // Thread 0→row0, Thread 1→row1 (STRIDE of width!)
    mat[row * width + col] *= 2.0f;  // ✗ Non-coalesced! Each access is width apart
}

📋 Instructions

Calculate and compare memory access efficiency. For a warp of 32 threads with coalesced vs strided access: ``` === Memory Coalescing Analysis === Warp size: 32 threads Cache line: 128 bytes (32 floats) Coalesced access (stride=1): Thread 0 -> addr 0, Thread 1 -> addr 4, ..., Thread 31 -> addr 124 All 32 addresses fit in 1 cache line! Transactions needed: 1 Efficiency: 100% Strided access (stride=2): Thread 0 -> addr 0, Thread 1 -> addr 8, ..., Thread 31 -> addr 248 Spans 2 cache lines Transactions needed: 2 Efficiency: 50% Strided access (stride=32): Thread 0 -> addr 0, Thread 1 -> addr 128, ..., Thread 31 -> addr 3968 Spans 32 cache lines Transactions needed: 32 Efficiency: 3% ```

Run the provided code. The analyzeAccess function calculates how many cache lines are touched and the memory efficiency percentage. The key insight: stride=1 (coalesced) = 100% efficient; each doubling of stride halves efficiency.

← Previous Next Exercise →

main.py

Hi! I'm Rex 👋

#include <stdio.h>

void analyzeAccess(int stride, int warpSize, int cacheLineBytes, int elemBytes) {
    int spanBytes = (warpSize - 1) * stride * elemBytes + elemBytes;
    int transactions = (spanBytes + cacheLineBytes - 1) / cacheLineBytes;
    int efficiency = (int)((1.0f * warpSize * elemBytes) / (1.0f * transactions * cacheLineBytes) * 100);
    printf("Strided access (stride=%d):\n", stride);
    printf("  Spans %d cache lines\n", transactions);
    printf("  Transactions needed: %d\n", transactions);
    printf("  Efficiency: %d%%\n\n", efficiency);
}

int main() {
    int warpSize = 32, cacheLine = 128, elemSize = 4;
    printf("=== Memory Coalescing Analysis ===\n");
    printf("Warp size: 32 threads\n");
    printf("Cache line: 128 bytes (32 floats)\n\n");
    printf("Coalesced access (stride=1):\n");
    printf("  Thread 0 -> addr 0, Thread 1 -> addr 4, ..., Thread 31 -> addr 124\n");
    printf("  All 32 addresses fit in 1 cache line!\n");
    printf("  Transactions needed: 1\n");
    printf("  Efficiency: 100%%\n\n");
    analyzeAccess(2, warpSize, cacheLine, elemSize);
    analyzeAccess(32, warpSize, cacheLine, elemSize);
    return 0;
}

Output

Ready. Press ▶ Run or Ctrl+Enter.

›