CUDA Programming Optimization — Make It Blazing Fast
Exercise 29: Shared Memory Tiling (25 XP, Hard)


🧩 Chapter 6, Part 4: Tiling — Work Smart, Not Hard

💡 Story: Your army needs to read from a huge warehouse (global memory) far away. Every time a soldier needs a supply, going to the warehouse takes 600 cycles. Smart generals set up a FORWARD BASE (shared memory) — bring supplies in BULK to the forward base, then distribute locally at 80 cycles. That bulk-load strategy is tiling!

```cuda
// ❌ NAIVE: Each thread loads its own data from global memory
__global__ void naiveSum(float* a, float* b, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        // Every thread independently reads from global memory
        out[i] = a[i] + b[i];  // 2 global reads per thread
    }
}
// For n=1024, that's 2048 global memory accesses. Fine for simple ops.
// But for algorithms that REUSE data (like matrix multiply), this is devastating!
```
```cuda
// ✅ TILED: Load a tile into shared memory, compute from fast shared memory
#define TILE_SIZE 16

__global__ void tiledKernel(float* input, float* output, int n) {
    // 1. DECLARE shared memory tile
    __shared__ float tile[TILE_SIZE];

    int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
    int localIdx  = threadIdx.x;

    // 2. LOAD: Each thread loads ONE element into the tile
    if (globalIdx < n)
        tile[localIdx] = input[globalIdx];  // Global → Shared

    // 3. SYNC: Wait for ALL threads to finish loading
    __syncthreads();

    // 4. COMPUTE: Now use fast shared memory
    // Example: sum each element with its left neighbor
    if (localIdx > 0 && globalIdx < n) {
        // Read from shared memory (fast!) instead of global memory
        output[globalIdx] = tile[localIdx] + tile[localIdx - 1];
    } else if (globalIdx < n) {
        // Tile boundary: the left neighbor lives in the PREVIOUS tile, so this
        // simplified kernel just copies. (A full stencil would load halo elements.)
        output[globalIdx] = tile[localIdx];
    }
    // Threads read tile[localIdx - 1], which ANOTHER thread wrote!
    // Without __syncthreads(), that data might not be there yet.
}
```
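The tiled kernel's behavior is easy to sanity-check with a small Python model (a sketch for illustration; `tiled_kernel_model` is a hypothetical helper, not part of the exercise). One Python list per block plays the role of `__shared__` memory:

```python
TILE_SIZE = 16  # mirrors the #define above


def tiled_kernel_model(inp, tile_size=TILE_SIZE):
    """Model of tiledKernel: one list per block stands in for __shared__."""
    out = [0.0] * len(inp)
    for start in range(0, len(inp), tile_size):
        # LOAD: each "thread" copies one element into the tile
        tile = inp[start:start + tile_size]
        # (__syncthreads() would sit here: all loads finish before any compute)
        # COMPUTE: neighbors come from the tile, not from `inp`
        for local in range(len(tile)):
            if local > 0:
                out[start + local] = tile[local] + tile[local - 1]
            else:
                out[start + local] = tile[local]  # tile boundary: plain copy
    return out


print(tiled_kernel_model([1.0, 2.0, 3.0, 4.0]))               # [1.0, 3.0, 5.0, 7.0]
print(tiled_kernel_model([1.0, 2.0, 3.0, 4.0], tile_size=2))  # [1.0, 3.0, 3.0, 7.0]
```

The second call shows the tile-boundary effect: with a tile size of 2, element 2 is the first element of its tile, so it has no in-tile left neighbor and gets copied through.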

The 4-step tiling recipe:

  • 1️⃣ DECLARE — `__shared__ float tile[TILE_SIZE]` → allocate forward base
  • 2️⃣ LOAD — Each thread loads 1 element from global → shared memory
  • 3️⃣ SYNC — `__syncthreads()` → wait for ALL soldiers to reach the base
  • 4️⃣ COMPUTE — Do all calculations using fast shared memory
  • 🔄 Repeat — For large data, loop over tiles: load tile 1 → compute → load tile 2 → compute
```cuda
// REAL POWER: Tiled prefix sum over a large array
__global__ void tiledPrefixSum(float* data, float* out, int n) {
    __shared__ float tile[256];
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;

    // Load tile (pad out-of-range threads with 0)
    tile[tid] = (gid < n) ? data[gid] : 0.0f;
    __syncthreads();

    // Parallel inclusive scan (Hillis–Steele) within the tile
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float val = 0;
        if (tid >= stride)
            val = tile[tid - stride];
        __syncthreads();  // Wait: everyone reads before anyone writes
        tile[tid] += val;
        __syncthreads();  // Wait: everyone writes before the next round reads
    }

    if (gid < n)
        out[gid] = tile[tid];
}
```
📋 Instructions
Simulate tiled vs non-tiled stencil computation (each output = average of self + 2 neighbors). Show memory access counts:

```
=== Tiling Simulation: Stencil Computation ===
Array size: 8 elements, Tile size: 4

--- Non-Tiled (naive) ---
Each element reads 3 values from global memory
Total global reads: 24
Global memory efficiency: 1.00x (baseline)

--- Tiled ---
Tile 1 [0-3]: Load 4 from global, compute 4 outputs
Tile 2 [4-7]: Load 4 from global, compute 4 outputs
Total global reads: 8
Shared memory reads: 24
Global memory reduction: 3.00x fewer global accesses!
```
Run the code as-is. Key insight: tiling dramatically reduces GLOBAL memory accesses by reusing data from shared memory. A stencil that touches 3 values per output needs 3x fewer global reads when tiled. For matrix multiplication, tiling cuts global reads by up to a factor of TILE_SIZE, which is why tiled GEMM is so fast!
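One possible shape for the counting logic (a sketch, not the official main.py; it mirrors the simplified expected output, which counts 4 loads per tile):

```python
def simulate(n, tile_size):
    """Count memory accesses for a 3-point stencil, naive vs tiled."""
    naive_global = 3 * n        # every output re-reads 3 values from global
    tiled_global = 0
    shared_reads = 0
    for start in range(0, n, tile_size):
        width = min(tile_size, n - start)
        tiled_global += width   # one bulk global load per tile element
        shared_reads += 3 * width  # the stencil now reads the fast tile
    return naive_global, tiled_global, shared_reads


naive, tiled, shared = simulate(n=8, tile_size=4)
print(f"Total global reads (naive): {naive}")   # 24
print(f"Total global reads (tiled): {tiled}")   # 8
print(f"Shared memory reads:        {shared}")  # 24
print(f"Reduction: {naive / tiled:.2f}x")       # 3.00x
```

Note that a real tiled stencil also loads one halo element on each side of every tile, so the measured reduction would land slightly below 3.00x; the sketch matches the simplified counts the exercise prints.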