CUDA Programming Optimization — Make It Blazing Fast
Exercise 29: Shared Memory Tiling (25 XP, Hard)


🧩 Chapter 6, Part 4: Tiling — Work Smart, Not Hard

💡 Story: Your army needs to read from a huge warehouse (global memory) far away. Every time a soldier needs a supply, going to the warehouse takes 600 cycles. Smart generals set up a FORWARD BASE (shared memory) — bring supplies in BULK to the forward base, then distribute locally at 80 cycles. That bulk-load strategy is tiling!

```cuda
// ❌ NAIVE: Each thread loads its own data from global memory
__global__ void naiveSum(float* a, float* b, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        // Every thread independently reads from global memory
        out[i] = a[i] + b[i];  // 2 global reads per thread
    }
}
// For n=1024, that's 2048 global memory accesses. Fine for simple ops.
// But for algorithms that REUSE data (like matrix multiply), this is devastating!
```
```cuda
// ✅ TILED: Load a tile into shared memory, compute from fast shared memory
#define TILE_SIZE 16

__global__ void tiledKernel(float* input, float* output, int n) {
    // 1. DECLARE shared memory tile
    __shared__ float tile[TILE_SIZE];

    int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
    int localIdx  = threadIdx.x;

    // 2. LOAD: Each thread loads ONE element into the tile
    if (globalIdx < n)
        tile[localIdx] = input[globalIdx];  // Global → Shared

    // 3. SYNC: Wait for ALL threads to finish loading
    __syncthreads();

    // 4. COMPUTE: Now use fast shared memory
    // Example: sum each element with its left neighbor
    if (localIdx > 0 && globalIdx < n) {
        // Read from shared memory (fast!) instead of global memory
        output[globalIdx] = tile[localIdx] + tile[localIdx - 1];
    } else if (globalIdx < n) {
        // Tile boundary: the left neighbor lives in the PREVIOUS tile, so this
        // simplified kernel just copies. (A full stencil would load halo elements.)
        output[globalIdx] = tile[localIdx];
    }
    // Threads read tile[localIdx - 1], which ANOTHER thread wrote!
    // Without __syncthreads(), that data might not be there yet.
}
```
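The tiled kernel's behavior is easy to sanity-check with a small Python model (a sketch for illustration; `tiled_kernel_model` is a hypothetical helper, not part of the exercise). One Python list per block plays the role of `__shared__` memory:

```python
TILE_SIZE = 16  # mirrors the #define above


def tiled_kernel_model(inp, tile_size=TILE_SIZE):
    """Model of tiledKernel: one list per block stands in for __shared__."""
    out = [0.0] * len(inp)
    for start in range(0, len(inp), tile_size):
        # LOAD: each "thread" copies one element into the tile
        tile = inp[start:start + tile_size]
        # (__syncthreads() would sit here: all loads finish before any compute)
        # COMPUTE: neighbors come from the tile, not from `inp`
        for local in range(len(tile)):
            if local > 0:
                out[start + local] = tile[local] + tile[local - 1]
            else:
                out[start + local] = tile[local]  # tile boundary: plain copy
    return out


print(tiled_kernel_model([1.0, 2.0, 3.0, 4.0]))               # [1.0, 3.0, 5.0, 7.0]
print(tiled_kernel_model([1.0, 2.0, 3.0, 4.0], tile_size=2))  # [1.0, 3.0, 3.0, 7.0]
```

The second call shows the tile-boundary effect: with a tile size of 2, element 2 is the first element of its tile, so it has no in-tile left neighbor and gets copied through.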

The 4-step tiling recipe:

  • 1️⃣ DECLARE — `__shared__ float tile[TILE_SIZE]` → allocate forward base
  • 2️⃣ LOAD — Each thread loads 1 element from global → shared memory
  • 3️⃣ SYNC — `__syncthreads()` → wait for ALL soldiers to reach the base
  • 4️⃣ COMPUTE — Do all calculations using fast shared memory
  • 🔄 Repeat — For large data, loop over tiles: load tile 1 → compute → load tile 2 → compute
```cuda
// REAL POWER: Tiled prefix sum over a large array
__global__ void tiledPrefixSum(float* data, float* out, int n) {
    __shared__ float tile[256];
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;

    // Load tile (pad out-of-range threads with 0)
    tile[tid] = (gid < n) ? data[gid] : 0.0f;
    __syncthreads();

    // Parallel inclusive scan (Hillis–Steele) within the tile
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float val = 0;
        if (tid >= stride)
            val = tile[tid - stride];
        __syncthreads();  // Wait: everyone reads before anyone writes
        tile[tid] += val;
        __syncthreads();  // Wait: everyone writes before the next round reads
    }

    if (gid < n)
        out[gid] = tile[tid];
}
```
📋 Instructions
Simulate tiled vs non-tiled stencil computation (each output = average of self + 2 neighbors). Show memory access counts:

```
=== Tiling Simulation: Stencil Computation ===
Array size: 8 elements, Tile size: 4

--- Non-Tiled (naive) ---
Each element reads 3 values from global memory
Total global reads: 24
Global memory efficiency: 1.00x (baseline)

--- Tiled ---
Tile 1 [0-3]: Load 4 from global, compute 4 outputs
Tile 2 [4-7]: Load 4 from global, compute 4 outputs
Total global reads: 8
Shared memory reads: 24
Global memory reduction: 3.00x fewer global accesses!
```
Run the code as-is. Key insight: tiling dramatically reduces GLOBAL memory accesses by reusing data from shared memory. A stencil that touches 3 values per output needs 3x fewer global reads when tiled. For matrix multiplication, tiling cuts global reads by up to a factor of TILE_SIZE, which is why tiled GEMM is so fast!
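One possible shape for the counting logic (a sketch, not the official main.py; it mirrors the simplified expected output, which counts 4 loads per tile):

```python
def simulate(n, tile_size):
    """Count memory accesses for a 3-point stencil, naive vs tiled."""
    naive_global = 3 * n        # every output re-reads 3 values from global
    tiled_global = 0
    shared_reads = 0
    for start in range(0, n, tile_size):
        width = min(tile_size, n - start)
        tiled_global += width   # one bulk global load per tile element
        shared_reads += 3 * width  # the stencil now reads the fast tile
    return naive_global, tiled_global, shared_reads


naive, tiled, shared = simulate(n=8, tile_size=4)
print(f"Total global reads (naive): {naive}")   # 24
print(f"Total global reads (tiled): {tiled}")   # 8
print(f"Shared memory reads:        {shared}")  # 24
print(f"Reduction: {naive / tiled:.2f}x")       # 3.00x
```

Note that a real tiled stencil also loads one halo element on each side of every tile, so the measured reduction would land slightly below 3.00x; the sketch matches the simplified counts the exercise prints.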