🧩 Chapter 6, Part 4: Tiling — Work Smart, Not Hard
💡 Story: Your army needs to read from a huge warehouse (global memory) far away. Every time a soldier needs a supply, going to the warehouse takes 600 cycles. Smart generals set up a FORWARD BASE (shared memory) — bring supplies in BULK to the forward base, then distribute locally at 80 cycles. That bulk-load strategy is tiling!
// ❌ NAIVE: Each thread loads its own data from global memory
__global__ void naiveSum(float* a, float* b, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        // Every thread independently reads from global memory
        out[i] = a[i] + b[i];  // 2 global reads per thread
    }
}
// For n=1024, that's 2048 global reads (plus 1024 writes). Fine for simple element-wise ops.
// But for algorithms that REUSE data (like matrix multiply), hitting global memory every time is devastating!
// ✅ TILED: Load tile into shared memory, compute from fast shared memory
#define TILE_SIZE 16
__global__ void tiledKernel(float* input, float* output, int n) {
    // 1. DECLARE shared memory tile
    __shared__ float tile[TILE_SIZE];
    int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
    int localIdx = threadIdx.x;
    // 2. LOAD: Each thread loads ONE element into the tile
    if (globalIdx < n)
        tile[localIdx] = input[globalIdx];  // Global → Shared
    // 3. SYNC: Wait for ALL threads to finish loading
    __syncthreads();
    // 4. COMPUTE: Now use fast shared memory
    // Example: each output adds its left neighbor's value
    if (localIdx > 0 && globalIdx < n) {
        // Read from shared memory (fast!) instead of global memory
        output[globalIdx] = tile[localIdx] + tile[localIdx - 1];
    } else if (globalIdx < n) {
        // First thread of each tile has no left neighbor in its tile
        output[globalIdx] = tile[localIdx];
    }
    // Threads read tile[localIdx - 1], which ANOTHER thread wrote!
    // Without __syncthreads(), that data might not be there yet.
}
The 4-step tiling recipe:
1. DECLARE a __shared__ tile sized for the block.
2. LOAD: each thread copies one element from global memory into the tile.
3. SYNC with __syncthreads() so every element is in place before anyone reads.
4. COMPUTE from the fast shared-memory tile (and sync again before any thread overwrites it).
// REAL POWER: Tiled prefix sum over large array
__global__ void tiledPrefixSum(float* data, float* out, int n) {
    __shared__ float tile[256];  // assumes blockDim.x <= 256
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;
    // Load tile
    tile[tid] = (gid < n) ? data[gid] : 0.0f;
    __syncthreads();
    // Inclusive scan within the tile (Hillis-Steele style)
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float val = 0.0f;
        if (tid >= stride) val = tile[tid - stride];
        __syncthreads();  // Everyone reads before anyone modifies tile!
        tile[tid] += val;
        __syncthreads();  // Everyone writes before the next round reads!
    }
    if (gid < n) out[gid] = tile[tid];  // per-block prefix sum
}
📋 Instructions
Simulate tiled vs non-tiled stencil computation (each output = avg of self + 2 neighbors; ignore halo loads at tile boundaries for simplicity). Show memory access counts:
```
=== Tiling Simulation: Stencil Computation ===
Array size: 8 elements, Tile size: 4
--- Non-Tiled (naive) ---
Each element reads 3 values from global memory
Total global reads: 24
Global memory efficiency: 1.00x (baseline)
--- Tiled ---
Tile 1 [0-3]: Load 4 from global, compute 4 outputs
Tile 2 [4-7]: Load 4 from global, compute 4 outputs
Total global reads: 8
Shared memory reads: 24
Global memory reduction: 3.00x fewer global accesses!
```
Run the code as-is. Key insight: tiling reduces GLOBAL memory accesses dramatically by reusing shared memory. A stencil that touches 3 values per output = 3x fewer global reads when tiled. For matrix multiplication, tiling gives up to TILE_SIZE times fewer global reads — that's why tiled GEMM is so fast!