🧩 Chapter 8, Part 3: Tiled Matrix Multiply — The King of GPU Algorithms
💡 Story: Your naive soldiers went to the warehouse (global memory) 2 BILLION times for a 1000×1000 matmul (each of the million output elements reads 2×1000 values) — exhausting! Now the smart General orders: instead of fetching individual items, bring a whole 16×16 crate (tile) to the forward base (shared memory). Compute everything using that tile. Then swap crates. 16x fewer warehouse trips!
```
#define TILE_SIZE 16

__global__ void tiledMatMul(float* A, float* B, float* C, int M, int K, int N) {
    __shared__ float A_tile[TILE_SIZE][TILE_SIZE];
    __shared__ float B_tile[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Loop over tiles — each iteration processes one TILE_SIZE slice of K
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Step 1: LOAD — cooperatively load tiles into shared memory
        int aCol = t * TILE_SIZE + threadIdx.x; // Column in A to load
        int bRow = t * TILE_SIZE + threadIdx.y; // Row in B to load
        A_tile[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.f;
        B_tile[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.f;
        __syncthreads(); // Wait for ALL threads to finish loading!

        // Step 2: COMPUTE — dot product using the shared-memory tiles
        for (int k = 0; k < TILE_SIZE; k++)
            sum += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];
        __syncthreads(); // Wait before loading the next tile (prevent overwrite!)
    }

    if (row < M && col < N) C[row * N + col] = sum;
}
```
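For context, a host-side launch pairs one thread with one output element and rounds the grid up so partial tiles at the edges are covered. A minimal sketch (the device pointers dA, dB, dC, their allocation, and error checking are assumed, not shown in this chapter):

```
dim3 block(TILE_SIZE, TILE_SIZE);               // 16x16 = 256 threads per block
dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE,      // ceil(N / TILE_SIZE) tiles across
          (M + TILE_SIZE - 1) / TILE_SIZE);     // ceil(M / TILE_SIZE) tiles down
tiledMatMul<<<grid, block>>>(dA, dB, dC, M, K, N);
cudaDeviceSynchronize();                        // wait for the kernel to finish
```

The ceil-division in the grid sizing is what makes the in-kernel bounds checks (`row < M`, `aCol < K`, etc.) necessary: edge blocks contain threads that fall outside the matrices and must load zeros instead.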
Why two __syncthreads() calls?
1️⃣ After LOAD — Ensures all threads finished loading their tile element before ANY thread starts computing
2️⃣ After COMPUTE — Ensures all threads finished computing before the NEXT iteration overwrites the tiles
⚠️ Without sync 1 — Thread (0,0) might compute before thread (15,15) finishes loading → wrong tile values
⚠️ Without sync 2 — Fast threads load the next tile while slow threads still compute from the previous tile!
🚀 Speedup — TILE_SIZE × fewer global memory reads. TILE=16 → 16x less bandwidth needed
📋 Instructions
Simulate tiled matrix multiply for a 4×4 × 4×4 multiplication with TILE_SIZE=2, showing the tile phases:
```
=== Tiled MatMul Simulation ===
Matrix size: 4x4, Tile size: 2
Number of tile phases: 2
Phase 1 (t=0): Load A cols 0-1, B rows 0-1
A_tile = [[1,2],[5,6]] B_tile = [[1,2],[5,6]]
Partial sums accumulated for block (0,0): C[0..1][0..1]
Phase 2 (t=1): Load A cols 2-3, B rows 2-3
A_tile = [[3,4],[7,8]] B_tile = [[9,10],[13,14]]
Add to partial sums
Final C[0][0] = 1*1+2*5 + 3*9+4*13 = 11 + 79 = 90
Global reads (naive): 8 per element, 128 total
Global reads (tiled): 4 per element, 64 total
Memory savings: 2.00x
```
With TILE=2 and K=4, you get 2x savings (4 vs 8 global reads per element) — the savings factor equals TILE_SIZE. With TILE=16 on a 1024×1024 matrix, you'd get 16x savings. Larger tiles save more bandwidth, but tile size is capped by shared-memory capacity and by occupancy: a bigger tile uses more shared memory per block, so fewer blocks fit on each SM. High-performance libraries like cuBLAS typically go further, combining larger tiles with register blocking.