CUDA Programming: Matrix Operations — The Core of AI
Exercise 38

Tiled Matrix Multiply 30 XP Hard


🧩 Chapter 8, Part 3: Tiled Matrix Multiply — The King of GPU Algorithms

💡 Story: Your naive soldiers went to the warehouse (global memory) over a BILLION times for a 1000×1000 matmul — exhausting! Now the smart General orders: instead of fetching individual items, bring a whole 16×16 crate (tile) to the forward base (shared memory). Compute everything using that tile. Then swap crates. 16x fewer warehouse trips, and up to 16x faster when memory bandwidth is the bottleneck!

```
#define TILE_SIZE 16

__global__ void tiledMatMul(float* A, float* B, float* C, int M, int K, int N) {
    __shared__ float A_tile[TILE_SIZE][TILE_SIZE];
    __shared__ float B_tile[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Loop over tiles — each iteration processes one TILE_SIZE slice of K
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Step 1: LOAD — cooperatively load tile into shared memory
        int aCol = t * TILE_SIZE + threadIdx.x;  // Column in A to load
        int bRow = t * TILE_SIZE + threadIdx.y;  // Row in B to load
        A_tile[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.f;
        B_tile[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.f;
        __syncthreads();  // Wait for ALL threads to finish loading!

        // Step 2: COMPUTE — dot product using shared memory tiles
        for (int k = 0; k < TILE_SIZE; k++)
            sum += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];
        __syncthreads();  // Wait before loading next tile (prevent overwrite!)
    }

    if (row < M && col < N)
        C[row * N + col] = sum;
}
```

Why two __syncthreads() calls?

  • 1️⃣ After LOAD — Ensures all threads finished loading their tile element before ANY thread starts computing
  • 2️⃣ After COMPUTE — Ensures all threads finished computing before the NEXT iteration overwrites the tiles
  • ⚠️ Without sync 1 — Thread (0,0) might compute before thread (15,15) finishes loading → wrong tile values
  • ⚠️ Without sync 2 — Fast threads load the next tile while slow threads still compute from the previous tile!
  • 🚀 Speedup — TILE_SIZE × fewer global memory reads. TILE=16 → 16x less bandwidth needed
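The load/compute phases above can be checked on the CPU. Here is a plain-Python sketch of the same tiling loop — a sequential simulation, not GPU code, so no barriers are needed; comments mark where the two `__syncthreads()` calls would sit on the GPU:

```python
# Pure-Python simulation of tiled matmul phases (illustrative, not GPU code).
TILE = 2
M = K = N = 4
A = [[r * K + c + 1 for c in range(K)] for r in range(M)]  # [[1,2,3,4],[5,6,7,8],...]
B = [[r * N + c + 1 for c in range(N)] for r in range(K)]  # same values as A

C = [[0.0] * N for _ in range(M)]
phases = (K + TILE - 1) // TILE
for by in range(M // TILE):          # block row (blockIdx.y)
    for bx in range(N // TILE):      # block col (blockIdx.x)
        for t in range(phases):      # one iteration per tile phase
            # LOAD phase: copy one TILE x TILE crate of A and of B.
            # (On the GPU, __syncthreads() #1 goes after this step.)
            A_tile = [[A[by*TILE + ty][t*TILE + tx] for tx in range(TILE)]
                      for ty in range(TILE)]
            B_tile = [[B[t*TILE + ty][bx*TILE + tx] for tx in range(TILE)]
                      for ty in range(TILE)]
            # COMPUTE phase: accumulate partial dot products from the tiles.
            # (On the GPU, __syncthreads() #2 goes after this, before the next load.)
            for ty in range(TILE):
                for tx in range(TILE):
                    C[by*TILE + ty][bx*TILE + tx] += sum(
                        A_tile[ty][k] * B_tile[k][tx] for k in range(TILE))

print(C[0][0])  # prints 90.0, matching 1*1 + 2*5 + 3*9 + 4*13
```

Because the simulation is sequential, each block finishes a phase before the next begins — exactly the ordering the two barriers enforce on the GPU.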
📋 Instructions
Simulate tiled matrix multiply for a 4×4 × 4×4 multiplication with TILE_SIZE=2, showing the tile phases:

```
=== Tiled MatMul Simulation ===
Matrix size: 4x4, Tile size: 2
Number of tile phases: 2

Phase 1 (t=0): Load A columns 0-1, B rows 0-1
  A_tile = [[1,2],[5,6]]
  B_tile = [[1,2],[5,6]]
  Partial sums computed for C[0][0..3] and C[1][0..3]

Phase 2 (t=1): Load A columns 2-3, B rows 2-3
  A_tile = [[3,4],[7,8]]
  B_tile = [[9,10],[13,14]]
  Add to partial sums

Final C[0][0] = 1*1+2*5 + 3*9+4*13 = 11 + 79 = 90

Global reads (naive): 8 per element, 128 total
Global reads (tiled): 4 per element, 64 total
Memory savings: 2.00x
```
With TILE=2 and K=4, you get 2x savings (4 vs 8 global reads per element). With TILE=16 on a 1024×1024 matrix, you'd get 16x savings (128 vs 2048 reads per element). Larger tiles save more, limited by shared memory capacity and occupancy. High-performance libraries like cuBLAS use tiles of 32×32 and larger, plus register blocking, for peak performance.
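The read-count arithmetic generalizes; this short helper (hypothetical, not part of any CUDA API) counts one global read per operand fetched, so naive needs 2K reads per output element and tiled needs two tile elements per phase:

```python
def global_reads(K, tile):
    """Global memory reads per C element: naive vs. tiled (one read per operand)."""
    naive_per_elem = 2 * K                # K reads from A + K reads from B
    phases = (K + tile - 1) // tile       # number of tile phases
    tiled_per_elem = 2 * phases           # one A-tile + one B-tile element per phase
    return naive_per_elem, tiled_per_elem

for k, tile in [(4, 2), (1024, 16)]:
    naive, tiled = global_reads(k, tile)
    print(f"K={k}, TILE={tile}: naive {naive}/elem, tiled {tiled}/elem, "
          f"savings {naive / tiled:.2f}x")
# K=4, TILE=2: naive 8/elem, tiled 4/elem, savings 2.00x
# K=1024, TILE=16: naive 2048/elem, tiled 128/elem, savings 16.00x
```

When `tile` divides `K` evenly, the savings factor is exactly `tile`, which is why the savings ceiling is set by how large a tile fits in shared memory.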