🧩 Chapter 8, Part 3: Tiled Matrix Multiply — The King of GPU Algorithms
💡 Story: Your naive soldiers went to the warehouse (global memory) 2 BILLION times for a 1000×1000 matmul (each of the million output elements reads 2×1000 values) — exhausting! Now the smart General orders: instead of fetching individual items, bring a whole 16×16 crate (tile) to the forward base (shared memory). Compute everything using that tile. Then swap crates. 16x fewer warehouse trips!
```
#define TILE_SIZE 16

__global__ void tiledMatMul(float* A, float* B, float* C, int M, int K, int N) {
    __shared__ float A_tile[TILE_SIZE][TILE_SIZE];
    __shared__ float B_tile[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Loop over tiles — each iteration processes one TILE_SIZE slice of K
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Step 1: LOAD — cooperatively load tiles into shared memory
        int aCol = t * TILE_SIZE + threadIdx.x; // Column in A to load
        int bRow = t * TILE_SIZE + threadIdx.y; // Row in B to load
        A_tile[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.f;
        B_tile[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.f;
        __syncthreads(); // Wait for ALL threads to finish loading!

        // Step 2: COMPUTE — dot product using the shared-memory tiles
        for (int k = 0; k < TILE_SIZE; k++)
            sum += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];
        __syncthreads(); // Wait before loading the next tile (prevent overwrite!)
    }

    if (row < M && col < N) C[row * N + col] = sum;
}
```
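For context, a host-side launch pairs one thread with one output element and rounds the grid up so partial tiles at the edges are covered. A minimal sketch (the device pointers dA, dB, dC, their allocation, and error checking are assumed, not shown in this chapter):

```
dim3 block(TILE_SIZE, TILE_SIZE);               // 16x16 = 256 threads per block
dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE,      // ceil(N / TILE_SIZE) tiles across
          (M + TILE_SIZE - 1) / TILE_SIZE);     // ceil(M / TILE_SIZE) tiles down
tiledMatMul<<<grid, block>>>(dA, dB, dC, M, K, N);
cudaDeviceSynchronize();                        // wait for the kernel to finish
```

The ceil-division in the grid sizing is what makes the in-kernel bounds checks (`row < M`, `aCol < K`, etc.) necessary: edge blocks contain threads that fall outside the matrices and must load zeros instead.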
Why two __syncthreads() calls?
1️⃣ After LOAD — Ensures all threads finished loading their tile element before ANY thread starts computing
2️⃣ After COMPUTE — Ensures all threads finished computing before the NEXT iteration overwrites the tiles
⚠️ Without sync 1 — Thread (0,0) might compute before thread (15,15) finishes loading → wrong tile values
⚠️ Without sync 2 — Fast threads load the next tile while slow threads still compute from the previous tile!
🚀 Speedup — TILE_SIZE × fewer global memory reads. TILE=16 → 16x less bandwidth needed
📋 Instructions
Simulate tiled matrix multiply for a 4×4 × 4×4 multiplication with TILE_SIZE=2, showing the tile phases:
```
=== Tiled MatMul Simulation ===
Matrix size: 4x4, Tile size: 2
Number of tile phases: 2
Phase 1 (t=0): Load A cols 0-1, B rows 0-1
A_tile = [[1,2],[5,6]] B_tile = [[1,2],[5,6]]
Partial sums accumulated for block (0,0): C[0..1][0..1]
Phase 2 (t=1): Load A cols 2-3, B rows 2-3
A_tile = [[3,4],[7,8]] B_tile = [[9,10],[13,14]]
Add to partial sums
Final C[0][0] = 1*1+2*5 + 3*9+4*13 = 11 + 79 = 90
Global reads (naive): 8 per element, 128 total
Global reads (tiled): 4 per element, 64 total
Memory savings: 2.00x
```
With TILE=2 and K=4, you get 2x savings (4 vs 8 global reads per element) — the savings factor equals TILE_SIZE. With TILE=16 on a 1024×1024 matrix, you'd get 16x savings. Larger tiles save more bandwidth, but tile size is capped by shared-memory capacity and by occupancy: a bigger tile uses more shared memory per block, so fewer blocks fit on each SM. High-performance libraries like cuBLAS typically go further, combining larger tiles with register blocking.