CUDA Programming: Matrix Operations — The Core of AI
Exercise 37

Naive Matrix Multiply (25 XP, Medium)


Chapter 8, Part 2: Naive Matrix Multiply — Many Dots, One GPU

💡 Story: C = A × B means each element C[i][j] = dot product of row i of A and column j of B. For a 1000×1000 matrix, that's 1 million elements to compute, each needing 1000 multiplications and additions. That's 1 billion operations total! A CPU takes seconds. A GPU does it in milliseconds because every output element can be computed SIMULTANEOUSLY.

```
// Naive matrix multiply: each thread computes ONE output element
__global__ void naiveMatMul(float* A, float* B, float* C, int M, int K, int N) {
    // A is M×K, B is K×N, C is M×N (row-major)
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // Row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Col of C
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];  // A[row][k] × B[k][col] — the dot product!
        }
        C[row * N + col] = sum;
    }
}

// For 1024×1024 matrices (M = K = N = 1024):
dim3 block(16, 16);  // 256 threads per block
dim3 grid(64, 64);   // 4096 blocks — 64 × 16 = 1024 threads per dimension
naiveMatMul<<<grid, block>>>(A, B, C, 1024, 1024, 1024);

// Each of the ~1M threads reads 1024 values from A and 1024 from B!
// That's ~2 BILLION global memory reads — slow because of memory bandwidth
```

Why naive matmul is slow:

  • 📖 Global memory reads — For N×N multiply: each thread reads 2N values from global memory
  • 🔁 Redundant reads — Row i of A is read by ALL N threads computing row i of C (N times!)
  • 📉 Memory bound — GPU compute units sit idle waiting for memory fetches
  • 🚀 Fix: Tiling — Load a tile of A and a tile of B into shared memory; each loaded value is then reused TILE_SIZE times
  • 📊 Arithmetic Intensity — Naive: 1 flop per global read. Tiled: TILE_SIZE flops per global read!
```
// Complexity and memory analysis for N×N matrix multiply:
// Operations:   O(N³) — for 1024×1024: ~1 billion multiply-adds
// Global reads: O(N³) for naive (each thread reads 2N values; N² threads)
//               O(N³ / TILE) for tiled (each loaded value reused TILE times)
//
// With TILE = 16: 16× fewer global memory accesses → ~16× less bandwidth needed
// This is why tiled GEMM is used in every serious ML framework!
```
📋 Instructions
Multiply two 3×3 matrices and show the dot-product computation for the first element:

```
=== Naive Matrix Multiply: C = A × B ===

A (3x3) × B (3x3) = C (3x3)

Computing C[0][0] = dot(row0_A, col0_B):
  A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0]
  = 1*7 + 2*9 + 3*11
  = 7 + 18 + 33
C[0][0] = 58

Full result C = A × B:
   58   64   70
  139  154  169
  220  244  268

On GPU: all 9 elements computed simultaneously!
```
The triple nested loop (row, col, k) runs sequentially on a CPU. On the GPU, the outer two loops (row, col) are replaced by the thread grid — only the inner k loop remains serial within each thread. Each of the N² threads independently computes its own sum, all in parallel.