✖️ Chapter 8, Part 2: Naive Matrix Multiply — Many Dots, One GPU
💡 Story: C = A × B means each element C[i][j] = dot product of row i of A and column j of B. For a 1000×1000 matrix, that's 1 million elements to compute, each needing 1000 multiplications and additions. That's a billion multiply-adds — roughly 2 billion floating-point operations total! A CPU core grinds through them sequentially in seconds. A GPU does it in milliseconds because every output element can be computed SIMULTANEOUSLY.
// Naive matrix multiply: each thread computes ONE output element
__global__ void naiveMatMul(float* A, float* B, float* C, int M, int K, int N) {
    // A is M×K, B is K×N, C is M×N
    int row = blockIdx.y * blockDim.y + threadIdx.y; // Row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x; // Col of C
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col]; // A[row][k] × B[k][col] — the dot product!
        }
        C[row * N + col] = sum;
    }
}
// For 1024×1024 matrices (M=K=N=1024):
dim3 block(16, 16); // 256 threads per block
dim3 grid(64, 64); // 4096 blocks
naiveMatMul<<<grid, block>>>(A, B, C, 1024, 1024, 1024);
// Each of the ~1M threads reads 1024 values from A and 1024 from B!
// That's 2 BILLION global memory reads (~8 GiB of traffic at 4 bytes/float) —
// the kernel is memory-bandwidth bound, not compute bound.
Why naive matmul is slow:
// Complexity and memory analysis for N×N matrix multiply:
// Operations: O(N³) = for 1024×1024: ~1 billion multiply-adds
// Global reads: O(N³) for naive (both A and B read N times each!)
// O(N³/TILE) for tiled (TILE times less global memory)
//
// With TILE=16: 16x fewer global memory accesses → ~16x memory bandwidth reduction
// This is why tiled GEMM is used in every serious ML framework!
📋 Instructions
Multiply two 3×3 matrices and show the dot product computation for the first element:
```
=== Naive Matrix Multiply: C = A × B ===
A (3x3) × B (3x3) = C (3x3)
Computing C[0][0] = dot(row0_A, col0_B):
A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0]
= 1*7 + 2*9 + 3*11
= 7 + 18 + 33
C[0][0] = 58
Full result C = A × B:
58 64 70
139 154 169
220 244 268
On GPU: all 9 elements computed simultaneously!
```
The triple nested loop (row, col, k) runs sequentially on a CPU. On the GPU, the outer two loops (row, col) are replaced by the thread grid — only the inner `k` loop remains serial within each thread. Each of the N² threads independently accumulates its own sum, all in parallel.