CUDA Programming: Matrix Operations — The Core of AI
Exercise 37

Naive Matrix Multiply (25 XP, Medium)


Chapter 8, Part 2: Naive Matrix Multiply — Many Dots, One GPU

💡 Story: C = A × B means each element C[i][j] = dot product of row i of A and column j of B. For a 1000×1000 matrix, that's 1 million elements to compute, each needing 1000 multiplications and additions. That's 1 billion operations total! A CPU takes seconds. A GPU does it in milliseconds because every output element can be computed SIMULTANEOUSLY.

```
// Naive matrix multiply: each thread computes ONE output element
__global__ void naiveMatMul(float* A, float* B, float* C, int M, int K, int N) {
    // A is M×K, B is K×N, C is M×N (row-major)
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // Row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Col of C
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];  // A[row][k] × B[k][col] — the dot product!
        }
        C[row * N + col] = sum;
    }
}

// For 1024×1024 matrices (M = K = N = 1024):
dim3 block(16, 16);  // 256 threads per block
dim3 grid(64, 64);   // 4096 blocks — 64 × 16 = 1024 threads per dimension
naiveMatMul<<<grid, block>>>(A, B, C, 1024, 1024, 1024);

// Each of the ~1M threads reads 1024 values from A and 1024 from B!
// That's ~2 BILLION global memory reads — slow because of memory bandwidth
```

Why naive matmul is slow:

  • 📖 Global memory reads — For N×N multiply: each thread reads 2N values from global memory
  • 🔁 Redundant reads — Row i of A is read by ALL N threads computing row i of C (N times!)
  • 📉 Memory bound — GPU compute units sit idle waiting for memory fetches
  • 🚀 Fix: Tiling — Load a tile of A and a tile of B into shared memory; each loaded value is then reused TILE_SIZE times
  • 📊 Arithmetic Intensity — Naive: 1 flop per global read. Tiled: TILE_SIZE flops per global read!
```
// Complexity and memory analysis for N×N matrix multiply:
// Operations:   O(N³) — for 1024×1024: ~1 billion multiply-adds
// Global reads: O(N³) for naive (each thread reads 2N values; N² threads)
//               O(N³ / TILE) for tiled (each loaded value reused TILE times)
//
// With TILE = 16: 16× fewer global memory accesses → ~16× less bandwidth needed
// This is why tiled GEMM is used in every serious ML framework!
```
📋 Instructions
Multiply two 3×3 matrices and show the dot-product computation for the first element:

```
=== Naive Matrix Multiply: C = A × B ===

A (3x3) × B (3x3) = C (3x3)

Computing C[0][0] = dot(row0_A, col0_B):
  A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0]
  = 1*7 + 2*9 + 3*11
  = 7 + 18 + 33
C[0][0] = 58

Full result C = A × B:
   58   64   70
  139  154  169
  220  244  268

On GPU: all 9 elements computed simultaneously!
```
The triple nested loop (row, col, k) runs sequentially on a CPU. On the GPU, the outer two loops (row, col) are replaced by the thread grid — only the inner k loop remains serial within each thread. Each of the N² threads independently computes its own sum, all in parallel.