CUDA Programming: Matrix Operations — The Core of AI
Exercise 36

Matrix Addition · 15 XP · Easy


Chapter 8, Part 1: Matrix Addition — The GPU's Favorite Warm-Up

💡 Story: Two armies each have a 4×4 battle map. The General wants to combine them: add each cell from map A to the corresponding cell in map B. On a CPU, you loop row by row, column by column — N² steps. On a GPU, you assign one soldier to each cell. Every cell is added simultaneously — BOOM, done in 1 step (well, a few clock cycles)!

```cuda
// Matrix addition: C = A + B
// Each thread handles ONE element at position (row, col)
__global__ void matAdd(float* A, float* B, float* C, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // Which row?
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Which column?
    if (row < rows && col < cols) {
        int idx = row * cols + col;  // Row-major 2D → 1D index
        C[idx] = A[idx] + B[idx];
    }
}

// Launch config for a 1024×1024 matrix:
dim3 blockSize(16, 16);  // 16×16 = 256 threads per block
dim3 gridSize(64, 64);   // 64×64 = 4096 blocks
// Total threads: 4096 × 256 = 1,048,576 (one per element!)
matAdd<<<gridSize, blockSize>>>(A, B, C, 1024, 1024);
```

Why matrix ops matter so much:

  • 🧠 Neural networks are matrix ops — Forward/backward pass = matrix multiplications
  • 📊 Dense layers — Output = Weight_Matrix × Input_Vector + Bias
  • 🖼️ Conv layers — Can be rewritten as matrix multiplication (im2col trick)
  • ⚙️ Transformers — Attention = softmax(Q × Kᵀ / √d) × V, all matrix ops!
  • 🚀 GPU dominance — This is why GPUs are used for AI: massively parallel matmul
```cuda
// Row-major index formula (CRITICAL to remember!):
// For matrix M of dimensions rows×cols:
//   M[row][col] = M_flat[row * cols + col]
//
// Example: M is 3×4 (3 rows, 4 cols)
//   M[1][2] = M_flat[1 * 4 + 2] = M_flat[6]
//
// Matrix layout in memory:
// Row 0: [M[0][0], M[0][1], M[0][2], M[0][3]]
// Row 1: [M[1][0], M[1][1], M[1][2], M[1][3]]
// Row 2: [M[2][0], M[2][1], M[2][2], M[2][3]]
```
📋 Instructions
Add two 3×3 matrices and print the result formatted as a matrix:

```
=== Matrix Addition C = A + B ===
Matrix A:      Matrix B:
 1  2  3        9  8  7
 4  5  6        6  5  4
 7  8  9        3  2  1
Matrix C (A+B):
10 10 10
10 10 10
10 10 10
All 9 elements computed in parallel on GPU!
```
Matrix addition is embarrassingly parallel — no thread depends on any other thread's output. That makes it perfect for the GPU: each element is completely independent. The same principle applies to element-wise operations in PyTorch/TensorFlow (tensor.add, tensor.mul, the ReLU activation, etc.).