➕ Chapter 8, Part 1: Matrix Addition — The GPU's Favorite Warm-Up
💡 Story: Two armies each have a 4×4 battle map. The General wants to combine them: add each cell from map A to the corresponding cell in map B. On a CPU, you loop row by row, column by column — N² steps. On a GPU, you assign one soldier to each cell. Every cell is added simultaneously — BOOM, done in 1 step (well, a few clock cycles)!
// Matrix addition: C = A + B
// Each thread handles ONE element at position (row, col)
__global__ void matAdd(float* A, float* B, float* C, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // Which row?
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Which column?
    if (row < rows && col < cols) {
        int idx = row * cols + col;  // Row-major 2D → 1D index
        C[idx] = A[idx] + B[idx];
    }
}
// Launch config for a 1024×1024 matrix:
dim3 blockSize(16, 16);  // 16×16 = 256 threads per block
dim3 gridSize(64, 64);   // 1024/16 = 64 blocks per dimension → 64×64 = 4096 blocks
// Total threads: 4096 × 256 = 1,048,576 (exactly one per element!)
matAdd<<<gridSize, blockSize>>>(A, B, C, 1024, 1024);
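For completeness, here's a hedged host-side sketch of the full launch: allocate device memory, copy inputs over, launch, copy the result back. The grid size uses ceiling division so the same code works for matrices whose sides aren't multiples of 16 (the `hostA`/`hostB`/`hostC` names and the omission of error checks are my simplifications, not from this chapter).

```cuda
// Host-side sketch — assumes the matAdd kernel above; error checks omitted for brevity.
int rows = 1024, cols = 1024;
size_t bytes = (size_t)rows * cols * sizeof(float);

float *dA, *dB, *dC;
cudaMalloc(&dA, bytes);
cudaMalloc(&dB, bytes);
cudaMalloc(&dC, bytes);

cudaMemcpy(dA, hostA, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, hostB, bytes, cudaMemcpyHostToDevice);

// Ceiling division: enough blocks to cover every element, even if
// rows/cols don't divide evenly by 16 (the kernel's bounds check
// handles the leftover threads).
dim3 blockSize(16, 16);
dim3 gridSize((cols + blockSize.x - 1) / blockSize.x,
              (rows + blockSize.y - 1) / blockSize.y);
matAdd<<<gridSize, blockSize>>>(dA, dB, dC, rows, cols);

cudaMemcpy(hostC, dC, bytes, cudaMemcpyDeviceToHost);
cudaFree(dA); cudaFree(dB); cudaFree(dC);
```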
One detail to burn into memory before going further — how a 2D matrix maps onto 1D memory:
// Row-major index formula (CRITICAL to remember!):
// For matrix M of dimensions rows×cols:
// M[row][col] = M_flat[row * cols + col]
//
// Example: M is 3×4 (3 rows, 4 cols)
// M[1][2] = M_flat[1 * 4 + 2] = M_flat[6]
//
// Matrix layout in memory:
// Row 0: [M[0][0], M[0][1], M[0][2], M[0][3]]
// Row 1: [M[1][0], M[1][1], M[1][2], M[1][3]]
// Row 2: [M[2][0], M[2][1], M[2][2], M[2][3]]
So why do matrix ops matter so much? Matrix addition is embarrassingly parallel — no thread depends on any other thread's output, which makes it a perfect fit for the GPU: every element is computed completely independently. The same principle powers element-wise operations in PyTorch/TensorFlow (tensor.add, tensor.mul, the ReLU activation, etc.).
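Since ReLU came up: it follows the exact same one-thread-per-element pattern, just in 1D. A hedged sketch (my own kernel, not from a framework):

```cuda
// Element-wise ReLU: out[i] = max(0, in[i]) — same pattern as matAdd, but 1D.
__global__ void relu(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // One thread per element
    if (i < n) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}

// Launch with 256 threads per block, ceiling division for the grid:
// relu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```

Swap the body for `in[i] * scale`, `in[i] + bias[i]`, or any other per-element expression and the structure doesn't change — that's why GPUs chew through these ops so easily.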