➕ Chapter 8, Part 1: Matrix Addition — The GPU's Favorite Warm-Up
💡 Story: Two armies each have a 4×4 battle map. The General wants to combine them: add each cell from map A to the corresponding cell in map B. On a CPU, you loop row by row, column by column — N² steps. On a GPU, you assign one soldier to each cell. Every cell is added simultaneously — BOOM, done in 1 step (well, a few clock cycles)!
// Matrix addition: C = A + B
// Each thread handles ONE element at position (row, col)
__global__ void matAdd(float* A, float* B, float* C, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // Which row?
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Which column?
    if (row < rows && col < cols) {
        int idx = row * cols + col;  // Row-major 2D → 1D index
        C[idx] = A[idx] + B[idx];
    }
}
// Launch config for a 1024×1024 matrix:
dim3 blockSize(16, 16);  // 16×16 = 256 threads per block
dim3 gridSize(64, 64);   // 1024/16 = 64 blocks per dimension → 64×64 = 4096 blocks
// Total threads: 4096 × 256 = 1,048,576 (exactly one per element!)
matAdd<<<gridSize, blockSize>>>(A, B, C, 1024, 1024);
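For completeness, here's a hedged host-side sketch of the full launch: allocate device memory, copy inputs over, launch, copy the result back. The grid size uses ceiling division so the same code works for matrices whose sides aren't multiples of 16 (the `hostA`/`hostB`/`hostC` names and the omission of error checks are my simplifications, not from this chapter).

```cuda
// Host-side sketch — assumes the matAdd kernel above; error checks omitted for brevity.
int rows = 1024, cols = 1024;
size_t bytes = (size_t)rows * cols * sizeof(float);

float *dA, *dB, *dC;
cudaMalloc(&dA, bytes);
cudaMalloc(&dB, bytes);
cudaMalloc(&dC, bytes);

cudaMemcpy(dA, hostA, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, hostB, bytes, cudaMemcpyHostToDevice);

// Ceiling division: enough blocks to cover every element, even if
// rows/cols don't divide evenly by 16 (the kernel's bounds check
// handles the leftover threads).
dim3 blockSize(16, 16);
dim3 gridSize((cols + blockSize.x - 1) / blockSize.x,
              (rows + blockSize.y - 1) / blockSize.y);
matAdd<<<gridSize, blockSize>>>(dA, dB, dC, rows, cols);

cudaMemcpy(hostC, dC, bytes, cudaMemcpyDeviceToHost);
cudaFree(dA); cudaFree(dB); cudaFree(dC);
```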
One detail to burn into memory before going further — how a 2D matrix maps onto 1D memory:
// Row-major index formula (CRITICAL to remember!):
// For matrix M of dimensions rows×cols:
// M[row][col] = M_flat[row * cols + col]
//
// Example: M is 3×4 (3 rows, 4 cols)
// M[1][2] = M_flat[1 * 4 + 2] = M_flat[6]
//
// Matrix layout in memory:
// Row 0: [M[0][0], M[0][1], M[0][2], M[0][3]]
// Row 1: [M[1][0], M[1][1], M[1][2], M[1][3]]
// Row 2: [M[2][0], M[2][1], M[2][2], M[2][3]]
So why do matrix ops matter so much? Matrix addition is embarrassingly parallel — no thread depends on any other thread's output, which makes it a perfect fit for the GPU: every element is computed completely independently. The same principle powers element-wise operations in PyTorch/TensorFlow (tensor.add, tensor.mul, the ReLU activation, etc.).
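Since ReLU came up: it follows the exact same one-thread-per-element pattern, just in 1D. A hedged sketch (my own kernel, not from a framework):

```cuda
// Element-wise ReLU: out[i] = max(0, in[i]) — same pattern as matAdd, but 1D.
__global__ void relu(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // One thread per element
    if (i < n) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}

// Launch with 256 threads per block, ceiling division for the grid:
// relu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```

Swap the body for `in[i] * scale`, `in[i] + bias[i]`, or any other per-element expression and the structure doesn't change — that's why GPUs chew through these ops so easily.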