CUDA Programming Matrix Operations — The Core of AI
Exercise 39

cuBLAS: Production MatMul 20 XP Medium


🏭 Chapter 8, Part 4: cuBLAS — Let the Experts Handle It

💡 Story: You've learned to write matrix multiply from scratch — impressive! But for real work, NVIDIA's engineers have spent years hyper-optimizing matmul using every trick imaginable: Tensor Cores, double-buffering, register blocking, auto-tuning. Their result is cuBLAS — 10-100x faster than any hand-written kernel. Knowing it exists and how to call it is a crucial professional skill.

```
#include <cublas_v2.h>
#include <cuda_runtime.h>

// cuBLAS SGEMM (Single-precision GEneral Matrix Multiply)
// Computes: C = alpha * op(A) * op(B) + beta * C
int main() {
    int M = 1024, K = 1024, N = 1024;
    float alpha = 1.0f, beta = 0.0f;

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, M * K * sizeof(float));
    cudaMalloc(&d_B, K * N * sizeof(float));
    cudaMalloc(&d_C, M * N * sizeof(float));

    // Initialize cuBLAS handle
    cublasHandle_t handle;
    cublasCreate(&handle);

    // SGEMM: C = A × B (alpha=1, beta=0 → pure multiply, no accumulate)
    // NOTE: cuBLAS is column-major! For row-major A × B, compute Bᵀ × Aᵀ
    cublasSgemm(
        handle,
        CUBLAS_OP_N, CUBLAS_OP_N,  // no transpose
        N, M, K,                   // dimensions
        &alpha,
        d_B, N,                    // B first (column-major trick!)
        d_A, K,                    // A second
        &beta,
        d_C, N                     // result C
    );

    // Cleanup
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
// Compile: nvcc -lcublas -o matmul matmul.cu
```

cuBLAS key points:

  • 🏗️ Handle — Create with `cublasCreate(&handle)`, destroy with `cublasDestroy(handle)`
  • 📐 Column-major — cuBLAS assumes Fortran column-major layout; for row-major, swap A and B with transposed dimensions
  • ⚡ Tensor Cores — On Volta+ GPUs, cuBLAS automatically uses Tensor Cores for FP16/BF16 — 8x+ throughput
  • 🔧 SGEMM vs DGEMM — S=float32, D=float64, H=float16, C=complex64, Z=complex128
  • 🤖 PyTorch uses it — `torch.matmul()` and `nn.Linear` call cuBLAS under the hood!
```
// Tensor Core throughput (A100 GPU):
//   Precision   Units          Throughput
//   FP32        Standard        19.5 TFLOPS
//   FP16        Tensor Core    312.0 TFLOPS  ← 16x more!
//   BF16        Tensor Core    312.0 TFLOPS
//   INT8        Tensor Core    624.0 TOPS
//
// This is why AI training uses mixed precision (FP16/BF16)!
// cuBLAS + Tensor Cores is what makes GPUs fast for deep learning
```
📋 Instructions
Print a cuBLAS function reference card and a comparison of matmul approaches:

```
=== cuBLAS & matmul Approaches ===

[Approach Comparison]
  Naive GPU:  O(N^3) global reads, no shared memory
  Tiled GPU:  O(N^3/T) global reads, with shared memory
  cuBLAS:     Tensor Cores, max hardware utilization

[Performance on 1024x1024 (FP32, A100)]
  Naive GPU:    ~50 ms
  Tiled (T=16): ~5 ms
  cuBLAS:       ~0.5 ms
  Speedup: cuBLAS is ~100x faster than naive!

[cuBLAS Quick Reference]
  cublasCreate(&handle)   // Initialize
  cublasSgemm(...)        // Float32 GEMM
  cublasHgemm(...)        // Float16 GEMM (Tensor Cores)
  cublasDgemm(...)        // Float64 GEMM
  cublasDestroy(handle)   // Cleanup

Rule: Use cuBLAS in production, write kernels to learn!
```
Run the code as-is. The key lesson: understanding naive → tiled → cuBLAS progression shows you WHY each optimization matters. In interviews, being able to explain this progression (and the 100x speedup) is impressive. In production code, always use cuBLAS or libraries built on top of it (like PyTorch).
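If you want to reproduce the speedup numbers yourself, use CUDA events rather than CPU timers — GPU work is asynchronous. A sketch (needs an NVIDIA GPU; error checking omitted; `d_A`/`d_B`/`d_C` are device buffers as in the program above):

```
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Time one SGEMM with CUDA events. Warm up first: the initial call
// pays one-time setup costs and would skew the measurement.
float time_sgemm_ms(cublasHandle_t handle, int M, int N, int K,
                    const float *d_A, const float *d_B, float *d_C) {
    float alpha = 1.0f, beta = 0.0f, ms = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call (not timed)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                &alpha, d_B, N, d_A, K, &beta, d_C, N);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                &alpha, d_B, N, d_A, K, &beta, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait for the GPU to finish
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Run it against your naive and tiled kernels too — seeing the three numbers side by side on your own GPU makes the progression stick.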