CUDA Programming Matrix Operations — The Core of AI
Exercise 39

cuBLAS: Production MatMul 20 XP Medium


🏭 Chapter 8, Part 4: cuBLAS — Let the Experts Handle It

💡 Story: You've learned to write matrix multiply from scratch — impressive! But for real work, NVIDIA's engineers have spent years hyper-optimizing matmul using every trick imaginable: Tensor Cores, double-buffering, register blocking, auto-tuning. Their result is cuBLAS — 10-100x faster than any hand-written kernel. Knowing it exists and how to call it is a crucial professional skill.

```
#include <cublas_v2.h>
#include <cuda_runtime.h>

// cuBLAS SGEMM (Single-precision GEneral Matrix Multiply)
// Computes: C = alpha * op(A) * op(B) + beta * C
int main() {
    int M = 1024, K = 1024, N = 1024;
    float alpha = 1.0f, beta = 0.0f;

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, M * K * sizeof(float));
    cudaMalloc(&d_B, K * N * sizeof(float));
    cudaMalloc(&d_C, M * N * sizeof(float));

    // Initialize cuBLAS handle
    cublasHandle_t handle;
    cublasCreate(&handle);

    // SGEMM: C = A × B (alpha=1, beta=0 → pure multiply, no accumulate)
    // NOTE: cuBLAS is column-major! For row-major A × B, compute Bᵀ × Aᵀ
    cublasSgemm(
        handle,
        CUBLAS_OP_N, CUBLAS_OP_N,  // no transpose
        N, M, K,                   // dimensions
        &alpha,
        d_B, N,                    // B first (column-major trick!)
        d_A, K,                    // A second
        &beta,
        d_C, N                     // result C
    );

    // Cleanup
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
// Compile: nvcc -lcublas -o matmul matmul.cu
```

cuBLAS key points:

  • 🏗️ Handle — Create with `cublasCreate(&handle)`, destroy with `cublasDestroy(handle)`
  • 📐 Column-major — cuBLAS assumes Fortran column-major layout; for row-major, swap A and B with transposed dimensions
  • ⚡ Tensor Cores — On Volta+ GPUs, cuBLAS automatically uses Tensor Cores for FP16/BF16 — 8x+ throughput
  • 🔧 SGEMM vs DGEMM — S=float32, D=float64, H=float16, C=complex64, Z=complex128
  • 🤖 PyTorch uses it — `torch.matmul()` and `nn.Linear` call cuBLAS under the hood!
```
// Tensor Core throughput (A100 GPU):
//   Precision   Units          Throughput
//   FP32        Standard        19.5 TFLOPS
//   FP16        Tensor Core    312.0 TFLOPS  ← 16x more!
//   BF16        Tensor Core    312.0 TFLOPS
//   INT8        Tensor Core    624.0 TOPS
//
// This is why AI training uses mixed precision (FP16/BF16)!
// cuBLAS + Tensor Cores is what makes GPUs fast for deep learning
```
📋 Instructions
Print a cuBLAS function reference card and a comparison of matmul approaches:

```
=== cuBLAS & matmul Approaches ===

[Approach Comparison]
  Naive GPU:  O(N^3) global reads, no shared memory
  Tiled GPU:  O(N^3/T) global reads, with shared memory
  cuBLAS:     Tensor Cores, max hardware utilization

[Performance on 1024x1024 (FP32, A100)]
  Naive GPU:    ~50 ms
  Tiled (T=16): ~5 ms
  cuBLAS:       ~0.5 ms
  Speedup: cuBLAS is ~100x faster than naive!

[cuBLAS Quick Reference]
  cublasCreate(&handle)   // Initialize
  cublasSgemm(...)        // Float32 GEMM
  cublasHgemm(...)        // Float16 GEMM (Tensor Cores)
  cublasDgemm(...)        // Float64 GEMM
  cublasDestroy(handle)   // Cleanup

Rule: Use cuBLAS in production, write kernels to learn!
```
Run the code as-is. The key lesson: understanding naive → tiled → cuBLAS progression shows you WHY each optimization matters. In interviews, being able to explain this progression (and the 100x speedup) is impressive. In production code, always use cuBLAS or libraries built on top of it (like PyTorch).
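If you want to reproduce the speedup numbers yourself, use CUDA events rather than CPU timers — GPU work is asynchronous. A sketch (needs an NVIDIA GPU; error checking omitted; `d_A`/`d_B`/`d_C` are device buffers as in the program above):

```
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Time one SGEMM with CUDA events. Warm up first: the initial call
// pays one-time setup costs and would skew the measurement.
float time_sgemm_ms(cublasHandle_t handle, int M, int N, int K,
                    const float *d_A, const float *d_B, float *d_C) {
    float alpha = 1.0f, beta = 0.0f, ms = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call (not timed)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                &alpha, d_B, N, d_A, K, &beta, d_C, N);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                &alpha, d_B, N, d_A, K, &beta, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait for the GPU to finish
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Run it against your naive and tiled kernels too — seeing the three numbers side by side on your own GPU makes the progression stick.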