🏭 Chapter 8, Part 4: cuBLAS — Let the Experts Handle It
💡 Story: You've learned to write matrix multiply from scratch — impressive! But for real work, NVIDIA's engineers have spent years hyper-optimizing matmul using every trick imaginable: Tensor Cores, double-buffering, register blocking, auto-tuning. The result is cuBLAS — typically 10-100x faster than a hand-written kernel like ours. Knowing it exists and how to call it is a crucial professional skill.
```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// cuBLAS SGEMM (Single-precision GEneral Matrix Multiply)
// Computes: C = alpha * op(A) * op(B) + beta * C
int main() {
    int M = 1024, K = 1024, N = 1024;
    float alpha = 1.0f, beta = 0.0f;

    // Allocate device memory (in real code, also cudaMemcpy the input
    // matrices in and check every cuda*/cublas* return status)
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, M * K * sizeof(float));
    cudaMalloc(&d_B, K * N * sizeof(float));
    cudaMalloc(&d_C, M * N * sizeof(float));

    // Initialize cuBLAS handle
    cublasHandle_t handle;
    cublasCreate(&handle);

    // SGEMM: C = A × B (alpha=1, beta=0 → pure multiply, no accumulate)
    // NOTE: cuBLAS is column-major! For row-major A × B, compute Bᵀ × Aᵀ:
    // pass B first with swapped dimensions, and row-major C falls out.
    cublasSgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,  // no transpose
                N, M, K,                   // dimensions (swapped for the trick)
                &alpha,
                d_B, N,                    // B first (column-major trick!)
                d_A, K,                    // A second
                &beta,
                d_C, N);                   // result C, leading dimension N

    // Cleanup
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```

Compile: `nvcc matmul.cu -lcublas -o matmul`
Tensor Core peak throughput (NVIDIA A100):

```
Operation  Precision     Throughput
FP32       CUDA cores     19.5 TFLOPS
FP16       Tensor Core   312.0 TFLOPS  ← 16x FP32!
BF16       Tensor Core   312.0 TFLOPS
INT8       Tensor Core   624.0 TOPS
```

This is why AI training uses mixed precision (FP16/BF16)! cuBLAS running on Tensor Cores is what makes GPUs fast for deep learning.
📋 Instructions
Print a cuBLAS function reference card and a comparison of matmul approaches:
```
=== cuBLAS & matmul Approaches ===
[Approach Comparison]
Naive GPU: O(N^3) global reads, no shared memory
Tiled GPU: O(N^3/T) global reads, with shared memory
cuBLAS: Tensor Cores, max hardware utilization
[Performance on 1024x1024 (FP32, A100)]
Naive GPU: ~50 ms
Tiled (T=16): ~5 ms
cuBLAS: ~0.5 ms
Speedup: cuBLAS is ~100x faster than naive!
[cuBLAS Quick Reference]
cublasCreate(&handle) // Initialize
cublasSgemm(...) // Float32 GEMM
cublasHgemm(...) // Float16 GEMM (Tensor Cores)
cublasDgemm(...) // Float64 GEMM
cublasDestroy(handle) // Cleanup
Rule: Use cuBLAS in production, write kernels to learn!
```
Run the code as-is. The key lesson: understanding the naive → tiled → cuBLAS progression shows you WHY each optimization matters. In interviews, being able to explain this progression (and the ~100x speedup) is impressive. In production code, always use cuBLAS or libraries built on top of it (like PyTorch).