🌊 Chapter 9, Part 1: CUDA Streams — Running Multiple Operations at Once
💡 Story: Imagine the GPU has multiple 'assembly lines' (streams). Without streams, everything goes down ONE line — load data, then compute, then copy back, in strict sequence. With streams, you can load batch 2 on line 2 WHILE batch 1 is computing on line 1, WHILE batch 0's results copy back on line 3. Real parallel pipelining!
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// WITHOUT streams (sequential — everything goes through the default stream):
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
kernel1<<<grid, block>>>(d_A, out1); // Blocks until the memcpy finishes!
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
kernel2<<<grid, block>>>(d_B, out2); // Blocks again — no overlap anywhere
// WITH streams (overlapped!):
cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream1); // Async — returns immediately!
cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream2); // Can overlap with stream1's copy!
kernel1<<<grid, block, 0, stream1>>>(d_A, out1); // 4th launch parameter = stream!
kernel2<<<grid, block, 0, stream2>>>(d_B, out2); // Can run concurrently with kernel1!
// Caveat: h_A and h_B must be pinned host memory (cudaMallocHost), or the
// "async" copies silently fall back to synchronous and nothing overlaps.
// Synchronize when needed:
cudaStreamSynchronize(stream1); // Wait for stream1 to finish
cudaStreamSynchronize(stream2); // Wait for stream2 to finish
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
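One detail the snippet above glosses over: cudaMemcpyAsync only overlaps with kernels when the host buffers are pinned (page-locked). Here's a minimal, self-contained sketch of the full lifecycle — the `scale` kernel and buffer names are illustrative stand-ins, not part of the original example:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;          // trivial stand-in kernel
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *h_A, *d_A;
    cudaMallocHost(&h_A, bytes);         // pinned host memory — required for true async copies
    cudaMalloc(&d_A, bytes);
    for (int i = 0; i < N; ++i) h_A[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy, compute, and copy back, all enqueued on one stream: they run
    // in order within the stream, but asynchronously w.r.t. the host.
    cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(N + 255) / 256, 256, 0, stream>>>(d_A, N);
    cudaMemcpyAsync(h_A, d_A, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);       // host waits once, here — not at every call

    printf("h_A[0] = %f\n", h_A[0]);     // each element was doubled to 2.0

    cudaStreamDestroy(stream);
    cudaFree(d_A);
    cudaFreeHost(h_A);                   // pinned memory has its own free
    return 0;
}
```

Note the asymmetry: within one stream, operations still execute in submission order — streams give you ordering inside and concurrency between.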
Run the code as-is. The speedup comes from overlapping copies and kernels across streams — roughly 1.5x for this two-stream example, though the exact number depends on your hardware and on how copy time compares to compute time. With more streams and a longer pipeline, total time approaches the time of the longest single stage, so the gain here tops out near 2x since the two paths are fully independent. Real GPU training frameworks like PyTorch use streams extensively, for example to overlap host-to-device data transfers with ongoing compute.
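The "longer pipeline" idea usually takes a concrete form: split one large problem into chunks and cycle the chunks through a small pool of streams, so that chunk k+1's upload overlaps chunk k's kernel. A hedged sketch — the stream count, the `scale` kernel, and the assumption that N divides evenly by the stream count are all illustrative choices:

```cuda
const int NSTREAMS = 4;
cudaStream_t streams[NSTREAMS];
for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

const int chunk = N / NSTREAMS;           // assumes N % NSTREAMS == 0
for (int s = 0; s < NSTREAMS; ++s) {
    int offset = s * chunk;
    // Each chunk lives on its own stream: its H2D copy can overlap the
    // previous chunk's kernel, and its kernel overlaps the next chunk's copy.
    cudaMemcpyAsync(d_A + offset, h_A + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_A + offset, chunk);
    cudaMemcpyAsync(h_A + offset, d_A + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
for (int s = 0; s < NSTREAMS; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaStreamDestroy(streams[s]);
}
```

As before, h_A must be pinned for the overlap to actually happen. The GPU has a limited number of copy engines (often one per direction), so past a handful of streams the extra concurrency stops paying off.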