CUDA Programming Streams & Async — True Concurrency
Exercise 41

CUDA Streams (20 XP, Medium)


🌊 Chapter 9, Part 1: CUDA Streams — Running Multiple Operations at Once

💡 Story: Imagine the GPU has multiple 'assembly lines' (streams). Without streams, everything goes down ONE line — load data, then compute, then copy back, in strict sequence. With streams, you can load batch 2 on line 2 WHILE batch 1 is computing on line 1, WHILE batch 0's results are copying back on line 0. Real parallel pipelining!

```
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// WITHOUT streams (sequential — everything goes through default stream 0):
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
kernel1<<<grid, block>>>(d_A, out1);               // Must wait for the memcpy above to finish first!
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
kernel2<<<grid, block>>>(d_B, out2);

// WITH streams (overlapped!):
cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream1); // Async!
cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream2); // Concurrent!
kernel1<<<grid, block, 0, stream1>>>(d_A, out1);   // 4th launch argument = stream!
kernel2<<<grid, block, 0, stream2>>>(d_B, out2);   // Runs concurrently with kernel1!

// Synchronize when needed:
cudaStreamSynchronize(stream1); // Wait for stream1 to finish
cudaStreamSynchronize(stream2); // Wait for stream2 to finish

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```
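One detail the snippet above glosses over: cudaMemcpyAsync only overlaps copies with kernel execution when the host buffers are pinned (page-locked); copies from ordinary pageable memory are staged and generally will not overlap with compute. Below is a minimal sketch of allocating pinned buffers with cudaMallocHost; the element count N and float element type are illustrative assumptions, not part of the exercise.

```
// Pinned (page-locked) host memory lets the copy engine transfer data
// while the GPU computes. Sketch only — N and float are assumed.
float *h_A, *h_B;
cudaMallocHost((void**)&h_A, N * sizeof(float));   // pinned allocation instead of malloc
cudaMallocHost((void**)&h_B, N * sizeof(float));

// ... fill h_A / h_B and use cudaMemcpyAsync with these buffers as shown above ...

cudaFreeHost(h_A);   // pinned memory has its own free call
cudaFreeHost(h_B);
```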

Stream rules and facts (a combined sketch follows this list):

  • 🌊 Default stream (stream 0) — Synchronizes with ALL other streams; use named streams for concurrency
  • 📋 In-order per stream — Within one stream, operations execute in order (memcpy → kernel → memcpy)
  • 🔀 Between streams — Operations can overlap if GPU has available resources
  • 🔒 cudaStreamSynchronize — CPU waits for a specific stream to finish
  • 🔒 cudaDeviceSynchronize — CPU waits for ALL streams on the device
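To see these rules working together, here is a minimal sketch of the standard chunked-pipeline pattern: each chunk's H2D → kernel → D2H sequence stays in order inside its own stream, while chunks assigned to different streams are free to overlap. The names NUM_STREAMS, N, chunk, process, h_in/h_out, and d_in/d_out are assumptions for illustration, not part of the exercise.

```
#define NUM_STREAMS 4
// Sketch: split N float elements into chunks and pipeline them across streams.
// Assumes h_in/h_out are pinned host buffers and d_in/d_out are device buffers.
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreate(&streams[i]);

int chunk = N / NUM_STREAMS;   // assume N divides evenly, for simplicity
for (int i = 0; i < NUM_STREAMS; ++i) {
    int offset = i * chunk;
    // In-order WITHIN this stream: H2D -> kernel -> D2H for this chunk...
    cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_in + offset, d_out + offset, chunk);
    cudaMemcpyAsync(h_out + offset, d_out + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
    // ...but chunks in DIFFERENT streams can overlap with each other.
}

cudaDeviceSynchronize();       // CPU waits for every stream to drain
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamDestroy(streams[i]);
```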
📋 Instructions
Simulate a 2-stream pipeline showing the execution timeline:

```
=== CUDA Streams: 2-Stream Pipeline ===

Without Streams (Sequential Timeline):
[H2D: A] -> [Kernel A] -> [D2H: A] -> [H2D: B] -> [Kernel B] -> [D2H: B]
Total time: 6 units

With 2 Streams (Pipelined Timeline):
Stream 1: [H2D: A] -> [Kernel A] -> [D2H: A]
Stream 2: [H2D: B] -> [Kernel B] -> [D2H: B]

Timeline (each unit = 1 time slot):
t=0: Stream1: H2D:A
t=1: Stream1: Kernel:A | Stream2: H2D:B
t=2: Stream1: D2H:A    | Stream2: Kernel:B
t=3: Stream2: D2H:B

Total time: 4 units
Speedup: 1.50x
```
Run the code as-is. The 1.5x speedup comes from overlapping operations across streams: the sequential version needs 6 time units while the pipelined version needs 4, and 6 / 4 = 1.5. With more batches kept in flight on the same two streams, the pipeline stays full and the speedup approaches 2x (the two streams run in parallel); more streams and more copy/compute engines can push it further. Real GPU training frameworks like PyTorch use streams extensively to overlap host-to-device data transfers with computation, e.g., prefetching the next batch while the current one is being processed.
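If you later want to measure the overlap on real hardware rather than in this simulation, CUDA events are the usual timing tool. A minimal sketch follows, assuming a hypothetical run_streamed() helper that issues the two-stream copies and kernels shown earlier; time the sequential version the same way and compare.

```
// Sketch: time a region of GPU work with CUDA events.
// run_streamed() is a hypothetical helper wrapping the two-stream code above.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);        // recorded in the default stream
run_streamed();                // issue async copies + kernels on stream1/stream2
cudaDeviceSynchronize();       // make sure all streams finished before stopping the clock
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Streamed version: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```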