CUDA Programming Streams & Async — True Concurrency
Exercise 43

Overlapping Compute & Transfer (25 XP, Hard)


🔄 Chapter 9, Part 3: Overlapping — The Ultimate GPU Pipeline

💡 Story: The GPU has separate engines: Copy Engine (for memcpy) and Compute Engine (for kernels). They can run SIMULTANEOUSLY! While your kernel crunches numbers on batch N, the copy engine is already streaming batch N+1 from CPU to GPU. Meanwhile batch N-1's results are being copied back. Three things at once — that's the triple-overlap pipeline. This is how TensorRT and fast inference servers work.

```c
// Triple-overlap pipeline: H2D + Kernel + D2H running concurrently
#define NUM_BATCHES 4
#define BATCH_SIZE  1024

// Note: h_in/h_out must be pinned host memory (cudaMallocHost) for
// cudaMemcpyAsync to actually overlap with compute — with pageable
// memory the copies fall back to synchronous behavior.
void asyncPipeline(float* h_in, float* h_out, float* d_in, float* d_out) {
    cudaStream_t streams[NUM_BATCHES];
    for (int i = 0; i < NUM_BATCHES; i++)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NUM_BATCHES; i++) {
        size_t offset = i * BATCH_SIZE;
        size_t bytes  = BATCH_SIZE * sizeof(float);

        // H2D: load batch i into GPU
        cudaMemcpyAsync(d_in + offset, h_in + offset, bytes,
                        cudaMemcpyHostToDevice, streams[i]);

        // Kernel: process batch i
        processKernel<<<BATCH_SIZE / 256, 256, 0, streams[i]>>>(
            d_in + offset, d_out + offset, BATCH_SIZE);

        // D2H: copy results for batch i back
        cudaMemcpyAsync(h_out + offset, d_out + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[i]);

        // Each stream handles one batch independently:
        // Stream i: [H2D_i] -> [Kernel_i] -> [D2H_i]
        // All streams can overlap with each other!
    }

    for (int i = 0; i < NUM_BATCHES; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

GPU engine concurrency:

  • ⚙️ Compute Engine — Executes CUDA kernels
  • 📤 H2D Copy Engine — Handles Host→Device transfers
  • 📥 D2H Copy Engine — Handles Device→Host transfers (separate from H2D!)
  • 🚦 All 3 can run at once — Compute + H2D + D2H simultaneously on modern GPUs
  • 📈 Real speedup — For transfer-heavy workloads: up to 3x throughput vs sequential
📋 Instructions
Simulate a 4-batch overlapping pipeline, printing the time units for each stage:

```
=== Overlapping Compute & Transfer Pipeline ===
Batches: 4, Each stage: 1 time unit

Sequential (no overlap):
[H2D0][K0][D2H0][H2D1][K1][D2H1][H2D2][K2][D2H2][H2D3][K3][D2H3]
Total: 12 time units

Pipelined (with streams):
t=0: [H2D0]
t=1: [K0] [H2D1]
t=2: [D2H0] [K1] [H2D2]
t=3: [D2H1] [K2] [H2D3]
t=4: [D2H2] [K3]
t=5: [D2H3]
Total: 6 time units
Speedup: 2.00x
```
The pipeline fills up in steps — first just H2D, then H2D + Kernel overlap, then all three stages overlap simultaneously. Total time = num_batches + num_stages - 1 = 4 + 3 - 1 = 6 time units, versus num_batches × num_stages = 12 sequentially. This is the classic pipeline formula: as the number of batches grows, the speedup approaches num_stages (3x here), which is why throughput improves dramatically for large workloads.
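One possible sketch of the simulation (a minimal solution attempt, not the official one — stage names and layout chosen to match the expected output above): batch b executes stage s at time t = b + s, so each time slot simply collects every (batch, stage) pair with that sum.

```python
# Simulate a 4-batch triple-overlap pipeline: H2D -> Kernel -> D2H,
# each stage taking 1 time unit.
NUM_BATCHES = 4
STAGES = ["H2D", "K", "D2H"]

print("=== Overlapping Compute & Transfer Pipeline ===")
print(f"Batches: {NUM_BATCHES}, Each stage: 1 time unit\n")

# Sequential: every stage of every batch runs back-to-back.
seq_total = NUM_BATCHES * len(STAGES)
seq = "".join(f"[{s}{b}]" for b in range(NUM_BATCHES) for s in STAGES)
print("Sequential (no overlap):")
print(seq)
print(f"Total: {seq_total} time units\n")

# Pipelined: batch b is in stage s at time t = b + s.
pipe_total = NUM_BATCHES + len(STAGES) - 1
print("Pipelined (with streams):")
for t in range(pipe_total):
    active = []
    for b in range(NUM_BATCHES):
        s = t - b  # which stage (if any) batch b runs at time t
        if 0 <= s < len(STAGES):
            active.append(f"[{STAGES[s]}{b}]")
    print(f"t={t}: " + " ".join(active))
print(f"Total: {pipe_total} time units")
print(f"Speedup: {seq_total / pipe_total:.2f}x")
```

The `t = b + s` indexing is the whole trick: it encodes "batch b starts one time unit after batch b-1", which is exactly what independent streams give you on hardware.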