🔄 Chapter 9, Part 3: Overlapping — The Ultimate GPU Pipeline
💡 Story: The GPU has separate hardware engines: copy engines (for memcpy) and a compute engine (for kernels). They can run SIMULTANEOUSLY! While your kernel crunches numbers on batch N, a copy engine is already streaming batch N+1 from CPU to GPU. Meanwhile batch N-1's results are being copied back (most discrete GPUs have two copy engines, one per direction, which is what makes this possible). Three things at once — that's the triple-overlap pipeline. This is how TensorRT and fast inference servers keep the GPU busy.
// Triple-overlap pipeline: H2D + kernel + D2H running concurrently.
// NOTE: h_in and h_out must be pinned host memory (cudaMallocHost /
// cudaHostAlloc). With pageable memory, cudaMemcpyAsync silently falls
// back to synchronous behavior and the overlap disappears.
#define NUM_BATCHES 4
#define BATCH_SIZE 1024

void asyncPipeline(float* h_in, float* h_out, float* d_in, float* d_out) {
    cudaStream_t streams[NUM_BATCHES];
    for (int i = 0; i < NUM_BATCHES; i++)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NUM_BATCHES; i++) {
        size_t offset = i * BATCH_SIZE;
        size_t bytes  = BATCH_SIZE * sizeof(float);

        // H2D: load batch i onto the GPU
        cudaMemcpyAsync(d_in + offset, h_in + offset, bytes,
                        cudaMemcpyHostToDevice, streams[i]);

        // Kernel: process batch i (1024/256 = 4 blocks of 256 threads)
        processKernel<<<BATCH_SIZE / 256, 256, 0, streams[i]>>>(
            d_in + offset, d_out + offset, BATCH_SIZE);

        // D2H: copy batch i's results back
        cudaMemcpyAsync(h_out + offset, d_out + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[i]);

        // Within stream i the three ops run in issue order:
        //   [H2D_i] -> [Kernel_i] -> [D2H_i]
        // but different streams are free to overlap with each other!
    }

    for (int i = 0; i < NUM_BATCHES; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
📋 Instructions
Simulate a 4-batch overlapping pipeline, showing time units for each stage:
```
=== Overlapping Compute & Transfer Pipeline ===
Batches: 4, Each stage: 1 time unit
Sequential (no overlap):
[H2D0][K0][D2H0][H2D1][K1][D2H1][H2D2][K2][D2H2][H2D3][K3][D2H3]
Total: 12 time units
Pipelined (with streams):
t=0: [H2D0]
t=1: [K0  ] [H2D1]
t=2: [D2H0] [K1  ] [H2D2]
t=3: [D2H1] [K2  ] [H2D3]
t=4: [D2H2] [K3  ]
t=5: [D2H3]
Total: 6 time units
Speedup: 2.00x
```
The pipeline fills up in steps — first just H2D, then H2D+Kernel overlap, then all three stages run simultaneously. Total time = num_batches + num_stages - 1 = 4 + 3 - 1 = 6 time units. This is the classic pipeline formula: the fill-and-drain cost (num_stages - 1) is fixed, so as the batch count grows the speedup approaches the number of stages (3x here), well beyond the 2.00x seen with only 4 batches.