🔄 Chapter 9, Part 3: Overlapping — The Ultimate GPU Pipeline
💡 Story: The GPU has separate hardware engines: copy engines (for memcpy) and a compute engine (for kernels). They can run SIMULTANEOUSLY! While your kernel crunches numbers on batch N, a copy engine is already streaming batch N+1 from CPU to GPU. Meanwhile batch N-1's results are being copied back (most discrete GPUs have two copy engines, one per direction, which is what makes this possible). Three things at once — that's the triple-overlap pipeline. This is how TensorRT and fast inference servers keep the GPU busy.
// Triple-overlap pipeline: H2D + kernel + D2H running concurrently.
// NOTE: h_in and h_out must be pinned host memory (cudaMallocHost /
// cudaHostAlloc). With pageable memory, cudaMemcpyAsync silently falls
// back to synchronous behavior and the overlap disappears.
#define NUM_BATCHES 4
#define BATCH_SIZE 1024

void asyncPipeline(float* h_in, float* h_out, float* d_in, float* d_out) {
    cudaStream_t streams[NUM_BATCHES];
    for (int i = 0; i < NUM_BATCHES; i++)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NUM_BATCHES; i++) {
        size_t offset = i * BATCH_SIZE;
        size_t bytes  = BATCH_SIZE * sizeof(float);

        // H2D: load batch i onto the GPU
        cudaMemcpyAsync(d_in + offset, h_in + offset, bytes,
                        cudaMemcpyHostToDevice, streams[i]);

        // Kernel: process batch i (1024/256 = 4 blocks of 256 threads)
        processKernel<<<BATCH_SIZE / 256, 256, 0, streams[i]>>>(
            d_in + offset, d_out + offset, BATCH_SIZE);

        // D2H: copy batch i's results back
        cudaMemcpyAsync(h_out + offset, d_out + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[i]);

        // Within stream i the three ops run in issue order:
        //   [H2D_i] -> [Kernel_i] -> [D2H_i]
        // but different streams are free to overlap with each other!
    }

    for (int i = 0; i < NUM_BATCHES; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
📋 Instructions
Simulate a 4-batch overlapping pipeline, showing time units for each stage:
```
=== Overlapping Compute & Transfer Pipeline ===
Batches: 4, Each stage: 1 time unit
Sequential (no overlap):
[H2D0][K0][D2H0][H2D1][K1][D2H1][H2D2][K2][D2H2][H2D3][K3][D2H3]
Total: 12 time units
Pipelined (with streams):
t=0: [H2D0]
t=1: [K0  ] [H2D1]
t=2: [D2H0] [K1  ] [H2D2]
t=3: [D2H1] [K2  ] [H2D3]
t=4: [D2H2] [K3  ]
t=5: [D2H3]
Total: 6 time units
Speedup: 2.00x
```
The pipeline fills up in steps — first just H2D, then H2D+Kernel overlap, then all three stages run simultaneously. Total time = num_batches + num_stages - 1 = 4 + 3 - 1 = 6 time units. This is the classic pipeline formula: the fill-and-drain cost (num_stages - 1) is fixed, so as the batch count grows the speedup approaches the number of stages (3x here), well beyond the 2.00x seen with only 4 batches.