CUDA Programming Streams & Async — True Concurrency
Exercise 42

Async Memcpy & Pinned Memory — 20 XP · Medium


📌 Chapter 9, Part 2: Pinned Memory — The Express Lane to the GPU

💡 Story: Normal CPU memory (pageable) is like regular city roads — the OS can move your data around at any time. When the GPU needs it, the driver must first copy it into a 'staging area' (an internal pinned buffer), THEN send it to the GPU. That's two trips! Pinned memory (page-locked memory) is like a dedicated highway straight to the GPU: no detour, and often close to twice the transfer bandwidth.

```
// Normal (pageable) allocation — no true async copies possible!
float* h_data = (float*)malloc(bytes);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // Synchronous, staged slow path
free(h_data);

// Pinned (page-locked) allocation — direct DMA to GPU!
float* h_pinned;
cudaMallocHost(&h_pinned, bytes);  // Or: cudaHostAlloc with flags

// Now cudaMemcpyAsync is truly non-blocking — it uses the DMA copy engine
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
// CPU continues running here while the GPU DMA engine copies the data!

// Fill the next batch while the GPU processes the current batch
for (int i = 0; i < n_batches; i++) {
    // With a single pinned buffer, wait until the previous async copy out of
    // h_pinned has finished before refilling it — otherwise the CPU write
    // races with the in-flight DMA read
    cudaStreamSynchronize(stream);

    // Fill pinned buffer with the next batch (CPU work)
    prepareBatch(h_pinned, i);

    // Async send to GPU
    cudaMemcpyAsync(d_batch, h_pinned, batchBytes,
                    cudaMemcpyHostToDevice, stream);

    // Launch kernel for the current batch
    processKernel<<<grid, block, 0, stream>>>(d_batch, d_output, batchSize);

    // Async copy the result back (h_result should be pinned too,
    // or this copy degrades to a synchronous one)
    cudaMemcpyAsync(h_result + i * batchSize, d_output, batchBytes,
                    cudaMemcpyDeviceToHost, stream);
}
cudaStreamSynchronize(stream);  // Wait for all queued work to finish
cudaFreeHost(h_pinned);         // Use cudaFreeHost, NOT free()!
```
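With a single pinned buffer, the CPU must wait for each upload to drain before refilling it, which limits overlap. A common remedy is double buffering: two pinned buffers (and matching device buffers and streams), so the CPU fills one while the DMA engine drains the other. The sketch below reuses the placeholder names from the snippet above (`prepareBatch`, `processKernel`, `batchBytes`, etc.) and is illustrative, not a drop-in implementation:

```cuda
// Double-buffered pipeline: batch i uses slot i % 2.
float* h_buf[2];
float* d_buf[2];
cudaStream_t s[2];
for (int j = 0; j < 2; j++) {
    cudaMallocHost(&h_buf[j], batchBytes);  // pinned, so async copies are real
    cudaMalloc(&d_buf[j], batchBytes);      // separate device buffer per slot
    cudaStreamCreate(&s[j]);
}

for (int i = 0; i < n_batches; i++) {
    int b = i % 2;
    // Ensure the PREVIOUS batch using this slot is fully done before reuse
    cudaStreamSynchronize(s[b]);
    prepareBatch(h_buf[b], i);  // CPU refill overlaps the other stream's work
    cudaMemcpyAsync(d_buf[b], h_buf[b], batchBytes,
                    cudaMemcpyHostToDevice, s[b]);
    processKernel<<<grid, block, 0, s[b]>>>(d_buf[b], d_output, batchSize);
}
for (int j = 0; j < 2; j++) {
    cudaStreamSynchronize(s[j]);
    cudaFreeHost(h_buf[j]);
    cudaFree(d_buf[j]);
    cudaStreamDestroy(s[j]);
}
```

Note that `d_output` is still shared here for brevity; a production pipeline would also double-buffer the output (or sequence the result copies) to keep the two streams fully independent.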

Pinned memory trade-offs:

  • Transfer speed — Up to 2x faster H2D/D2H transfers (bypasses staging buffer)
  • 🔒 Never paged out — OS cannot swap it to disk; pages stay locked in physical RAM so the DMA engine can address them directly
  • ⚠️ Limited resource — Too much pinned memory hurts overall system performance
  • 📦 Best use — Allocate once, reuse as a 'transfer buffer' for repeated transfers
  • 🚫 No true async — `cudaMemcpyAsync` with regular malloc'd memory doesn't error; it silently degrades to a staged, effectively synchronous copy
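The allocation snippet above mentions `cudaHostAlloc` with flags as an alternative to `cudaMallocHost`. A brief sketch of the common flags (`bytes` is a placeholder; behavior of mapped memory depends on the device, so check your CUDA version's documentation):

```cuda
float* h_buf;
// cudaHostAllocDefault  — equivalent to cudaMallocHost
// cudaHostAllocPortable — pinned for all CUDA contexts, not just the current one
// cudaHostAllocMapped   — also maps the buffer into the device address space
//                         (zero-copy: kernels can read host memory over PCIe)
cudaHostAlloc(&h_buf, bytes, cudaHostAllocPortable | cudaHostAllocMapped);

// ... use h_buf as a transfer (or zero-copy) buffer ...

cudaFreeHost(h_buf);  // same deallocator as cudaMallocHost
```

`cudaHostAllocWriteCombined` also exists for buffers the CPU only writes and the GPU only reads; it speeds up H2D transfers but makes CPU reads of the buffer very slow.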
📋 Instructions
Print a comparison of pageable vs pinned memory transfer characteristics:

```
=== Pinned vs Pageable Memory Transfer ===

[Pageable (malloc) Memory]
  Allocation:  float* h = malloc(bytes)
  Copy path:   CPU_RAM -> [staging buffer] -> GPU
  Synchronous: YES (cudaMemcpy blocks CPU)
  Transfer BW: ~6 GB/s (PCIe, pageable)

[Pinned (cudaMallocHost) Memory]
  Allocation:   cudaMallocHost(&h, bytes)
  Copy path:    CPU_RAM -> GPU (direct DMA)
  Asynchronous: YES (cudaMemcpyAsync)
  Transfer BW:  ~12 GB/s (PCIe, pinned)

Speedup: 2.00x faster transfers!

[Important Rules]
  cudaMemcpyAsync requires pinned memory
  Free pinned with cudaFreeHost, NOT free()
  Don't over-allocate: OS needs pageable memory too
```
Run the code as-is. Key interview answer: 'cudaMemcpyAsync with pinned memory lets you overlap CPU work with GPU data transfer, essentially making host-side preprocessing free.' This is a fundamental technique in production ML inference systems.