CUDA Programming Streams & Async — True Concurrency
Exercise 42

Async Memcpy & Pinned Memory — 20 XP · Medium


📌 Chapter 9, Part 2: Pinned Memory — The Express Lane to the GPU

💡 Story: Normal CPU memory (pageable) is like regular city roads — the OS can move your data around at any time. When the GPU needs it, the driver must first copy it into a 'staging area' (an internal pinned buffer), THEN send it to the GPU. That's two trips! Pinned memory (page-locked memory) is like a dedicated highway straight to the GPU: no detour, and often close to twice the transfer bandwidth.

```
// Normal (pageable) allocation — no true async copies possible!
float* h_data = (float*)malloc(bytes);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // Synchronous, staged slow path
free(h_data);

// Pinned (page-locked) allocation — direct DMA to GPU!
float* h_pinned;
cudaMallocHost(&h_pinned, bytes);  // Or: cudaHostAlloc with flags

// Now cudaMemcpyAsync is truly non-blocking — it uses the DMA copy engine
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
// CPU continues running here while the GPU DMA engine copies the data!

// Fill the next batch while the GPU processes the current batch
for (int i = 0; i < n_batches; i++) {
    // With a single pinned buffer, wait until the previous async copy out of
    // h_pinned has finished before refilling it — otherwise the CPU write
    // races with the in-flight DMA read
    cudaStreamSynchronize(stream);

    // Fill pinned buffer with the next batch (CPU work)
    prepareBatch(h_pinned, i);

    // Async send to GPU
    cudaMemcpyAsync(d_batch, h_pinned, batchBytes,
                    cudaMemcpyHostToDevice, stream);

    // Launch kernel for the current batch
    processKernel<<<grid, block, 0, stream>>>(d_batch, d_output, batchSize);

    // Async copy the result back (h_result should be pinned too,
    // or this copy degrades to a synchronous one)
    cudaMemcpyAsync(h_result + i * batchSize, d_output, batchBytes,
                    cudaMemcpyDeviceToHost, stream);
}
cudaStreamSynchronize(stream);  // Wait for all queued work to finish
cudaFreeHost(h_pinned);         // Use cudaFreeHost, NOT free()!
```
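With a single pinned buffer, the CPU must wait for each upload to drain before refilling it, which limits overlap. A common remedy is double buffering: two pinned buffers (and matching device buffers and streams), so the CPU fills one while the DMA engine drains the other. The sketch below reuses the placeholder names from the snippet above (`prepareBatch`, `processKernel`, `batchBytes`, etc.) and is illustrative, not a drop-in implementation:

```cuda
// Double-buffered pipeline: batch i uses slot i % 2.
float* h_buf[2];
float* d_buf[2];
cudaStream_t s[2];
for (int j = 0; j < 2; j++) {
    cudaMallocHost(&h_buf[j], batchBytes);  // pinned, so async copies are real
    cudaMalloc(&d_buf[j], batchBytes);      // separate device buffer per slot
    cudaStreamCreate(&s[j]);
}

for (int i = 0; i < n_batches; i++) {
    int b = i % 2;
    // Ensure the PREVIOUS batch using this slot is fully done before reuse
    cudaStreamSynchronize(s[b]);
    prepareBatch(h_buf[b], i);  // CPU refill overlaps the other stream's work
    cudaMemcpyAsync(d_buf[b], h_buf[b], batchBytes,
                    cudaMemcpyHostToDevice, s[b]);
    processKernel<<<grid, block, 0, s[b]>>>(d_buf[b], d_output, batchSize);
}
for (int j = 0; j < 2; j++) {
    cudaStreamSynchronize(s[j]);
    cudaFreeHost(h_buf[j]);
    cudaFree(d_buf[j]);
    cudaStreamDestroy(s[j]);
}
```

Note that `d_output` is still shared here for brevity; a production pipeline would also double-buffer the output (or sequence the result copies) to keep the two streams fully independent.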

Pinned memory trade-offs:

  • Transfer speed — Up to 2x faster H2D/D2H transfers (bypasses staging buffer)
  • 🔒 Never paged out — OS cannot swap it to disk; pages stay locked in physical RAM so the DMA engine can address them directly
  • ⚠️ Limited resource — Too much pinned memory hurts overall system performance
  • 📦 Best use — Allocate once, reuse as a 'transfer buffer' for repeated transfers
  • 🚫 No true async — `cudaMemcpyAsync` with regular malloc'd memory doesn't error; it silently degrades to a staged, effectively synchronous copy
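The allocation snippet above mentions `cudaHostAlloc` with flags as an alternative to `cudaMallocHost`. A brief sketch of the common flags (`bytes` is a placeholder; behavior of mapped memory depends on the device, so check your CUDA version's documentation):

```cuda
float* h_buf;
// cudaHostAllocDefault  — equivalent to cudaMallocHost
// cudaHostAllocPortable — pinned for all CUDA contexts, not just the current one
// cudaHostAllocMapped   — also maps the buffer into the device address space
//                         (zero-copy: kernels can read host memory over PCIe)
cudaHostAlloc(&h_buf, bytes, cudaHostAllocPortable | cudaHostAllocMapped);

// ... use h_buf as a transfer (or zero-copy) buffer ...

cudaFreeHost(h_buf);  // same deallocator as cudaMallocHost
```

`cudaHostAllocWriteCombined` also exists for buffers the CPU only writes and the GPU only reads; it speeds up H2D transfers but makes CPU reads of the buffer very slow.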
📋 Instructions
Print a comparison of pageable vs pinned memory transfer characteristics:

```
=== Pinned vs Pageable Memory Transfer ===

[Pageable (malloc) Memory]
  Allocation:  float* h = malloc(bytes)
  Copy path:   CPU_RAM -> [staging buffer] -> GPU
  Synchronous: YES (cudaMemcpy blocks CPU)
  Transfer BW: ~6 GB/s (PCIe, pageable)

[Pinned (cudaMallocHost) Memory]
  Allocation:   cudaMallocHost(&h, bytes)
  Copy path:    CPU_RAM -> GPU (direct DMA)
  Asynchronous: YES (cudaMemcpyAsync)
  Transfer BW:  ~12 GB/s (PCIe, pinned)

Speedup: 2.00x faster transfers!

[Important Rules]
  cudaMemcpyAsync requires pinned memory
  Free pinned with cudaFreeHost, NOT free()
  Don't over-allocate: OS needs pageable memory too
```
Run the code as-is. Key interview answer: 'cudaMemcpyAsync with pinned memory lets you overlap CPU work with GPU data transfer, essentially making host-side preprocessing free.' This is a fundamental technique in production ML inference systems.