📌 Chapter 9, Part 2: Pinned Memory — The Express Lane to the GPU
💡 Story: Normal CPU memory (pageable) is like regular city roads — the OS can move your data around anytime. When the GPU needs it, the driver must first copy it to a 'staging area' (pinned buffer), THEN send it to the GPU. That's two trips! Pinned memory (page-locked memory) is like a dedicated highway directly to the GPU. No detour, twice as fast!
// Normal (pageable) allocation: with this buffer, cudaMemcpyAsync silently degrades to a synchronous, staged copy
float* h_data = (float*)malloc(bytes);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // Synchronous, slow path
free(h_data);
// Pinned (page-locked) allocation — direct DMA to GPU!
float* h_pinned;
cudaMallocHost(&h_pinned, bytes); // Or: cudaHostAlloc with flags
// Now can use ASYNC memcpy — non-blocking, uses DMA engine
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
// CPU continues running here while GPU DMA engine copies data!
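Beyond the cudaMallocHost default, cudaHostAlloc accepts flags that tune the pinned allocation for specific access patterns. A brief sketch (same `bytes` variable as above):

```
float* h_buf;
// Default pinned allocation — equivalent to cudaMallocHost
cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);
cudaFreeHost(h_buf);
// Write-combined: faster host-to-device DMA, but very slow for the CPU
// to read back — use only for buffers the CPU writes once and never reads
cudaHostAlloc(&h_buf, bytes, cudaHostAllocWriteCombined);
cudaFreeHost(h_buf);
```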
// Fill next batch while GPU processes current batch!
for (int i = 0; i < n_batches; i++) {
  // Wait until the previous async H2D copy from h_pinned has finished,
  // or the CPU would overwrite data still in flight (double buffering
  // removes this stall — see below)
  cudaStreamSynchronize(stream);
  // Fill pinned buffer with next batch (CPU work)
  prepareBatch(h_pinned, i);
  // Async send to GPU
  cudaMemcpyAsync(d_batch, h_pinned, batchBytes, cudaMemcpyHostToDevice, stream);
  // Launch kernel for current batch
  processKernel<<<grid, block, 0, stream>>>(d_batch, d_output, batchSize);
  // Async copy result back (h_result must also be pinned to be truly async)
  cudaMemcpyAsync(h_result + i*batchSize, d_output, batchBytes, cudaMemcpyDeviceToHost, stream);
}
cudaStreamSynchronize(stream); // Wait for all to finish
cudaFreeHost(h_pinned); // Use cudaFreeHost, NOT free()!
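With a single pinned buffer, the sync at the top of the loop limits how much CPU and GPU work can actually overlap. The standard fix is double buffering: two pinned buffers and two streams alternate, so the CPU fills one buffer while the GPU drains the other. A minimal sketch, reusing the hypothetical prepareBatch/processKernel names from above:

```
// Double-buffered pipeline sketch — two pinned buffers, two streams
float *h_pin[2], *d_in[2], *d_out[2];
cudaStream_t s[2];
for (int b = 0; b < 2; b++) {
  cudaMallocHost(&h_pin[b], batchBytes);
  cudaMalloc(&d_in[b], batchBytes);
  cudaMalloc(&d_out[b], batchBytes);
  cudaStreamCreate(&s[b]);
}
for (int i = 0; i < n_batches; i++) {
  int b = i % 2; // alternate buffer/stream each iteration
  // Only wait for the work issued two iterations ago on THIS buffer;
  // the other stream keeps running in the meantime
  cudaStreamSynchronize(s[b]);
  prepareBatch(h_pin[b], i); // CPU fills buffer b while the GPU works on the other
  cudaMemcpyAsync(d_in[b], h_pin[b], batchBytes, cudaMemcpyHostToDevice, s[b]);
  processKernel<<<grid, block, 0, s[b]>>>(d_in[b], d_out[b], batchSize);
  cudaMemcpyAsync(h_result + i*batchSize, d_out[b], batchBytes, cudaMemcpyDeviceToHost, s[b]);
}
for (int b = 0; b < 2; b++) cudaStreamSynchronize(s[b]);
```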
Pinned memory trade-offs: allocation and deallocation are slower than malloc/free, and pinned pages can never be swapped out, so pinning too much RAM starves the OS and other processes of pageable memory.
📋 Instructions
Print a comparison of pageable vs pinned memory transfer characteristics:
```
=== Pinned vs Pageable Memory Transfer ===
[Pageable (malloc) Memory]
Allocation: float* h = malloc(bytes)
Copy path: CPU_RAM -> [staging buffer] -> GPU
Synchronous: YES (cudaMemcpy blocks CPU)
Transfer BW: ~6 GB/s (PCIe, pageable)
[Pinned (cudaMallocHost) Memory]
Allocation: cudaMallocHost(&h, bytes)
Copy path: CPU_RAM -> GPU (direct DMA)
Asynchronous: YES (cudaMemcpyAsync)
Transfer BW: ~12 GB/s (PCIe, pinned)
Speedup: 2.00x faster transfers!
[Important Rules]
cudaMemcpyAsync is only truly async with pinned memory (pageable falls back to a blocking copy)
Free pinned with cudaFreeHost, NOT free()
Don't over-allocate: OS needs pageable memory too
```
Run the code as-is. Key interview answer: 'cudaMemcpyAsync with pinned memory enables overlapping CPU work with GPU data transfer, essentially giving free GPU preprocessing time.' This is a fundamental technique in production ML inference systems.