⏱️ Chapter 9, Part 4: CUDA Events — The GPU's Stopwatch
💡 Story: You want to know: 'How long does my kernel take?' A CPU timer won't work — kernel launches are asynchronous, so the CPU races ahead while the GPU is still running (you'd measure only the launch overhead). You need a GPU-side stopwatch. CUDA events are timestamps recorded IN the GPU command queue. Record an event before your kernel, record another after, compute the difference — that's proper GPU timing.
```
// CUDA Events for GPU timing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record START timestamp in the GPU stream
cudaEventRecord(start, 0);  // second arg = stream (0 = default)

// Launch kernel
myKernel<<<grid, block>>>(d_data, n);

// Record STOP timestamp after kernel
cudaEventRecord(stop, 0);

// CPU must wait for the stop event to be recorded by the GPU
cudaEventSynchronize(stop);

// Get elapsed time in milliseconds
float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.3f ms\n", ms);

// bytes = total bytes the kernel reads + writes (work out per kernel)
double bytes = 2.0 * n * sizeof(float);  // e.g. one read + one write per element
printf("Throughput: %.2f GB/s\n", bytes / ms / 1e6);  // bytes/ms ÷ 1e6 = GB/s

cudaEventDestroy(start);
cudaEventDestroy(stop);
```
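One refinement worth knowing: the first launch of a kernel can include one-time overhead (module loading, cache warm-up), and a single measurement is noisy. A common pattern — sketched here with the same `myKernel`, `start`, and `stop` as above, and an assumed repetition count `NITER` — is to do one untimed warm-up launch, then average many launches between a single start/stop pair:

```
// Sketch: warm up once, then average NITER launches between one event pair.
const int NITER = 100;                 // assumed repetition count
myKernel<<<grid, block>>>(d_data, n);  // warm-up launch (not timed)

cudaEventRecord(start, 0);
for (int i = 0; i < NITER; ++i)
    myKernel<<<grid, block>>>(d_data, n);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float totalMs;
cudaEventElapsedTime(&totalMs, start, stop);
printf("Avg kernel time: %.3f ms\n", totalMs / NITER);
```

Averaging amortizes launch overhead and smooths out run-to-run jitter, which matters when individual kernels run for well under a millisecond.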
Events beyond timing — stream synchronization:
```
// Events can synchronize between streams!
cudaEvent_t checkpoint;
cudaEventCreate(&checkpoint);

// Stream 1 finishes preprocessing:
preprocessKernel<<<g, b, 0, stream1>>>(d_data);
cudaEventRecord(checkpoint, stream1);  // mark completion in stream1

// Stream 2 waits for stream1's checkpoint before running:
cudaStreamWaitEvent(stream2, checkpoint, 0);  // stream2 waits!
mainKernel<<<g, b, 0, stream2>>>(d_data, d_output);
// Now stream2 is guaranteed to run AFTER stream1 finishes preprocessing.
// This is how you create dependencies between streams with fine-grained control!

cudaEventDestroy(checkpoint);
```
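A small optimization for sync-only events: if an event will never be passed to `cudaEventElapsedTime`, you can create it with `cudaEventCreateWithFlags` and the `cudaEventDisableTiming` flag (both are real CUDA Runtime API names), which skips timestamp bookkeeping and makes `cudaStreamWaitEvent` dependencies cheaper:

```
// For events used only for ordering between streams (never for timing),
// disable timestamping to reduce overhead:
cudaEvent_t checkpoint;
cudaEventCreateWithFlags(&checkpoint, cudaEventDisableTiming);
```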
📋 Instructions
Simulate timing a kernel that processes N=1,000,000 floats. Compute bandwidth:
```
=== CUDA Event Timing Simulation ===
Kernel: saxpy (y = a*x + y) on N=1000000 elements
cudaEventRecord(start)
[Kernel executes on GPU...]
cudaEventRecord(stop)
cudaEventSynchronize(stop)
Measured kernel time: 0.720 ms
Data transferred: 2 arrays * 1000000 * 4 bytes = 8.00 MB
Effective bandwidth: 11.11 GB/s
Comparison:
Peak PCIe bandwidth (GPU-CPU): ~16 GB/s
Kernel compute bandwidth: 11.11 GB/s
Utilization: 69.4%
```
bandwidth = bytes / time. For saxpy this exercise counts one read of x and one write of y: 2 × N × 4 bytes. (A stricter count would also include the read of y — the kernel computes y = a*x + y — for 3 × N × 4 bytes; the simulation uses the 2-array figure.) Unit conversion: with time in milliseconds, bytes/ms × (1000 ms/s) ÷ (1e9 bytes/GB) = bytes/ms ÷ 1e6, so dividing bytes by ms and then by 1e6 yields GB/s. CUDA event timing is essential for performance optimization — always measure before optimizing!