CUDA Programming Streams & Async — True Concurrency
Exercise 44

CUDA Events & Timing 20 XP Medium


⏱️ Chapter 9, Part 4: CUDA Events — The GPU's Stopwatch

💡 Story: You want to know: 'How long does my kernel take?' You can't use CPU timers — the CPU doesn't wait for the GPU! You need a GPU-side stopwatch. CUDA events are timestamps recorded IN the GPU command queue. Record an event before your kernel, record another after, compute the difference — that's proper GPU timing.

```
// CUDA Events for GPU timing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record START timestamp in the GPU stream
cudaEventRecord(start, 0);  // second arg = stream (0 = default)

// Launch kernel
myKernel<<<grid, block>>>(d_data, n);

// Record STOP timestamp after the kernel
cudaEventRecord(stop, 0);

// CPU must wait for the stop event to be recorded by the GPU
cudaEventSynchronize(stop);

// Get elapsed time in milliseconds
float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.3f ms\n", ms);
printf("Throughput: %.2f GB/s\n", bytes / ms / 1e6);  // bytes = total bytes moved by the kernel

cudaEventDestroy(start);
cudaEventDestroy(stop);
```
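A quick sanity check of the `bytes / ms / 1e6` conversion used in the printf above, written as a small Python snippet (the 8 MB and 0.72 ms figures are just illustrative numbers):

```python
def to_gbps(total_bytes: float, ms: float) -> float:
    """Convert (bytes moved, elapsed ms) to GB/s.
    (bytes/ms) * (1000 ms/s) / (1e9 bytes/GB) = bytes / ms / 1e6.
    """
    return total_bytes / ms / 1e6

# Example: 8 MB moved in 0.72 ms is about 11.11 GB/s
print(f"{to_gbps(8e6, 0.72):.2f} GB/s")
```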

Events beyond timing — stream synchronization:

```
// Events can synchronize between streams!
cudaEvent_t checkpoint;
cudaEventCreate(&checkpoint);

// Stream 1 finishes preprocessing:
preprocessKernel<<<g, b, 0, stream1>>>(d_data);
cudaEventRecord(checkpoint, stream1);  // Mark completion in stream1

// Stream 2 waits for stream1's checkpoint before running:
cudaStreamWaitEvent(stream2, checkpoint, 0);  // stream2 waits!
mainKernel<<<g, b, 0, stream2>>>(d_data, d_output);
// Now stream2 is guaranteed to run AFTER stream1 finishes preprocessing.
// This is how you create dependencies between streams with fine control!

cudaEventDestroy(checkpoint);
```
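The cross-stream dependency pattern can be sketched by analogy with CPU threads, using Python's `threading.Event` in place of a CUDA event (a simulation of the ordering guarantee, not the CUDA API):

```python
import threading

checkpoint = threading.Event()  # plays the role of the CUDA event
order = []                      # records execution order

def stream1():
    order.append("preprocess")  # preprocessKernel in stream1
    checkpoint.set()            # like cudaEventRecord(checkpoint, stream1)

def stream2():
    checkpoint.wait()           # like cudaStreamWaitEvent(stream2, checkpoint, 0)
    order.append("main")        # mainKernel runs only after preprocessing

# Start stream2 first to show it really does wait for the checkpoint
t2 = threading.Thread(target=stream2); t2.start()
t1 = threading.Thread(target=stream1); t1.start()
t1.join(); t2.join()
print(order)  # ['preprocess', 'main']
```

Even though `stream2` is launched first, it cannot run `main` until `stream1` records the checkpoint, which is exactly the guarantee `cudaStreamWaitEvent` gives you on the GPU.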
  • cudaEventRecord(e, stream) — Insert timestamp marker into a stream
  • cudaEventSynchronize(e) — CPU blocks until GPU reaches that event
  • 📏 cudaEventElapsedTime(&ms, start, stop) — Get time between two events in ms
  • 🔗 cudaStreamWaitEvent(stream, event, 0) — Make a stream wait for an event from another stream
  • 💡 Resolution — CUDA events have ~0.5 microsecond resolution — much finer than CPU timers for GPU work
📋 Instructions
Simulate timing a kernel that processes N=1,000,000 floats and compute the effective bandwidth. Expected output:
```
=== CUDA Event Timing Simulation ===
Kernel: saxpy (y = a*x + y) on N=1000000 elements

cudaEventRecord(start)
  [Kernel executes on GPU...]
cudaEventRecord(stop)
cudaEventSynchronize(stop)

Measured kernel time: 0.720 ms
Data transferred: 2 arrays * 1000000 * 4 bytes = 8.00 MB
Effective bandwidth: 11.11 GB/s

Comparison:
  Peak PCIe bandwidth (GPU-CPU): ~16 GB/s
  Kernel compute bandwidth: 11.11 GB/s
  Utilization: 69.4%
```
bandwidth = bytes / time. This exercise counts saxpy's traffic as reading x and writing y: 2 × N × 4 bytes (a full count would also include the read of y). With time in milliseconds, the conversion is (bytes/ms) × (1000 ms/s) / (1e9 bytes/GB) = bytes / ms / 1e6 GB/s. CUDA event timing is essential for performance optimization — always measure before optimizing!
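A minimal Python sketch of the simulation, using the exercise's assumed values (0.720 ms measured time, 2-array traffic count, ~16 GB/s PCIe peak); the constant names are illustrative, and only the key output lines are reproduced:

```python
# Assumed exercise parameters (not real measurements)
N = 1_000_000
ELEM_BYTES = 4              # sizeof(float)
KERNEL_MS = 0.720           # pretend cudaEventElapsedTime returned this
PCIE_PEAK_GBPS = 16.0       # approximate GPU<->CPU peak

total_bytes = 2 * N * ELEM_BYTES                 # exercise's 2-array count: read x, write y
bandwidth_gbps = total_bytes / KERNEL_MS / 1e6   # bytes/ms -> GB/s
utilization = bandwidth_gbps / PCIE_PEAK_GBPS * 100

print("=== CUDA Event Timing Simulation ===")
print(f"Kernel: saxpy (y = a*x + y) on N={N} elements")
print(f"Measured kernel time: {KERNEL_MS:.3f} ms")
print(f"Data transferred: 2 arrays * {N} * {ELEM_BYTES} bytes = {total_bytes / 1e6:.2f} MB")
print(f"Effective bandwidth: {bandwidth_gbps:.2f} GB/s")
print(f"Utilization: {utilization:.1f}%")
```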