⏱️ Chapter 9, Part 4: CUDA Events — The GPU's Stopwatch
💡 Story: You want to know: 'How long does my kernel take?' A CPU timer won't work — kernel launches are asynchronous, so the CPU races ahead while the GPU is still running (you'd measure only the launch overhead). You need a GPU-side stopwatch. CUDA events are timestamps recorded IN the GPU command queue. Record an event before your kernel, record another after, compute the difference — that's proper GPU timing.
```
// CUDA Events for GPU timing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record START timestamp in the GPU stream
cudaEventRecord(start, 0);  // second arg = stream (0 = default)

// Launch kernel
myKernel<<<grid, block>>>(d_data, n);

// Record STOP timestamp after kernel
cudaEventRecord(stop, 0);

// CPU must wait for the stop event to be recorded by the GPU
cudaEventSynchronize(stop);

// Get elapsed time in milliseconds
float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.3f ms\n", ms);

// bytes = total bytes the kernel reads + writes (work out per kernel)
double bytes = 2.0 * n * sizeof(float);  // e.g. one read + one write per element
printf("Throughput: %.2f GB/s\n", bytes / ms / 1e6);  // bytes/ms ÷ 1e6 = GB/s

cudaEventDestroy(start);
cudaEventDestroy(stop);
```
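One refinement worth knowing: the first launch of a kernel can include one-time overhead (module loading, cache warm-up), and a single measurement is noisy. A common pattern — sketched here with the same `myKernel`, `start`, and `stop` as above, and an assumed repetition count `NITER` — is to do one untimed warm-up launch, then average many launches between a single start/stop pair:

```
// Sketch: warm up once, then average NITER launches between one event pair.
const int NITER = 100;                 // assumed repetition count
myKernel<<<grid, block>>>(d_data, n);  // warm-up launch (not timed)

cudaEventRecord(start, 0);
for (int i = 0; i < NITER; ++i)
    myKernel<<<grid, block>>>(d_data, n);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float totalMs;
cudaEventElapsedTime(&totalMs, start, stop);
printf("Avg kernel time: %.3f ms\n", totalMs / NITER);
```

Averaging amortizes launch overhead and smooths out run-to-run jitter, which matters when individual kernels run for well under a millisecond.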
Events beyond timing — stream synchronization:
```
// Events can synchronize between streams!
cudaEvent_t checkpoint;
cudaEventCreate(&checkpoint);

// Stream 1 finishes preprocessing:
preprocessKernel<<<g, b, 0, stream1>>>(d_data);
cudaEventRecord(checkpoint, stream1);  // mark completion in stream1

// Stream 2 waits for stream1's checkpoint before running:
cudaStreamWaitEvent(stream2, checkpoint, 0);  // stream2 waits!
mainKernel<<<g, b, 0, stream2>>>(d_data, d_output);
// Now stream2 is guaranteed to run AFTER stream1 finishes preprocessing.
// This is how you create dependencies between streams with fine-grained control!

cudaEventDestroy(checkpoint);
```
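A small optimization for sync-only events: if an event will never be passed to `cudaEventElapsedTime`, you can create it with `cudaEventCreateWithFlags` and the `cudaEventDisableTiming` flag (both are real CUDA Runtime API names), which skips timestamp bookkeeping and makes `cudaStreamWaitEvent` dependencies cheaper:

```
// For events used only for ordering between streams (never for timing),
// disable timestamping to reduce overhead:
cudaEvent_t checkpoint;
cudaEventCreateWithFlags(&checkpoint, cudaEventDisableTiming);
```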
📋 Instructions
Simulate timing a kernel that processes N=1,000,000 floats. Compute bandwidth:
```
=== CUDA Event Timing Simulation ===
Kernel: saxpy (y = a*x + y) on N=1000000 elements
cudaEventRecord(start)
[Kernel executes on GPU...]
cudaEventRecord(stop)
cudaEventSynchronize(stop)
Measured kernel time: 0.720 ms
Data transferred: 2 arrays * 1000000 * 4 bytes = 8.00 MB
Effective bandwidth: 11.11 GB/s
Comparison:
Peak PCIe bandwidth (GPU-CPU): ~16 GB/s
Kernel compute bandwidth: 11.11 GB/s
Utilization: 69.4%
```
bandwidth = bytes / time. For saxpy this exercise counts one read of x and one write of y: 2 × N × 4 bytes. (A stricter count would also include the read of y — the kernel computes y = a*x + y — for 3 × N × 4 bytes; the simulation uses the 2-array figure.) Unit conversion: with time in milliseconds, bytes/ms × (1000 ms/s) ÷ (1e9 bytes/GB) = bytes/ms ÷ 1e6, so dividing bytes by ms and then by 1e6 yields GB/s. CUDA event timing is essential for performance optimization — always measure before optimizing!