CUDA Programming: The Grand Finale — GPU Master
Exercise 46

CUDA Best Practices · 20 XP · Medium


Chapter 10, Part 1: Best Practices — The General's Battle Checklist

💡 Story: You've trained through 9 chapters of CUDA warfare. Now the General reviews the complete battle checklist before every mission. These are the rules that separate a junior CUDA programmer from a senior GPU engineer. Internalize them — they will save your performance every single time.

  • 1️⃣ Minimize CPU↔GPU transfers — Keep data on the GPU as long as possible; a PCIe copy is often ~100x slower than accessing the same data in GPU memory
  • 2️⃣ Use coalesced memory access — Stride-1 access whenever possible; strided or misaligned access wastes bandwidth
  • 3️⃣ Maximize occupancy — 128-256 threads/block; keep register and shared-memory usage low so more blocks fit on each SM
  • 4️⃣ Use shared memory — For any data accessed multiple times per block
  • 5️⃣ Avoid warp divergence — Group uniform work in same warp; align branches to warp boundaries
  • 6️⃣ Use asynchronous operations — Streams + pinned memory to overlap compute and transfer
  • 7️⃣ Prefer libraries — cuBLAS, cuDNN, cuFFT, Thrust over custom kernels for standard operations
  • 8️⃣ Profile first, optimize second — Never guess! Use Nsight Systems to find the REAL bottleneck
  • 9️⃣ Check errors — Every CUDA API call can fail. Always check return codes
  • 🔟 Use unified memory for prototyping — cudaMallocManaged() for quick dev, then optimize transfers
```
// ✅ CUDA Error Checking — Always do this!
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Usage:
CUDA_CHECK(cudaMalloc(&d_data, bytes));
CUDA_CHECK(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice));
kernel<<<grid, block>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());       // Check for kernel launch errors
CUDA_CHECK(cudaDeviceSynchronize());  // Check for kernel execution errors
```
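The same fail-fast idea works in any language. Here is a hypothetical Python sketch where `check` plays the role of `CUDA_CHECK`; `cuda_malloc` is an invented stand-in for a status-returning API, not a real binding:

```python
# Fail-fast error checking: wrap every call that returns a status code
# and stop immediately with context, instead of silently ignoring it.

SUCCESS = 0

def check(err, what):
    """Raise immediately if a status code signals failure."""
    if err != SUCCESS:
        raise RuntimeError(f"error in {what}: code {err}")

def cuda_malloc(nbytes):
    # Hypothetical stand-in: "fails" on a zero-byte request.
    return SUCCESS if nbytes > 0 else 1

check(cuda_malloc(1024), "cuda_malloc")   # passes silently
try:
    check(cuda_malloc(0), "cuda_malloc")  # the zero-byte request fails here
except RuntimeError as e:
    print(e)
```

The point mirrors rule 9️⃣ above: every call gets checked at the call site, so the first failure is reported where it happened, not thousands of lines later.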
📋 Instructions
Print the complete CUDA best practices checklist with a self-assessment score:

```
=== CUDA Best Practices Checklist ===
[Memory]
[x] Minimize CPU<->GPU transfers
[x] Coalesced memory access (stride-1)
[x] Use shared memory for reused data
[x] Pinned memory for async transfers
[Execution]
[x] 128-256 threads per block
[x] Avoid warp divergence
[x] Maximize occupancy
[x] Use streams for parallelism
[Code Quality]
[x] Check all CUDA error codes
[x] Use libraries (cuBLAS, cuDNN)
[x] Profile before optimizing
CUDA Engineer Score: 11/11
Status: GENERAL-LEVEL!
```
Run the code to print your checklist. Before submitting any CUDA code — in a project, assignment, or interview — run through this list mentally. Every item on this list has caused a real production GPU performance bug at some point in history!
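One way the printer could look — a minimal plain-Python sketch; the dictionary layout and variable names are illustrative, not part of the exercise starter:

```python
# Self-assessment checklist printer for the CUDA best practices exercise.
# Each entry maps a category to its checklist items; all are marked done here.
CHECKLIST = {
    "Memory": [
        "Minimize CPU<->GPU transfers",
        "Coalesced memory access (stride-1)",
        "Use shared memory for reused data",
        "Pinned memory for async transfers",
    ],
    "Execution": [
        "128-256 threads per block",
        "Avoid warp divergence",
        "Maximize occupancy",
        "Use streams for parallelism",
    ],
    "Code Quality": [
        "Check all CUDA error codes",
        "Use libraries (cuBLAS, cuDNN)",
        "Profile before optimizing",
    ],
}

print("=== CUDA Best Practices Checklist ===")
total = 0
for category, items in CHECKLIST.items():
    print(f"[{category}]")
    for item in items:
        print(f"[x] {item}")
        total += 1
print(f"CUDA Engineer Score: {total}/{total}")
print("Status: GENERAL-LEVEL!")
```

Counting the items as you print them (rather than hard-coding 11) keeps the score correct if you ever edit the checklist.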