CUDA Programming: Sync or Chaos!
Exercise 23

Atomic Operations · 20 XP · Medium


⚛️ Chapter 5, Part 3: Atomic Operations — The Unbreakable Read-Modify-Write

💡 Story: The only way to safely let all soldiers update the shared scoreboard is to give one soldier at a time exclusive control. Atomic operations are hardware-guaranteed to complete as a single, uninterruptible step — no thread can interfere!

CUDA's atomic intrinsics — the exact solution to race conditions:

```
#include <cuda_runtime.h>

// CUDA Atomic Operations (work in global OR shared memory)
atomicAdd(ptr, val);           // *ptr += val  (SAFE! Atomic!)
atomicSub(ptr, val);           // *ptr -= val
atomicMax(ptr, val);           // *ptr = max(*ptr, val)
atomicMin(ptr, val);           // *ptr = min(*ptr, val)
atomicExch(ptr, val);          // old = *ptr; *ptr = val; return old
atomicCAS(ptr, compare, val);  // Compare-And-Swap
atomicAnd(ptr, val);           // *ptr &= val (bitwise AND)
atomicOr(ptr, val);            // *ptr |= val (bitwise OR)
atomicXor(ptr, val);           // *ptr ^= val (bitwise XOR)

// All atomic functions return the OLD value:
int old = atomicAdd(ptr, 5);   // old = *ptr before adding 5
```
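The "returns the OLD value" contract is easy to model on the host. Below is a minimal Python sketch (the `AtomicInt` class is my own toy model, not a CUDA or standard-library API): a lock plays the role of the hardware guarantee, making the read-modify-write one indivisible step.

```python
import threading

class AtomicInt:
    """Toy host-side model of a counter updated like a hardware atomic."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def atomic_add(self, val):
        # The lock makes read-modify-write indivisible, and the OLD value
        # is returned, mirroring CUDA's atomicAdd semantics.
        with self._lock:
            old = self.value
            self.value = old + val
            return old

counter = AtomicInt()

def hammer():
    for _ in range(1000):
        counter.atomic_add(1)

threads = [threading.Thread(target=hammer) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value)  # 8000 every run; a plain unlocked "+=" could lose updates
```

With 8 threads each adding 1 a thousand times, the total is always exactly 8000 — no increments are ever lost, which is precisely the guarantee CUDA atomics give on the GPU.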

Real example — Histogram using atomics:

```
__global__ void histogram(int* data, int* hist, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        int bucket = data[i] % 256;   // Which histogram bucket?
        atomicAdd(&hist[bucket], 1);  // SAFE: multiple threads can update different/same buckets
    }
}
// Without atomicAdd, multiple threads modifying hist[bucket] simultaneously
// would cause a race condition and give wrong counts!
```
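Since this course's exercises simulate kernels in Python, here is a rough host-side analogy of the same kernel. Python threads stand in for CUDA threads and a `threading.Lock` stands in for the hardware atomic; `build_histogram` and the chunking scheme are my own illustration, not the exercise's provided code.

```python
import threading

def build_histogram(data, num_buckets=10, num_workers=4):
    hist = [0] * num_buckets
    lock = threading.Lock()              # stands in for the hardware atomic

    def worker(chunk):
        for v in chunk:
            with lock:                   # analogue of atomicAdd(&hist[bucket], 1)
                hist[v % num_buckets] += 1

    # Split the data across "threads" (round-robin chunks)
    chunks = [data[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return hist

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
hist = build_histogram(data)
for v in range(10):
    if hist[v]:
        print(f"Value {v}: {hist[v]} occurrences")
```

Because every increment happens under the lock, the final counts are correct no matter how the four workers interleave — the same property the real kernel gets from `atomicAdd`.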

⚠️ Performance consideration:

  • Good — Different threads update DIFFERENT memory locations
  • ⚠️ Slow — Many threads updating the SAME location (serialized!)
  • 🔧 Solution — Use per-block partial results in shared memory first, then atomically combine
  • 📊 Example — Thread-local counter → block-local counter (shared) → global counter
```
// OPTIMIZED: Two-phase atomic (fast for high-contention scenarios)
__global__ void efficientCount(int* data, int* totalCount, int n) {
    __shared__ int blockCount;             // Block-local counter in shared memory
    if (threadIdx.x == 0) blockCount = 0;  // Initialize
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && data[i] > 0)
        atomicAdd(&blockCount, 1);         // Fast: only this block's threads contending
    __syncthreads();

    if (threadIdx.x == 0)                  // Only one thread per block updates global
        atomicAdd(totalCount, blockCount); // Far less contention!
}
```
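The same two-phase idea can be sketched on the host: each worker keeps a private counter (the shared-memory analogue) and touches the contended lock exactly once, at the end. The `count_positive` helper below is my own illustration under those assumptions, not CUDA code.

```python
import threading

def count_positive(data, num_workers=4):
    total = [0]                          # the "global" counter
    total_lock = threading.Lock()        # the global atomic's stand-in

    def worker(chunk):
        local = 0                        # phase 1: private counter, zero contention
        for v in chunk:
            if v > 0:
                local += 1
        with total_lock:                 # phase 2: ONE contended update per worker,
            total[0] += local            # not one per element

    chunks = [data[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

data = [1, -2, 3, 0, 7, -5, 4, 9]
print(count_positive(data))  # 5
```

With n elements and w workers, the contended update runs w times instead of n times — the same reduction in serialization the CUDA version gets by funneling per-thread work through a per-block shared-memory counter.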
📋 Instructions
Simulate atomic operations. Implement a histogram (counting occurrences of values 0-9) for an array of random data:
```
=== Atomic Histogram Simulation ===
Data: 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3
Building histogram atomically...
Value 1: 2 occurrences
Value 2: 1 occurrences
Value 3: 3 occurrences
Value 4: 1 occurrences
Value 5: 3 occurrences
Value 6: 1 occurrences
Value 7: 1 occurrences
Value 8: 1 occurrences
Value 9: 3 occurrences
```
This code is already complete! Run it to see how histograms work. In CUDA, `hist[data[i]]++` would be a race condition — we'd use `atomicAdd(&hist[data[i]], 1)` instead to ensure correctness.