⚛️ Chapter 5, Part 3: Atomic Operations — The Unbreakable Read-Modify-Write
💡 Story: The only way to safely let all soldiers update the shared scoreboard is to give one soldier at a time exclusive control. Atomic operations are hardware-guaranteed to complete as a single, uninterruptible step — no thread can interfere!
CUDA's atomic intrinsics — the exact solution to race conditions:
#include <cuda_runtime.h>
// CUDA Atomic Operations (work in global OR shared memory)
atomicAdd(ptr, val); // *ptr += val (SAFE! Atomic!)
atomicSub(ptr, val); // *ptr -= val
atomicMax(ptr, val); // *ptr = max(*ptr, val)
atomicMin(ptr, val); // *ptr = min(*ptr, val)
atomicExch(ptr, val); // old = *ptr; *ptr = val; return old
atomicCAS(ptr, compare, val); // Compare-And-Swap
atomicAnd(ptr, val); // *ptr &= val (bitwise AND)
atomicOr(ptr, val); // *ptr |= val (bitwise OR)
atomicXor(ptr, val); // *ptr ^= val (bitwise XOR)
// All atomic functions return the OLD value:
int old = atomicAdd(ptr, 5); // old = *ptr before adding 5
Real example — Histogram using atomics:
__global__ void histogram(int* data, int* hist, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n) {
int bucket = data[i] % 256; // Which histogram bucket?
atomicAdd(&hist[bucket], 1); // SAFE: multiple threads can update different/same buckets
}
}
// Without atomicAdd, multiple threads modifying hist[bucket] simultaneously
// would cause a race condition and give wrong counts!
⚠️ Performance consideration:
// OPTIMIZED: Two-phase atomic (fast for high-contention scenarios)
__global__ void efficientCount(int* data, int* totalCount, int n) {
__shared__ int blockCount; // Block-local counter in shared memory
if (threadIdx.x == 0) blockCount = 0; // Initialize
__syncthreads();
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n && data[i] > 0)
atomicAdd(&blockCount, 1); // Fast: only 256 threads contending
__syncthreads();
if (threadIdx.x == 0) // Only one thread per block updates global
atomicAdd(totalCount, blockCount); // Fewer contentions!
}
📋 Instructions
Simulate atomic operations. Implement a histogram (counting occurrences of values 0-9) for an array of random data:
```
=== Atomic Histogram Simulation ===
Data: 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3
Building histogram atomically...
Value 1: 2 occurrences
Value 2: 1 occurrences
Value 3: 3 occurrences
Value 4: 1 occurrences
Value 5: 3 occurrences
Value 6: 1 occurrences
Value 7: 1 occurrences
Value 8: 1 occurrences
Value 9: 3 occurrences
```
This code is already complete! Run it to see how histograms work. In CUDA, hist[data[i]]++ would be a race condition — we'd use atomicAdd(&hist[data[i]], 1) instead to ensure correctness.