CUDA Programming: Sync or Chaos!
Exercise 23

Atomic Operations · 20 XP · Medium


⚛️ Chapter 5, Part 3: Atomic Operations — The Unbreakable Read-Modify-Write

💡 Story: The only way to safely let all soldiers update the shared scoreboard is to give one soldier at a time exclusive control. Atomic operations are hardware-guaranteed to complete as a single, uninterruptible step — no thread can interfere!

CUDA's atomic intrinsics — the exact solution to race conditions:

```
#include <cuda_runtime.h>

// CUDA Atomic Operations (work in global OR shared memory)
atomicAdd(ptr, val);           // *ptr += val  (SAFE! Atomic!)
atomicSub(ptr, val);           // *ptr -= val
atomicMax(ptr, val);           // *ptr = max(*ptr, val)
atomicMin(ptr, val);           // *ptr = min(*ptr, val)
atomicExch(ptr, val);          // old = *ptr; *ptr = val; return old
atomicCAS(ptr, compare, val);  // Compare-And-Swap
atomicAnd(ptr, val);           // *ptr &= val (bitwise AND)
atomicOr(ptr, val);            // *ptr |= val (bitwise OR)
atomicXor(ptr, val);           // *ptr ^= val (bitwise XOR)

// All atomic functions return the OLD value:
int old = atomicAdd(ptr, 5);   // old = *ptr before adding 5
```
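The "returns the OLD value" contract is easy to model on the host. Below is a minimal Python sketch (the `AtomicInt` class is my own toy model, not a CUDA or standard-library API): a lock plays the role of the hardware guarantee, making the read-modify-write one indivisible step.

```python
import threading

class AtomicInt:
    """Toy host-side model of a counter updated like a hardware atomic."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def atomic_add(self, val):
        # The lock makes read-modify-write indivisible, and the OLD value
        # is returned, mirroring CUDA's atomicAdd semantics.
        with self._lock:
            old = self.value
            self.value = old + val
            return old

counter = AtomicInt()

def hammer():
    for _ in range(1000):
        counter.atomic_add(1)

threads = [threading.Thread(target=hammer) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value)  # 8000 every run; a plain unlocked "+=" could lose updates
```

With 8 threads each adding 1 a thousand times, the total is always exactly 8000 — no increments are ever lost, which is precisely the guarantee CUDA atomics give on the GPU.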

Real example — Histogram using atomics:

```
__global__ void histogram(int* data, int* hist, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        int bucket = data[i] % 256;   // Which histogram bucket?
        atomicAdd(&hist[bucket], 1);  // SAFE: multiple threads can update different/same buckets
    }
}
// Without atomicAdd, multiple threads modifying hist[bucket] simultaneously
// would cause a race condition and give wrong counts!
```
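Since this course's exercises simulate kernels in Python, here is a rough host-side analogy of the same kernel. Python threads stand in for CUDA threads and a `threading.Lock` stands in for the hardware atomic; `build_histogram` and the chunking scheme are my own illustration, not the exercise's provided code.

```python
import threading

def build_histogram(data, num_buckets=10, num_workers=4):
    hist = [0] * num_buckets
    lock = threading.Lock()              # stands in for the hardware atomic

    def worker(chunk):
        for v in chunk:
            with lock:                   # analogue of atomicAdd(&hist[bucket], 1)
                hist[v % num_buckets] += 1

    # Split the data across "threads" (round-robin chunks)
    chunks = [data[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return hist

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
hist = build_histogram(data)
for v in range(10):
    if hist[v]:
        print(f"Value {v}: {hist[v]} occurrences")
```

Because every increment happens under the lock, the final counts are correct no matter how the four workers interleave — the same property the real kernel gets from `atomicAdd`.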

⚠️ Performance consideration:

  • Good — Different threads update DIFFERENT memory locations
  • ⚠️ Slow — Many threads updating the SAME location (serialized!)
  • 🔧 Solution — Use per-block partial results in shared memory first, then atomically combine
  • 📊 Example — Thread-local counter → block-local counter (shared) → global counter
```
// OPTIMIZED: Two-phase atomic (fast for high-contention scenarios)
__global__ void efficientCount(int* data, int* totalCount, int n) {
    __shared__ int blockCount;             // Block-local counter in shared memory
    if (threadIdx.x == 0) blockCount = 0;  // Initialize
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && data[i] > 0)
        atomicAdd(&blockCount, 1);         // Fast: only this block's threads contending
    __syncthreads();

    if (threadIdx.x == 0)                  // Only one thread per block updates global
        atomicAdd(totalCount, blockCount); // Far less contention!
}
```
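The same two-phase idea can be sketched on the host: each worker keeps a private counter (the shared-memory analogue) and touches the contended lock exactly once, at the end. The `count_positive` helper below is my own illustration under those assumptions, not CUDA code.

```python
import threading

def count_positive(data, num_workers=4):
    total = [0]                          # the "global" counter
    total_lock = threading.Lock()        # the global atomic's stand-in

    def worker(chunk):
        local = 0                        # phase 1: private counter, zero contention
        for v in chunk:
            if v > 0:
                local += 1
        with total_lock:                 # phase 2: ONE contended update per worker,
            total[0] += local            # not one per element

    chunks = [data[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

data = [1, -2, 3, 0, 7, -5, 4, 9]
print(count_positive(data))  # 5
```

With n elements and w workers, the contended update runs w times instead of n times — the same reduction in serialization the CUDA version gets by funneling per-thread work through a per-block shared-memory counter.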
📋 Instructions
Simulate atomic operations. Implement a histogram (counting occurrences of values 0-9) for an array of random data:
```
=== Atomic Histogram Simulation ===
Data: 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3
Building histogram atomically...
Value 1: 2 occurrences
Value 2: 1 occurrences
Value 3: 3 occurrences
Value 4: 1 occurrences
Value 5: 3 occurrences
Value 6: 1 occurrences
Value 7: 1 occurrences
Value 8: 1 occurrences
Value 9: 3 occurrences
```
This code is already complete! Run it to see how histograms work. In CUDA, `hist[data[i]]++` would be a race condition — we'd use `atomicAdd(&hist[data[i]], 1)` instead to ensure correctness.