Exercise 33: GPU Histogram (20 XP, Medium)


📊 Chapter 7, Part 3: GPU Histogram — Counting Votes at Scale

💡 Story: Imagine a giant box with 10 million colored balls, and you want to count how many balls are red, green, blue, etc. On a CPU, you pick up balls one at a time: 10 million operations. On a GPU, 10 million soldiers each pick up one ball and shout their color simultaneously. All red-counters update at the same time — but wait, if two soldiers both try to increment the 'red' counter, you get a race condition! The fix? Local tallying then merging.

```cuda
// ❌ Naive: all threads hammering the SAME global bins
__global__ void naiveHistogram(int* data, int* hist, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(&hist[data[i]], 1);  // Works, but ATOMIC CONTENTION!
    // With 1M threads all hitting the 256-bin histogram,
    // many threads compete for the same bin → serialized → slow!
}
```
```cuda
// ✅ Optimized: per-block private histogram, then atomic merge
__global__ void fastHistogram(int* data, int* hist, int n, int numBins) {
    extern __shared__ int localHist[];  // Per-block histogram in shared memory!
    int tid = threadIdx.x;
    int gid = threadIdx.x + blockIdx.x * blockDim.x;

    // Step 1: Initialize the local histogram to zero
    for (int i = tid; i < numBins; i += blockDim.x)
        localHist[i] = 0;
    __syncthreads();

    // Step 2: Each thread increments the LOCAL shared-memory histogram
    // (atomic, but only within one block → much less contention)
    if (gid < n)
        atomicAdd(&localHist[data[gid]], 1);
    __syncthreads();

    // Step 3: Atomically merge the local histogram into the global histogram
    for (int i = tid; i < numBins; i += blockDim.x)
        atomicAdd(&hist[i], localHist[i]);
    // Global atomics are now called numBins times per block (not n times!)
}

// Launch: fastHistogram<<<grid, block, numBins*sizeof(int)>>>(data, hist, n, 256);
```
  • 🎯 Key idea — Let each block build a private histogram in shared memory
  • ⚡ Phase 1 — Per-block atomic increments (low contention — only 256 threads per block compete)
  • 🔗 Phase 2 — One final atomic merge per bin per block into global memory
  • 📉 Contention reduction — from up to n atomic ops on a single global bin (worst case) → just numBlocks ops per bin
  • 🖼️ Use case — Image processing, frequency analysis, data binning, AI feature extraction
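The contention claim in the bullets above can be checked with quick back-of-the-envelope arithmetic. This sketch uses example values (1M elements, 256 threads per block — not from any real launch) and looks at the worst case, where every element lands in the same bin:

```python
# Worst case: every element falls into the SAME bin.
n = 1_000_000          # number of elements (example value)
block_size = 256       # threads per block (example value)
num_blocks = (n + block_size - 1) // block_size  # ceiling division

# Naive kernel: one global atomicAdd per element, all serialized on one bin.
naive_global_atomics = n

# Two-phase kernel: each block merges its local count with ONE global
# atomicAdd per bin, so the hot bin sees only num_blocks global atomics.
two_phase_global_atomics = num_blocks

print(f"Blocks:                 {num_blocks}")
print(f"Naive global atomics:   {naive_global_atomics}")
print(f"Two-phase on hot bin:   {two_phase_global_atomics}")
print(f"Reduction factor:       ~{naive_global_atomics // two_phase_global_atomics}x")
```

The shared-memory atomics inside each block still happen, but they are much cheaper than global atomics and only 256 threads can ever collide on a bin at once.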
📋 Instructions
Build a histogram of the values [2,0,1,2,3,1,0,2,3,3,0,1] into 4 bins, showing the two-phase approach:
```
=== GPU Histogram (Two-Phase) ===
Data: [2, 0, 1, 2, 3, 1, 0, 2, 3, 3, 0, 1]
Bins: 4 (values 0,1,2,3)
--- Phase 1: Build local (per-simulated-block) histograms ---
Block 0 processes [2,0,1,2,3,1] -> local: [1,2,2,1]
Block 1 processes [0,2,3,3,0,1] -> local: [2,1,1,2]
--- Phase 2: Merge into global histogram ---
Global histogram:
  Bin 0: 3
  Bin 1: 3
  Bin 2: 3
  Bin 3: 3
Verification: 3+3+3+3 = 12 (matches input size)
```
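A plain-Python CPU simulation of the two phases might look like the sketch below (the block size of 6 is an assumption chosen so two simulated blocks cover the 12 sample values):

```python
# Simulate the two-phase GPU histogram on the CPU.
data = [2, 0, 1, 2, 3, 1, 0, 2, 3, 3, 0, 1]
num_bins = 4
block_size = 6  # assumption: each simulated "block" handles 6 elements

print("=== GPU Histogram (Two-Phase) ===")
print(f"Data: {data}")
print(f"Bins: {num_bins} (values 0,1,2,3)")

# Phase 1: each block builds a private histogram
# (this models the shared-memory localHist on a real GPU)
print("--- Phase 1: Build local (per-simulated-block) histograms ---")
local_hists = []
for start in range(0, len(data), block_size):
    chunk = data[start:start + block_size]
    local = [0] * num_bins
    for v in chunk:
        local[v] += 1                  # models atomicAdd on shared memory
    local_hists.append(local)
    print(f"Block {start // block_size} processes {chunk} -> local: {local}")

# Phase 2: merge the local histograms into the global one
# (models one global atomicAdd per bin per block)
print("--- Phase 2: Merge into global histogram ---")
hist = [0] * num_bins
for local in local_hists:
    for i, count in enumerate(local):
        hist[i] += count
print("Global histogram:")
for i, count in enumerate(hist):
    print(f"  Bin {i}: {count}")
print(f"Verification: {'+'.join(map(str, hist))} = {sum(hist)} (matches input size)")
```

The merge loop in Phase 2 runs sequentially here; on a real GPU each block would do its own merge concurrently, which is exactly why the global increments must stay atomic.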
Run the code as-is. Notice how the local histograms reduce atomic contention: with 256 bins and 256 threads per block, each shared-memory bin sees on average only one increment per block, versus potentially thousands of threads colliding on the same global bin if every thread updated the global histogram directly.