🔢 Chapter 7, Part 2: Prefix Scan — Every Result Builds on the Last
💡 Story: Imagine 8 scouts each counting gold coins in their sector. The scout in sector 2 wants to know the TOTAL found by all scouts in sectors 0, 1, and 2 combined. Sector 3 wants sectors 0-3. Every scout wants a running total! That's a prefix scan — and GPUs can compute ALL 8 running totals simultaneously.
// Exclusive prefix scan (scan[i] = sum of all elements BEFORE i)
// Input: [1, 2, 3, 4]
// Output: [0, 1, 3, 6] ← each output is sum of previous inputs
// Inclusive prefix scan (scan[i] = sum INCLUDING element i)
// Input: [1, 2, 3, 4]
// Output: [1, 3, 6, 10] ← each output includes current element
__global__ void inclusiveScan(int* data, int* out, int n) {
__shared__ int sdata[256];
int tid = threadIdx.x;
sdata[tid] = (tid < n) ? data[tid] : 0;
__syncthreads();
// Up-sweep (reduce) phase — build partial sums
for (int stride = 1; stride < n; stride <<= 1) {
int val = 0;
if (tid >= stride) val = sdata[tid - stride];
__syncthreads();
sdata[tid] += val; // Add left neighbor at distance=stride
__syncthreads();
}
// After this loop, sdata[i] = sum of data[0..i] (inclusive scan)
if (tid < n) out[tid] = sdata[tid];
}
Where is prefix scan used?
// Visualizing inclusive scan on [1, 2, 3, 4, 5, 6, 7, 8]:
// Step 1 (stride=1): each element += left neighbor
// [1, 1+2, 2+3, 3+4, 4+5, 5+6, 6+7, 7+8] = [1, 3, 5, 7, 9, 11, 13, 15]
// Step 2 (stride=2): each element += element 2 positions left
// [1, 3, 1+5, 3+7, 9, 11, 9+13, 11+15] = [1, 3, 6, 10, 9, 11, 22, 26]
// Hmm, this needs careful tracking — the Hillis-Steele algorithm!
// Full scan of [1,2,3,4,5,6,7,8] → [1,3,6,10,15,21,28,36]
📋 Instructions
Compute both exclusive and inclusive prefix scans on [1, 2, 3, 4, 5]:
```
=== Prefix Scan Demo ===
Input: [1, 2, 3, 4, 5]
Inclusive Scan (includes self):
out[0] = 1
out[1] = 3
out[2] = 6
out[3] = 10
out[4] = 15
Result: [1, 3, 6, 10, 15]
Exclusive Scan (sum of all BEFORE):
out[0] = 0
out[1] = 1
out[2] = 3
out[3] = 6
out[4] = 10
Result: [0, 1, 3, 6, 10]
Note: exclusive[i] = inclusive[i-1] (with exclusive[0] = 0)
```
Inclusive scan: out[i] = sum of data[0..i]. Serial loop: incl[0]=data[0], incl[i]=incl[i-1]+data[i]. Exclusive: excl[0]=0, excl[i]=incl[i-1]. In GPU, the Hillis-Steele algorithm does this in log₂(n) rounds with all-thread parallelism.