🌊 Chapter 7, Part 4: Stencil Computation — Reading Your Neighbors
💡 Story: In weather simulation, the temperature at each point is influenced by its neighbors — north, south, east, west. This 'update based on neighborhood' is a stencil. Every weather grid cell (thread) needs to read surrounding cells. Load the neighborhood into shared memory first, then every thread computes its update instantly — no re-reading from slow global memory!
// 1D Stencil: out[i] = (in[i-1] + in[i] + in[i+1]) / 3 (simple blur)
#define TILE 256
#define HALO 1 // 1 halo element on each side
__global__ void stencil1D(float* in, float* out, int n) {
__shared__ float tile[TILE + 2 * HALO]; // Tile + HALO on both sides
int gid = threadIdx.x + blockIdx.x * blockDim.x;
int tid = threadIdx.x;
// Load main data
tile[tid + HALO] = (gid < n) ? in[gid] : 0.0f;
// Load LEFT halo (first thread loads left neighbor)
if (tid == 0)
tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
// Load RIGHT halo (last thread loads right neighbor)
if (tid == blockDim.x - 1)
tile[tid + HALO + 1] = (gid < n-1) ? in[gid + 1] : 0.0f;
__syncthreads(); // Wait for all threads including halo loaders!
// Now compute: use only shared memory
if (gid > 0 && gid < n-1) {
out[gid] = (tile[tid] + tile[tid+1] + tile[tid+2]) / 3.0f;
// tile[tid] = in[i-1] (left halo or previous element)
// tile[tid+1] = in[i] (current)
// tile[tid+2] = in[i+1] (right halo or next element)
}
}
Key concept: Halo (ghost) elements
📋 Instructions
Apply a 1D stencil (3-point average) to [10, 20, 30, 40, 50], keeping boundaries unchanged:
```
=== 1D Stencil: 3-Point Average ===
Input: [10, 20, 30, 40, 50]
Computing stencil (boundary elements unchanged):
out[0] = 10 (boundary)
out[1] = (10 + 20 + 30) / 3 = 20
out[2] = (20 + 30 + 40) / 3 = 30
out[3] = (30 + 40 + 50) / 3 = 40
out[4] = 50 (boundary)
Output: [10, 20, 30, 40, 50]
Note: A 3-point average on a linear ramp = same values!
```
The loop runs from i=1 to i=n-2 (skipping boundaries). Each output is (left + self + right)/3. On GPU, each thread does this independently — but needs its neighbors loaded into shared memory first! The halo region ensures boundary threads can read neighbors from adjacent blocks.