CUDA Programming Parallel Patterns
💡
Exercise 34

Stencil Computation 20 XP Medium

Ctrl+Enter Run Ctrl+S Save

🌊 Chapter 7, Part 4: Stencil Computation — Reading Your Neighbors

💡 Story: In weather simulation, the temperature at each point is influenced by its neighbors — north, south, east, west. This 'update based on neighborhood' is a stencil. Every weather grid cell (thread) needs to read surrounding cells. Load the neighborhood into shared memory first, then every thread computes its update instantly — no re-reading from slow global memory!

// 1D Stencil: out[i] = (in[i-1] + in[i] + in[i+1]) / 3 (simple blur) #define TILE 256 #define HALO 1 // 1 halo element on each side __global__ void stencil1D(float* in, float* out, int n) { __shared__ float tile[TILE + 2 * HALO]; // Tile + HALO on both sides int gid = threadIdx.x + blockIdx.x * blockDim.x; int tid = threadIdx.x; // Load main data tile[tid + HALO] = (gid < n) ? in[gid] : 0.0f; // Load LEFT halo (first thread loads left neighbor) if (tid == 0) tile[0] = (gid > 0) ? in[gid - 1] : 0.0f; // Load RIGHT halo (last thread loads right neighbor) if (tid == blockDim.x - 1) tile[tid + HALO + 1] = (gid < n-1) ? in[gid + 1] : 0.0f; __syncthreads(); // Wait for all threads including halo loaders! // Now compute: use only shared memory if (gid > 0 && gid < n-1) { out[gid] = (tile[tid] + tile[tid+1] + tile[tid+2]) / 3.0f; // tile[tid] = in[i-1] (left halo or previous element) // tile[tid+1] = in[i] (current) // tile[tid+2] = in[i+1] (right halo or next element) } }

Key concept: Halo (ghost) elements

  • 📦 Halo — Extra border elements loaded beyond a tile's own data, needed for stencil neighbors
  • 🏗️ Tile size — Internal data: TILE (256) + halo on left (1) + halo on right (1) = 258 elements
  • 🤝 Two blocks share — The border element between blocks is loaded by BOTH as a halo — redundant but correct
  • 📐 Stencil radius — For a 5-point stencil (±2 neighbors), use HALO=2, shared size = TILE + 4
  • 🔥 Applications — Image blur, edge detection (Sobel/Laplacian), finite-difference PDE solvers, heat simulation
📋 Instructions
Apply a 1D stencil (3-point average) to [10, 20, 30, 40, 50], keeping boundaries unchanged: ``` === 1D Stencil: 3-Point Average === Input: [10, 20, 30, 40, 50] Computing stencil (boundary elements unchanged): out[0] = 10 (boundary) out[1] = (10 + 20 + 30) / 3 = 20 out[2] = (20 + 30 + 40) / 3 = 30 out[3] = (30 + 40 + 50) / 3 = 40 out[4] = 50 (boundary) Output: [10, 20, 30, 40, 50] Note: A 3-point average on a linear ramp = same values! ```
The loop runs from i=1 to i=n-2 (skipping boundaries). Each output is (left + self + right)/3. On GPU, each thread does this independently — but needs its neighbors loaded into shared memory first! The halo region ensures boundary threads can read neighbors from adjacent blocks.
main.py
Hi! I'm Rex 👋
Output
Ready. Press ▶ Run or Ctrl+Enter.