⚡ Chapter 4, Part 2: Shared Memory — The Game Changer!
💡 Story: Global memory is like ordering supplies from a warehouse 100 miles away (slow!). Shared memory is like having a supply closet right in your squad's barracks (FAST!). Each block of threads gets its OWN shared memory closet — and all threads in that block can read/write it instantly.
Shared Memory vs Global Memory:
```
#include <cuda_runtime.h>

// Without shared memory: each thread reads slow global memory 3 times
__global__ void movingAvgSlow(const float* in, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i > 0 && i < n - 1) {
        out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0f; // 3 global reads each!
    }
}
```
```
// WITH shared memory: load the data once into fast memory, then reuse it!
#define BLOCK_SIZE 256

__global__ void movingAvgFast(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK_SIZE + 2]; // +2 for the halo elements

    int global_i = threadIdx.x + blockIdx.x * blockDim.x;
    int local_i  = threadIdx.x + 1; // offset by 1 for the left halo

    // Load one element per thread from global to shared.
    // Guard the (possibly partial) last block against reading past the array!
    tile[local_i] = (global_i < n) ? in[global_i] : 0.0f;

    // Load halo elements (thread 0 loads the left neighbor,
    // the last thread loads the right neighbor)
    if (threadIdx.x == 0)
        tile[0] = (global_i > 0) ? in[global_i - 1] : 0.0f;
    if (threadIdx.x == BLOCK_SIZE - 1)
        tile[BLOCK_SIZE + 1] = (global_i < n - 1) ? in[global_i + 1] : 0.0f;

    __syncthreads(); // WAIT for every thread in the block to finish loading!

    // Now compute from FAST shared memory (far lower latency than global)
    if (global_i < n)
        out[global_i] = (tile[local_i-1] + tile[local_i] + tile[local_i+1]) / 3.0f;
}
```
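To see how this kernel would be driven from the CPU, here is a hedged host-side sketch (the buffer names `h_in`/`h_out` and the problem size are assumptions, and error checking is omitted for brevity):

```cuda
// Hypothetical host-side launch for movingAvgFast (error checks omitted)
void runMovingAvg(const float* h_in, float* h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // Round up so a partial last block still covers the tail of the array
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    movingAvgFast<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Note the round-up grid calculation: this is exactly why the kernel needs its `global_i < n` guards.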
🔑 The Shared Memory Pattern (used in EVERY optimization):
1. LOAD: each thread copies a piece of data from global memory into shared memory.
2. SYNC: call __syncthreads() so the whole block finishes loading before anyone reads.
3. COMPUTE: do the work from the fast shared-memory copy, reusing each element as much as possible.
4. SYNC AGAIN: if the tile will be overwritten for another pass, __syncthreads() before reloading it.
⚠️ Bank Conflicts: Shared memory is divided into 32 banks (consecutive 4-byte words map to consecutive banks). If multiple threads in a warp access DIFFERENT addresses that fall in the SAME bank, the accesses are serialized — that's a bank conflict = slow! (All threads reading the exact same address is a broadcast and is conflict-free.) The classic fix for 2D tiles is to pad each row — __shared__ float tile[SIZE][SIZE + 1]; — so that walking down a column no longer hits the same bank every time.
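The padding trick is easiest to see in a matrix transpose, where threads write the tile row-wise but read it back column-wise. A hedged sketch (a minimal version of the well-known padded-tile transpose pattern, assuming a TILE×TILE thread block):

```cuda
#define TILE 32

// Transpose using a padded shared-memory tile. Without the "+ 1" padding,
// reading a column of tile[][] would hit the same bank 32 times and serialize.
__global__ void transposeTiled(const float* in, float* out,
                               int width, int height) {
    __shared__ float tile[TILE][TILE + 1]; // +1 breaks the bank alignment

    // Read a tile from global memory row-wise (coalesced)
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Write out transposed: swap block indices, read the tile column-wise
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```

With `tile[TILE][TILE]` the column read `tile[threadIdx.x][threadIdx.y]` would land every thread of a warp in one bank; the extra column shifts each row to a different bank.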
📋 Instructions
Simulate the shared memory loading pattern. Generate an array of 8 values and simulate how threads in a block would load them into shared memory:
```
=== Shared Memory Simulation ===
Global memory: 10 20 30 40 50 60 70 80
Loading into shared memory...
Thread 0 loads: global[0]=10 -> shared[0]
Thread 1 loads: global[1]=20 -> shared[1]
Thread 2 loads: global[2]=30 -> shared[2]
Thread 3 loads: global[3]=40 -> shared[3]
Thread 4 loads: global[4]=50 -> shared[4]
Thread 5 loads: global[5]=60 -> shared[5]
Thread 6 loads: global[6]=70 -> shared[6]
Thread 7 loads: global[7]=80 -> shared[7]
__syncthreads() called - all threads synced!
Computing sum from shared memory: 360
```
This code is already complete! Run it to see the shared memory loading pattern. The key insight: each thread loads ONE element from global memory to shared memory, then __syncthreads() ensures all threads finished before computing.