CUDA Programming GPU Memory — The Treasure Map
Exercise 17

Shared Memory — Team Storage 20 XP Medium


Chapter 4, Part 2: Shared Memory — The Game Changer!

💡 Story: Global memory is like ordering supplies from a warehouse 100 miles away (slow!). Shared memory is like having a supply closet right in your squad's barracks (FAST!). Each block of threads gets its OWN shared memory closet — and all threads in that block can read/write it instantly.

Shared Memory vs Global Memory:

  • Speed — ~100× faster than global memory (on-chip!)
  • 📦 Size — Tiny: 48-96 KB per SM (vs GBs for global)
  • 🔒 Scope — Only threads in the SAME BLOCK can access it
  • Lifetime — Only exists while the block is executing
  • 🔧 Declare — with the __shared__ keyword
```cuda
#include <cuda_runtime.h>

// Without shared memory: each thread reads from slow global memory 3 times
__global__ void movingAvgSlow(float* in, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i > 0 && i < n-1) {
        out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0f;  // 3 global reads each!
    }
}

// WITH shared memory: load data once into fast memory, reuse it!
#define BLOCK_SIZE 256

__global__ void movingAvgFast(float* in, float* out, int n) {
    __shared__ float tile[BLOCK_SIZE + 2];  // extra 2 slots for halo elements

    int global_i = threadIdx.x + blockIdx.x * blockDim.x;
    int local_i  = threadIdx.x + 1;         // offset by 1 for the left halo

    // Load data from global to shared (each thread loads one element);
    // guard against reading past the end of the input array
    if (global_i < n)
        tile[local_i] = in[global_i];

    // Load halo elements (thread 0 loads the left neighbor, the last thread the right)
    if (threadIdx.x == 0)
        tile[0] = (global_i > 0) ? in[global_i - 1] : 0.0f;
    if (threadIdx.x == BLOCK_SIZE - 1)
        tile[BLOCK_SIZE + 1] = (global_i < n - 1) ? in[global_i + 1] : 0.0f;

    __syncthreads();  // WAIT for all threads to finish loading!

    // Now compute from FAST shared memory (~100x faster reads!)
    if (global_i < n)
        out[global_i] = (tile[local_i - 1] + tile[local_i] + tile[local_i + 1]) / 3.0f;
}
```

🔑 The Shared Memory Pattern (used in EVERY optimization):

  • 1️⃣ Declare — __shared__ float tile[SIZE]; inside the kernel
  • 2️⃣ Load — Each thread copies its chunk from global → shared memory
  • 3️⃣ Sync — call __syncthreads(); to wait for ALL threads to finish loading
  • 4️⃣ Compute — Do work using fast shared memory
  • 5️⃣ Write back — Store results to global memory

⚠️ Bank Conflicts: Shared memory is organized into 32 'banks'. If multiple threads in a warp access different addresses that fall in the same bank, the accesses are serialized = slow! (If they all read the SAME address, the value is broadcast — no conflict.) The classic fix for column-wise access to a 2D tile is to pad the row width: __shared__ float tile[SIZE][SIZE + 1]; so consecutive rows start in different banks.
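To see why padding works, here is a small Python sketch (an illustration, not real GPU code) that computes which of the 32 banks each element of a row-major 32×32 float tile maps to. A 4-byte word at index w lives in bank w % 32 on current NVIDIA GPUs:

```python
# Sketch: bank mapping for a 32x32 tile, with and without +1 padding.
NUM_BANKS = 32
TILE = 32

def bank_of(row, col, row_width):
    # word index of tile[row][col] in a row-major 2D array, mod 32 banks
    return (row * row_width + col) % NUM_BANKS

# Column access: all 32 threads of a warp read tile[tid][0]
unpadded = {bank_of(tid, 0, TILE) for tid in range(TILE)}
padded   = {bank_of(tid, 0, TILE + 1) for tid in range(TILE)}

print("distinct banks, unpadded:", len(unpadded))  # 1  -> 32-way conflict
print("distinct banks, padded:  ", len(padded))    # 32 -> conflict-free
```

With row width 32, every element of a column lands in the same bank (row*32 % 32 is always 0); with row width 33, consecutive rows shift by one bank, spreading the column across all 32 banks.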

📋 Instructions
Simulate the shared memory loading pattern. Generate an array of 8 values and simulate how threads in a block would load them into shared memory:

```
=== Shared Memory Simulation ===
Global memory: 10 20 30 40 50 60 70 80
Loading into shared memory...
Thread 0 loads: global[0]=10 -> shared[0]
Thread 1 loads: global[1]=20 -> shared[1]
Thread 2 loads: global[2]=30 -> shared[2]
Thread 3 loads: global[3]=40 -> shared[3]
Thread 4 loads: global[4]=50 -> shared[4]
Thread 5 loads: global[5]=60 -> shared[5]
Thread 6 loads: global[6]=70 -> shared[6]
Thread 7 loads: global[7]=80 -> shared[7]
__syncthreads() called - all threads synced!
Computing sum from shared memory: 360
```
This code is already complete! Run it to see the shared memory loading pattern. The key insight: each thread loads ONE element from global memory to shared memory, then __syncthreads() ensures all threads finished before computing.
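A plain-Python sketch of this simulation might look like the following (a hedged reconstruction, not the platform's actual main.py; the function name simulate_shared_memory_load is made up for illustration):

```python
def simulate_shared_memory_load(global_mem):
    """Simulate one block of threads loading global memory into shared memory."""
    n = len(global_mem)
    shared = [None] * n  # the block's shared memory "closet"

    print("=== Shared Memory Simulation ===")
    print("Global memory:", " ".join(str(v) for v in global_mem))
    print("Loading into shared memory...")

    # Step 2 of the pattern: each "thread" copies exactly one element
    for tid in range(n):
        shared[tid] = global_mem[tid]
        print(f"Thread {tid} loads: global[{tid}]={global_mem[tid]} -> shared[{tid}]")

    # Step 3: the barrier — in real CUDA this is __syncthreads()
    print("__syncthreads() called - all threads synced!")

    # Step 4: compute using only the "shared" copy
    total = sum(shared)
    print("Computing sum from shared memory:", total)
    return total

simulate_shared_memory_load([10, 20, 30, 40, 50, 60, 70, 80])  # sum is 360
```

The loop stands in for threads running in parallel; the key point it models is that no thread computes until every element has been loaded.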