CUDA Programming GPU Memory — The Treasure Map
Exercise 17

Shared Memory — Team Storage 20 XP Medium


Chapter 4, Part 2: Shared Memory — The Game Changer!

💡 Story: Global memory is like ordering supplies from a warehouse 100 miles away (slow!). Shared memory is like having a supply closet right in your squad's barracks (FAST!). Each block of threads gets its OWN shared memory closet — and all threads in that block can read/write it instantly.

Shared Memory vs Global Memory:

  • Speed — ~100× faster than global memory (on-chip!)
  • 📦 Size — Tiny: 48-96 KB per SM (vs GBs for global)
  • 🔒 Scope — Only threads in the SAME BLOCK can access it
  • Lifetime — Only exists while the block is executing
  • 🔧 Declare — with the __shared__ keyword
```cuda
#include <cuda_runtime.h>

// Without shared memory: each thread reads from slow global memory 3 times
__global__ void movingAvgSlow(float* in, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i > 0 && i < n-1) {
        out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0f;  // 3 global reads each!
    }
}

// WITH shared memory: load data once into fast memory, reuse it!
#define BLOCK_SIZE 256

__global__ void movingAvgFast(float* in, float* out, int n) {
    __shared__ float tile[BLOCK_SIZE + 2];  // extra 2 slots for halo elements

    int global_i = threadIdx.x + blockIdx.x * blockDim.x;
    int local_i  = threadIdx.x + 1;         // offset by 1 for the left halo

    // Load data from global to shared (each thread loads one element);
    // guard against reading past the end of the input array
    if (global_i < n)
        tile[local_i] = in[global_i];

    // Load halo elements (thread 0 loads the left neighbor, the last thread the right)
    if (threadIdx.x == 0)
        tile[0] = (global_i > 0) ? in[global_i - 1] : 0.0f;
    if (threadIdx.x == BLOCK_SIZE - 1)
        tile[BLOCK_SIZE + 1] = (global_i < n - 1) ? in[global_i + 1] : 0.0f;

    __syncthreads();  // WAIT for all threads to finish loading!

    // Now compute from FAST shared memory (~100x faster reads!)
    if (global_i < n)
        out[global_i] = (tile[local_i - 1] + tile[local_i] + tile[local_i + 1]) / 3.0f;
}
```

🔑 The Shared Memory Pattern (used in EVERY optimization):

  • 1️⃣ Declare — __shared__ float tile[SIZE]; inside the kernel
  • 2️⃣ Load — Each thread copies its chunk from global → shared memory
  • 3️⃣ Sync — call __syncthreads(); to wait for ALL threads to finish loading
  • 4️⃣ Compute — Do work using fast shared memory
  • 5️⃣ Write back — Store results to global memory

⚠️ Bank Conflicts: Shared memory is organized into 32 'banks'. If multiple threads in a warp access different addresses that fall in the same bank, the accesses are serialized = slow! (If they all read the SAME address, the value is broadcast — no conflict.) The classic fix for column-wise access to a 2D tile is to pad the row width: __shared__ float tile[SIZE][SIZE + 1]; so consecutive rows start in different banks.
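To see why padding works, here is a small Python sketch (an illustration, not real GPU code) that computes which of the 32 banks each element of a row-major 32×32 float tile maps to. A 4-byte word at index w lives in bank w % 32 on current NVIDIA GPUs:

```python
# Sketch: bank mapping for a 32x32 tile, with and without +1 padding.
NUM_BANKS = 32
TILE = 32

def bank_of(row, col, row_width):
    # word index of tile[row][col] in a row-major 2D array, mod 32 banks
    return (row * row_width + col) % NUM_BANKS

# Column access: all 32 threads of a warp read tile[tid][0]
unpadded = {bank_of(tid, 0, TILE) for tid in range(TILE)}
padded   = {bank_of(tid, 0, TILE + 1) for tid in range(TILE)}

print("distinct banks, unpadded:", len(unpadded))  # 1  -> 32-way conflict
print("distinct banks, padded:  ", len(padded))    # 32 -> conflict-free
```

With row width 32, every element of a column lands in the same bank (row*32 % 32 is always 0); with row width 33, consecutive rows shift by one bank, spreading the column across all 32 banks.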

📋 Instructions
Simulate the shared memory loading pattern. Generate an array of 8 values and simulate how threads in a block would load them into shared memory:

```
=== Shared Memory Simulation ===
Global memory: 10 20 30 40 50 60 70 80
Loading into shared memory...
Thread 0 loads: global[0]=10 -> shared[0]
Thread 1 loads: global[1]=20 -> shared[1]
Thread 2 loads: global[2]=30 -> shared[2]
Thread 3 loads: global[3]=40 -> shared[3]
Thread 4 loads: global[4]=50 -> shared[4]
Thread 5 loads: global[5]=60 -> shared[5]
Thread 6 loads: global[6]=70 -> shared[6]
Thread 7 loads: global[7]=80 -> shared[7]
__syncthreads() called - all threads synced!
Computing sum from shared memory: 360
```
This code is already complete! Run it to see the shared memory loading pattern. The key insight: each thread loads ONE element from global memory to shared memory, then __syncthreads() ensures all threads finished before computing.
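A plain-Python sketch of this simulation might look like the following (a hedged reconstruction, not the platform's actual main.py; the function name simulate_shared_memory_load is made up for illustration):

```python
def simulate_shared_memory_load(global_mem):
    """Simulate one block of threads loading global memory into shared memory."""
    n = len(global_mem)
    shared = [None] * n  # the block's shared memory "closet"

    print("=== Shared Memory Simulation ===")
    print("Global memory:", " ".join(str(v) for v in global_mem))
    print("Loading into shared memory...")

    # Step 2 of the pattern: each "thread" copies exactly one element
    for tid in range(n):
        shared[tid] = global_mem[tid]
        print(f"Thread {tid} loads: global[{tid}]={global_mem[tid]} -> shared[{tid}]")

    # Step 3: the barrier — in real CUDA this is __syncthreads()
    print("__syncthreads() called - all threads synced!")

    # Step 4: compute using only the "shared" copy
    total = sum(shared)
    print("Computing sum from shared memory:", total)
    return total

simulate_shared_memory_load([10, 20, 30, 40, 50, 60, 70, 80])  # sum is 360
```

The loop stands in for threads running in parallel; the key point it models is that no thread computes until every element has been loaded.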