CUDA Programming: GPU Memory — The Treasure Map
Exercise 19

Constant Memory — 15 XP — Medium


📢 Chapter 4, Part 4: Constant Memory — The Royal Broadcast

💡 Story: The King announces something to ALL soldiers via a royal broadcast. Every soldier hears the same message at the same time — instantly! That's constant memory: read-only data that ALL threads read simultaneously with special hardware caching.

Constant Memory:

  • 📖 Read-only — Threads can ONLY read it, never write
  • 📡 Broadcast — When ALL threads in a warp access the SAME address, it's served in 1 cycle!
  • 💾 Size — 64 KB total
  • 🔧 Declare — with the __constant__ keyword, at global scope (outside any function)
  • 📤 Fill — CPU fills it with cudaMemcpyToSymbol()
```
#include <cuda_runtime.h>
#include <stdio.h>

// Declare constant memory GLOBALLY (outside any function)
__constant__ float kernel_weights[9]; // e.g., 3x3 convolution filter (flattened)

__global__ void applyFilter(float* image, float* output, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // ALL threads read kernel_weights[0..8] — BROADCAST!
    // The constant cache serves all 32 warp threads in 1 cycle!
    float sum = 0.0f;
    for (int k = 0; k < 9; k++) {
        if (i + k < n)
            sum += image[i + k] * kernel_weights[k]; // Fast broadcast read
    }
    if (i < n) output[i] = sum;
}

int main() {
    float weights[] = {0.1f, 0.1f, 0.1f, 0.1f, 0.2f,
                       0.1f, 0.1f, 0.1f, 0.1f};

    // Upload to constant memory from CPU
    cudaMemcpyToSymbol(kernel_weights, weights, 9 * sizeof(float));

    // ... rest of program ...
    return 0;
}
```

When to use constant memory:

  • Perfect for — Filter coefficients, lookup tables, mathematical constants, model weights that don't change during a kernel
  • Great when — ALL threads read the SAME values (broadcast!)
  • Avoid when — Different threads read different elements (no benefit, can be slower due to serialization)
  • 📏 Limit — Only 64KB total — don't store large data here
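The broadcast-vs-serialization trade-off above can be sketched with a tiny Python cost model. This is an illustrative toy (the function name and the "1 cycle per distinct address" rule are a simplification, not real hardware timing), but it captures why same-address reads win:

```python
# Toy model of the constant-cache rule described above:
# one transaction per DISTINCT address requested by a warp.
# (Simplified illustration — not actual hardware timing.)

WARP_SIZE = 32

def constant_access_cycles(addresses):
    """Cycles to serve one warp's constant reads: 1 per distinct address."""
    return len(set(addresses))

# Broadcast: all 32 threads read kernel_weights[0] -> served in 1 cycle
broadcast = [0] * WARP_SIZE
print("Broadcast cycles:", constant_access_cycles(broadcast))

# Divergent: each thread reads a different element -> serialized, 32 cycles
divergent = list(range(WARP_SIZE))
print("Divergent cycles:", constant_access_cycles(divergent))
```

Run it and the divergent pattern costs 32x the broadcast pattern — exactly the "avoid when" case in the list above.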

🎯 Real-world use: In deep learning kernels, filter weights for a convolutional layer are often stored in constant memory so they can be broadcast to all threads processing different parts of the input feature map.

📋 Instructions
Simulate constant memory by using compile-time constants. Print a convolution filter application:

```
=== Constant Memory Simulation ===
Filter weights: 0.1 0.2 0.4 0.2 0.1
Input data: 10.0 20.0 30.0 40.0 50.0
All 5 threads reading same weights (broadcast!)
Thread 0: sum = 10.0*0.1 + 20.0*0.2 + 30.0*0.4 + 40.0*0.2 + 50.0*0.1 = 30.0
```
This program is complete! Run it to see the simulation. The key lesson: constant memory is perfect for filter weights because ALL threads read the SAME weights — this is the 'broadcast' scenario where constant memory shines.
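A minimal Python sketch of such a simulation might look like the following (names like FILTER_WEIGHTS and INPUT are illustrative choices, not necessarily the exercise's actual code):

```python
# Simulated "constant memory": module-level, read-only tuples that every
# simulated thread reads — mimicking the broadcast in real CUDA.
FILTER_WEIGHTS = (0.1, 0.2, 0.4, 0.2, 0.1)   # threads only read, never write
INPUT = (10.0, 20.0, 30.0, 40.0, 50.0)

print("=== Constant Memory Simulation ===")
print("Filter weights:", " ".join(f"{w:.1f}" for w in FILTER_WEIGHTS))
print("Input data:", " ".join(f"{x:.1f}" for x in INPUT))
print(f"All {len(INPUT)} threads reading same weights (broadcast!)")

# Thread 0 applies the whole filter to its input window
total = sum(x * w for x, w in zip(INPUT, FILTER_WEIGHTS))
expr = " + ".join(f"{x}*{w}" for x, w in zip(INPUT, FILTER_WEIGHTS))
print(f"Thread 0: sum = {expr} = {total:.1f}")
```

The sum works out to 1.0 + 4.0 + 12.0 + 8.0 + 5.0 = 30.0, matching the expected output above.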