📢 Chapter 4, Part 4: Constant Memory — The Royal Broadcast
💡 Story: The King announces something to ALL soldiers via a royal broadcast. Every soldier hears the same message at the same time — instantly! That's constant memory: read-only data that ALL threads read simultaneously with special hardware caching.
```c
#include <cuda_runtime.h>
#include <stdio.h>

// Declare constant memory GLOBALLY (outside any function).
// It is read-only from kernels and limited to 64 KB total.
__constant__ float kernel_weights[9]; // e.g., a 3x3 convolution filter

__global__ void applyFilter(float* image, float* output, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // ALL threads read kernel_weights[0..8]: a broadcast!
    // When every thread in a warp reads the SAME constant address,
    // the constant cache serves the whole warp in a single transaction.
    float sum = 0.0f;
    for (int k = 0; k < 9; k++) {
        if (i + k < n)
            sum += image[i + k] * kernel_weights[k]; // fast broadcast read
    }
    if (i < n) output[i] = sum;
}

int main() {
    float weights[] = {0.1f, 0.1f, 0.1f, 0.1f, 0.2f,
                       0.1f, 0.1f, 0.1f, 0.1f};
    // Upload to constant memory from the CPU.
    // Pass the symbol itself, not a string or its address.
    cudaMemcpyToSymbol(kernel_weights, weights, 9 * sizeof(float));
    // ... rest of program ...
    return 0;
}
```
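For reference, the elided host code typically allocates device buffers, copies the input over, launches the kernel, and copies the result back. A minimal sketch of that pattern (the size `N`, block size, and the host arrays `h_image`/`h_output` are assumptions, and error checking is omitted for brevity):

```c
// Hypothetical host-side completion; N, block size, and h_image/h_output
// are placeholders, not part of the original program.
int N = 1 << 20;
float *d_image, *d_output;
cudaMalloc(&d_image, N * sizeof(float));
cudaMalloc(&d_output, N * sizeof(float));
cudaMemcpy(d_image, h_image, N * sizeof(float), cudaMemcpyHostToDevice);

int threads = 256;
int blocks = (N + threads - 1) / threads; // round up to cover all N elements
applyFilter<<<blocks, threads>>>(d_image, d_output, N);

cudaMemcpy(h_output, d_output, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_image);
cudaFree(d_output);
```

Note that the constant-memory upload stays a one-time `cudaMemcpyToSymbol`; the kernel never receives the weights as an argument.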
When to use constant memory:
✅ The data is read-only for the entire kernel launch
✅ The data is small (constant memory is limited to 64 KB)
✅ All (or most) threads in a warp read the SAME address at the same time: the broadcast case
🎯 Real-world use: In deep learning kernels, filter weights for a convolutional layer are often stored in constant memory so they can be broadcast to all threads processing different parts of the input feature map.
📋 Instructions
Simulate constant memory by using compile-time constants. Print a convolution filter application:
```
=== Constant Memory Simulation ===
Filter weights: 0.1 0.2 0.4 0.2 0.1
Input data: 10.0 20.0 30.0 40.0 50.0
All 5 threads reading same weights (broadcast!)
Thread 0: sum = 10.0*0.1 + 20.0*0.2 + 30.0*0.4 + 40.0*0.2 + 50.0*0.1 = 30.0
```
This program is complete! Run it to see the simulation. The key lesson: constant memory is perfect for filter weights because ALL threads read the SAME weights — this is the 'broadcast' scenario where constant memory shines.