🚧 Chapter 5, Part 4: Memory Fences — Order in the Memory System
💡 Story: Soldiers are doing operations, but the memory system is like a dishonest messenger — it sometimes reorders deliveries to be more 'efficient'. Memory fences are orders to the messenger: "Deliver ALL previous messages BEFORE any new ones!"
Modern processors (including GPUs) reorder memory operations for performance. Sometimes this causes bugs — memory fences enforce ordering:
#include <cuda_runtime.h>
// CUDA Memory Fence Functions:
__threadfence_block(); // Fence for ALL threads in same BLOCK
__threadfence(); // Fence for ALL threads on same GPU (all SMs)
__threadfence_system(); // Fence for ALL threads (GPU + CPU)
// Example use case: producer-consumer pattern between blocks
__global__ void producer(int* data, int* flag) {
data[0] = 42; // Write data
__threadfence(); // Ensure data write is VISIBLE before flag write!
flag[0] = 1; // Signal ready — MUST happen AFTER data write
}
__global__ void consumer(int* data, int* flag) {
while (flag[0] == 0) {} // Spin until flag is set
__threadfence(); // Ensure we see data write, not just flag write
int value = data[0]; // Safe to read data now
printf("Got: %d\n", value); // Should see 42
}
__threadfence() vs __syncthreads():
// volatile tells the compiler: "ALWAYS read from memory, never cache!"
__global__ void spinWait(volatile int* flag) {
while (*flag == 0) {} // Without volatile, compiler might cache flag in a register!
printf("Flag set!\n");
}
// Without volatile:
// The compiler 'optimizes' by reading flag ONCE and storing in register
// If another thread changes flag, this thread NEVER SEES the change!
// volatile forces every read to go to actual memory
📋 Instructions
Simulate the fence-based producer-consumer pattern. Show how without a fence, you could read data before it's written, and with a fence, you read it correctly:
```
=== Memory Fence Simulation ===
--- WITHOUT fence (unsafe) ---
Producer: writes data=42
Consumer: reads data BEFORE fence flushes = 0 (STALE!)
--- WITH __threadfence() ---
Producer: writes data=42
Producer: __threadfence() - all writes flushed
Producer: sets flag=1
Consumer: sees flag=1, reads data=42 (CORRECT!)
```
This code is already complete — run it! The key insight is the order of operations: without a fence, the consumer might read stale (old) data. With __threadfence(), writes are guaranteed to be visible in order.