CUDA Programming: Sync or Chaos!
Exercise 24

Memory Fences · 20 XP · Hard


🚧 Chapter 5, Part 4: Memory Fences — Order in the Memory System

💡 Story: The soldiers are carrying out operations, but the memory system is like a dishonest messenger: it sometimes reorders deliveries to be more 'efficient'. Memory fences are orders to the messenger: "Deliver ALL previous messages BEFORE any new ones!"

Modern processors (including GPUs) reorder memory operations for performance. Sometimes this causes bugs — memory fences enforce ordering:

```
#include <cuda_runtime.h>
#include <cstdio>   // for device-side printf

// CUDA memory fence functions:
// __threadfence_block();  // Fence visible to all threads in the same BLOCK
// __threadfence();        // Fence visible to all threads on the same GPU (all SMs)
// __threadfence_system(); // Fence visible to all threads in the system (GPU + CPU)

// Example use case: producer-consumer pattern between blocks
__global__ void producer(int* data, int* flag) {
    data[0] = 42;     // Write data
    __threadfence();  // Ensure the data write is visible BEFORE the flag write!
    flag[0] = 1;      // Signal ready — must become visible AFTER the data write
}

__global__ void consumer(int* data, int* flag) {
    while (flag[0] == 0) {}  // Spin until the flag is set
                             // (in real code, flag should be volatile; see below)
    __threadfence();         // Ensure we see the data write, not just the flag write
    int value = data[0];     // Safe to read data now
    printf("Got: %d\n", value);  // Should print 42
}
```

__threadfence() vs __syncthreads():

  • 🔄 __syncthreads() — Barrier: ALL threads in the block STOP and WAIT at this point
  • 🚧 __threadfence() — Memory ordering: threads DON'T stop, but their earlier writes are guaranteed to become visible before their later ones
  • __threadfence() is non-blocking (cheaper), but it only guarantees ordering, not synchronization
  • 🔒 Use __syncthreads() when ALL threads in a block must finish a phase before any may proceed
  • 🚧 Use __threadfence() when your writes must become visible to other threads in the correct order
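The barrier half of this comparison has a direct CPU-side analogy. Here is a minimal sketch in plain Python threading (not CUDA; threading.Barrier plays the role of __syncthreads(), and the names are made up for this example). Note that __threadfence() has no equally direct Python analog, so only the barrier behavior is shown:

```python
import threading

NUM_THREADS = 4
barrier = threading.Barrier(NUM_THREADS)  # analogous to __syncthreads(): all must arrive
events = []                               # shared log of what happened, in order
lock = threading.Lock()                   # protects appends to the shared log

def worker(i):
    with lock:
        events.append("before")
    barrier.wait()        # every thread blocks here until all NUM_THREADS arrive
    with lock:
        events.append("after")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The barrier separates the two phases: every "before" precedes every "after".
print(events[:NUM_THREADS], events[NUM_THREADS:])
```

Because no thread can pass `barrier.wait()` until all four have arrived, the first four log entries are always "before" and the last four are always "after", no matter how the threads are scheduled.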

🔬 The Volatile keyword:

```
// volatile tells the compiler: "ALWAYS read from memory, never cache!"
__global__ void spinWait(volatile int* flag) {
    while (*flag == 0) {}  // Without volatile, the compiler might cache *flag in a register!
    printf("Flag set!\n");
}

// Without volatile:
//   the compiler may 'optimize' by reading *flag ONCE and keeping it in a register;
//   if another thread changes the flag, this thread NEVER SEES the change.
// volatile forces every read to go to actual memory.
```
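Python has no volatile keyword, but the register-caching pitfall can be mimicked with a deliberately simplified, single-threaded analogy (hypothetical, for illustration only): copying a shared value into a local variable behaves like a register-cached read, while re-reading the shared location behaves like a volatile load.

```python
# Hypothetical analogy: 'cached' plays the role of a register-cached read.
flag = [0]         # shared memory location

cached = flag[0]   # read ONCE, like a non-volatile value kept in a register
flag[0] = 1        # "another thread" updates the flag in memory

print(cached)      # 0: the stale cached copy never sees the update
print(flag[0])     # 1: a fresh read from memory sees it
```

This is exactly why the spin-wait loop above must use volatile: a loop testing the cached copy would spin forever.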
📋 Instructions

Simulate the fence-based producer-consumer pattern. Show how, without a fence, the consumer can read data before it is written, and how, with a fence, it reads the correct value:

```
=== Memory Fence Simulation ===
--- WITHOUT fence (unsafe) ---
Producer: writes data=42
Consumer: reads data BEFORE fence flushes = 0 (STALE!)
--- WITH __threadfence() ---
Producer: writes data=42
Producer: __threadfence() - all writes flushed
Producer: sets flag=1
Consumer: sees flag=1, reads data=42 (CORRECT!)
```
This code is already complete; just run it! The key insight is the order of operations: without a fence, the consumer might read stale (old) data. With __threadfence(), writes are guaranteed to become visible in order.
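The exercise's main.py is not shown here, but a minimal sketch of such a simulation could look like the following (plain Python; the Memory class and its method names are made up for this sketch, modeling unfenced writes as sitting in a store buffer until a fence commits them):

```python
class Memory:
    """Toy model: writes are buffered and only become visible after a fence."""
    def __init__(self):
        self.committed = {"data": 0, "flag": 0}  # what other threads can see
        self.pending = {}                        # writes not yet made visible

    def write(self, name, value):
        self.pending[name] = value               # buffered, NOT yet visible

    def threadfence(self):
        self.committed.update(self.pending)      # flush: all prior writes now visible
        self.pending.clear()

    def read(self, name):
        return self.committed[name]              # readers see only committed state


print("=== Memory Fence Simulation ===")

print("--- WITHOUT fence (unsafe) ---")
mem = Memory()
mem.write("data", 42)
print("Producer: writes data=42")
print(f"Consumer: reads data BEFORE fence flushes = {mem.read('data')} (STALE!)")

print("--- WITH __threadfence() ---")
mem = Memory()
mem.write("data", 42)
print("Producer: writes data=42")
mem.threadfence()
print("Producer: __threadfence() - all writes flushed")
mem.write("flag", 1)
mem.threadfence()  # the flag store also becomes visible eventually; flush it too
print("Producer: sets flag=1")
if mem.read("flag") == 1:
    print(f"Consumer: sees flag=1, reads data={mem.read('data')} (CORRECT!)")
```

Run it and compare against the expected output above: the unfenced read returns the stale 0, while the fenced sequence guarantees the consumer sees data=42 once it sees flag=1.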