🚦 Chapter 5, Part 2: __syncthreads() — The Traffic Light
💡 Story: Imagine all soldiers must cross a bridge, but it can only hold 10 at a time. The General shouts STOP — everyone waits. When all are assembled, the General shouts GO — everyone crosses simultaneously. That's __syncthreads() — a barrier that STOPS all threads until everyone arrives.
#include <cuda_runtime.h>
__global__ void syncExample(int* data, int n) {
__shared__ int tile[256]; // Shared memory buffer
int tid = threadIdx.x;
// PHASE 1: Each thread loads data into shared memory
tile[tid] = data[tid]; // Reads from slow global memory
// ===== THE TRAFFIC LIGHT =====
__syncthreads(); // ALL threads MUST arrive here before ANY proceed!
// Without __syncthreads, Thread 0 might already be reading tile[1]
// before Thread 1 has written to tile[1]!
// PHASE 2: Now we can safely read from shared memory
// (We know all threads finished writing in Phase 1)
int neighbor = (tid > 0) ? tile[tid - 1] : 0; // Read neighbor's data
data[tid] = tile[tid] + neighbor; // Compute something useful
// ===== ANOTHER BARRIER =====
__syncthreads(); // Separate PHASE 2 reads from PHASE 3 writes
}
// Rule: If WRITING shared memory → sync → READING shared memory
// Always sync BETWEEN phases!
// ❌ WRONG — DEADLOCK! Some threads skip the sync
__global__ void deadlockBug(int* data) {
if (threadIdx.x < 16) {
// only 16 threads enter here
__syncthreads(); // ← DEADLOCK! Threads 16-31 never reach this!
}
}
// ✅ CORRECT — All threads always reach the sync
__global__ void correct(int* data, bool* flags) {
__shared__ float tile[256];
tile[threadIdx.x] = threadIdx.x; // Everyone writes
__syncthreads(); // Everyone syncs (NO conditionals around it!)
// Now everyone can safely read any element
data[threadIdx.x] = tile[(threadIdx.x + 1) % 256];
}
For the left neighbor: result[t] = shared[t] + (t > 0 ? shared[t-1] : 0); This adds the current value to the previous one. Thread 0 has no left neighbor so adds 0.
⚠️ Try solving it yourself first — you'll learn more!
#include <stdio.h>
int main() {
int global_data[] = {1, 2, 3, 4, 5, 6, 7, 8};
int shared[8];
int result[8];
int n = 8;
printf("=== __syncthreads() Simulation ===\n");
printf("Phase 1: All threads load data into shared memory\n");
for (int t = 0; t < n; t++) shared[t] = global_data[t];
printf("Shared memory after load:");
for (int i = 0; i < n; i++) printf(" %d", shared[i]);
printf("\n");
printf("--- __syncthreads() called ---\n");
printf("Phase 2: Each thread adds its left neighbor\n");
for (int t = 0; t < n; t++) result[t] = shared[t] + (t > 0 ? shared[t-1] : 0);
printf("Result:");
for (int i = 0; i < n; i++) printf(" %d", result[i]);
printf("\n");
return 0;
}