CUDA Programming: Sync or Chaos!
Exercise 22

__syncthreads() (20 XP, Medium)


🚦 Chapter 5, Part 2: __syncthreads() — The Traffic Light

💡 Story: Imagine all soldiers must cross a bridge, but it can only hold 10 at a time. The General shouts STOP — everyone waits. When all are assembled, the General shouts GO — everyone crosses simultaneously. That's __syncthreads() — a barrier that STOPS all threads until everyone arrives.

```cuda
#include <cuda_runtime.h>

__global__ void syncExample(int* data, int n) {
    __shared__ int tile[256];   // Shared memory buffer
    int tid = threadIdx.x;

    // PHASE 1: Each thread loads data into shared memory
    tile[tid] = data[tid];      // Reads from slow global memory

    // ===== THE TRAFFIC LIGHT =====
    __syncthreads();            // ALL threads MUST arrive here before ANY proceed!
    // Without __syncthreads(), Thread 0 might already be reading tile[1]
    // before Thread 1 has written to tile[1]!

    // PHASE 2: Now we can safely read from shared memory
    // (We know all threads finished writing in Phase 1)
    int neighbor = (tid > 0) ? tile[tid - 1] : 0;   // Read neighbor's data
    data[tid] = tile[tid] + neighbor;               // Compute something useful

    // ===== ANOTHER BARRIER =====
    __syncthreads();            // Separate this phase's reads from the next phase's writes
}
// Rule: If WRITING shared memory → sync → READING shared memory
// Always sync BETWEEN phases!
```

__syncthreads() rules:

  • Works within a block — Synchronizes ALL threads in the same block
  • NOT between blocks — Cannot sync across blocks in one kernel; to synchronize the whole grid, split the work into separate kernel launches (or use cooperative groups)
  • ⚠️ Must be reached by ALL threads — Never put it inside an if() that only some threads take!
  • 💀 Deadlock risk — If some threads skip __syncthreads() (e.g., inside if/else), others wait forever!
  • 🔧 Also flushes — Acts as a memory fence, ensuring shared (and global) memory writes become visible to the other threads in the block
```cuda
// ❌ WRONG — DEADLOCK! Some threads skip the sync
__global__ void deadlockBug(int* data) {
    if (threadIdx.x < 16) {   // only 16 threads enter here
        __syncthreads();      // ← DEADLOCK! Threads 16-31 never reach this!
    }
}

// ✅ CORRECT — All threads always reach the sync
__global__ void correct(int* data, bool* flags) {
    __shared__ float tile[256];
    tile[threadIdx.x] = threadIdx.x;  // Everyone writes
    __syncthreads();                  // Everyone syncs (NO conditionals around it!)
    // Now everyone can safely read any element
    data[threadIdx.x] = tile[(threadIdx.x + 1) % 256];
}
```
📋 Instructions
Simulate one step of a prefix sum (each element plus its left neighbor) using a two-phase approach that requires synchronization. Print the before and after states:

```
=== __syncthreads() Simulation ===
Phase 1: All threads load data into shared memory
Shared memory after load: 1 2 3 4 5 6 7 8
--- __syncthreads() called ---
Phase 2: Each thread adds its left neighbor
Result: 1 3 5 7 9 11 13 15
```
For the left neighbor: `result[t] = shared[t] + (t > 0 ? shared[t-1] : 0);` adds each value to the one before it. Thread 0 has no left neighbor, so it adds 0.
⚠️ Try solving it yourself first — you'll learn more!
```c
#include <stdio.h>

int main() {
    int global_data[] = {1, 2, 3, 4, 5, 6, 7, 8};
    int shared[8];
    int result[8];
    int n = 8;
    printf("=== __syncthreads() Simulation ===\n");
    printf("Phase 1: All threads load data into shared memory\n");
    for (int t = 0; t < n; t++) shared[t] = global_data[t];
    printf("Shared memory after load:");
    for (int i = 0; i < n; i++) printf(" %d", shared[i]);
    printf("\n");
    printf("--- __syncthreads() called ---\n");
    printf("Phase 2: Each thread adds its left neighbor\n");
    for (int t = 0; t < n; t++) result[t] = shared[t] + (t > 0 ? shared[t-1] : 0);
    printf("Result:");
    for (int i = 0; i < n; i++) printf(" %d", result[i]);
    printf("\n");
    return 0;
}
```