CUDA Programming: Sync or Chaos!
Exercise 22

__syncthreads() (20 XP, Medium)


🚦 Chapter 5, Part 2: __syncthreads() — The Traffic Light

💡 Story: Imagine all soldiers must cross a bridge, but it can only hold 10 at a time. The General shouts STOP — everyone waits. When all are assembled, the General shouts GO — everyone crosses simultaneously. That's __syncthreads() — a barrier that STOPS all threads until everyone arrives.

```cuda
#include <cuda_runtime.h>

__global__ void syncExample(int* data, int n) {
    __shared__ int tile[256];   // Shared memory buffer
    int tid = threadIdx.x;

    // PHASE 1: Each thread loads data into shared memory
    tile[tid] = data[tid];      // Reads from slow global memory

    // ===== THE TRAFFIC LIGHT =====
    __syncthreads();            // ALL threads MUST arrive here before ANY proceed!
    // Without __syncthreads(), Thread 0 might already be reading tile[1]
    // before Thread 1 has written to tile[1]!

    // PHASE 2: Now we can safely read from shared memory
    // (We know all threads finished writing in Phase 1)
    int neighbor = (tid > 0) ? tile[tid - 1] : 0;   // Read neighbor's data
    data[tid] = tile[tid] + neighbor;               // Compute something useful

    // ===== ANOTHER BARRIER =====
    __syncthreads();            // Separate this phase's reads from the next phase's writes
}
// Rule: If WRITING shared memory → sync → READING shared memory
// Always sync BETWEEN phases!
```

__syncthreads() rules:

  • Works within a block — Synchronizes ALL threads in the same block
  • NOT between blocks — Cannot sync across blocks in one kernel; to synchronize the whole grid, split the work into separate kernel launches (or use cooperative groups)
  • ⚠️ Must be reached by ALL threads — Never put it inside an if() that only some threads take!
  • 💀 Deadlock risk — If some threads skip __syncthreads() (e.g., inside if/else), others wait forever!
  • 🔧 Also flushes — Acts as a memory fence, ensuring shared (and global) memory writes become visible to the other threads in the block
```cuda
// ❌ WRONG — DEADLOCK! Some threads skip the sync
__global__ void deadlockBug(int* data) {
    if (threadIdx.x < 16) {   // only 16 threads enter here
        __syncthreads();      // ← DEADLOCK! Threads 16-31 never reach this!
    }
}

// ✅ CORRECT — All threads always reach the sync
__global__ void correct(int* data, bool* flags) {
    __shared__ float tile[256];
    tile[threadIdx.x] = threadIdx.x;  // Everyone writes
    __syncthreads();                  // Everyone syncs (NO conditionals around it!)
    // Now everyone can safely read any element
    data[threadIdx.x] = tile[(threadIdx.x + 1) % 256];
}
```
📋 Instructions
Simulate one step of a prefix sum (each element plus its left neighbor) using a two-phase approach that requires synchronization. Print the before and after states:

```
=== __syncthreads() Simulation ===
Phase 1: All threads load data into shared memory
Shared memory after load: 1 2 3 4 5 6 7 8
--- __syncthreads() called ---
Phase 2: Each thread adds its left neighbor
Result: 1 3 5 7 9 11 13 15
```
For the left neighbor: `result[t] = shared[t] + (t > 0 ? shared[t-1] : 0);` adds each value to the one before it. Thread 0 has no left neighbor, so it adds 0.
⚠️ Try solving it yourself first — you'll learn more!
```c
#include <stdio.h>

int main() {
    int global_data[] = {1, 2, 3, 4, 5, 6, 7, 8};
    int shared[8];
    int result[8];
    int n = 8;
    printf("=== __syncthreads() Simulation ===\n");
    printf("Phase 1: All threads load data into shared memory\n");
    for (int t = 0; t < n; t++) shared[t] = global_data[t];
    printf("Shared memory after load:");
    for (int i = 0; i < n; i++) printf(" %d", shared[i]);
    printf("\n");
    printf("--- __syncthreads() called ---\n");
    printf("Phase 2: Each thread adds its left neighbor\n");
    for (int t = 0; t < n; t++) result[t] = shared[t] + (t > 0 ? shared[t-1] : 0);
    printf("Result:");
    for (int i = 0; i < n; i++) printf(" %d", result[i]);
    printf("\n");
    return 0;
}
```