15 tricky questions on race conditions, barriers, and atomics. Interviewers LOVE these because they reveal deep understanding of parallel programming!
📋 Instructions
Answer all 15 questions on synchronization primitives and parallel hazards.
__syncthreads = block-level barrier. Atomics = indivisible read-modify-write operations. Never put __syncthreads in divergent branches!
⚠️ Try solving it yourself first — you'll learn more!
🧠 Quiz Time
1
What is a race condition in CUDA?
A) A compiler error when two kernels are launched simultaneously
B) When multiple threads access shared data concurrently and the result depends on execution order
C) When a GPU runs faster than the CPU can supply data
D) A deadlock caused by too many threads in a block
A race condition occurs when multiple threads read and write shared data concurrently without proper synchronization, and the final result depends on the nondeterministic order in which the threads execute. This is a fundamental parallel programming hazard — not a compiler error, not a speed issue, and not the same thing as a deadlock.
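The lost-update race can be made concrete with a minimal sketch. The kernel names below are illustrative, not from any real codebase:

```cuda
// A classic lost-update race vs. its atomic fix.
__global__ void racy_count(int *counter) {
    // This read-modify-write is three separate steps; two threads can
    // both read the same old value, and one increment is then lost.
    *counter = *counter + 1;          // DATA RACE
}

__global__ void safe_count(int *counter) {
    // atomicAdd performs the read-modify-write as one indivisible step.
    atomicAdd(counter, 1);
}
```

Launched with many threads, `racy_count` typically produces a total far below the thread count, while `safe_count` always yields the exact count.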
2
What is the scope of __syncthreads()?
A) It synchronizes all threads across the entire GPU
B) It synchronizes all threads within a single warp
C) It synchronizes all threads within a single thread block
D) It synchronizes all threads within a single SM
__syncthreads() is a block-level barrier. It ensures that all threads within the same thread block have reached the barrier before any thread in the block proceeds past it. It does NOT synchronize across different blocks — there is no built-in mechanism to synchronize all blocks on the GPU (except kernel completion or cooperative groups). Warp-level sync uses different primitives like __syncwarp().
3
What happens if __syncthreads() is placed inside a conditional branch where only some threads in the block enter?
A) Only the threads that enter the branch are synchronized
B) The compiler automatically moves __syncthreads() outside the branch
C) Undefined behavior — the program may deadlock
D) A runtime error is thrown and the kernel terminates
If __syncthreads() is placed in a conditional branch that not all threads in the block take, the behavior is UNDEFINED. Threads that reach __syncthreads() will wait forever for the threads that never reach it, potentially causing a deadlock or producing incorrect results. The CUDA programming guide explicitly states that __syncthreads() must be reached by ALL threads in the block, or the result is undefined behavior. The compiler does not fix this for you.
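A minimal sketch of the hazard and the standard fix (hypothetical kernels for illustration):

```cuda
// BROKEN: the barrier sits inside a divergent branch.
__global__ void broken(const float *data) {
    if (threadIdx.x < 16) {
        // Only 16 threads reach this barrier; the rest never do.
        __syncthreads();              // UNDEFINED BEHAVIOR (may hang)
    }
}

// FIXED: diverge on the work, but keep the barrier uniform.
__global__ void fixed(const float *data, float *out) {
    float v = 0.0f;
    if (threadIdx.x < 16)
        v = data[threadIdx.x];        // only some threads do the work
    __syncthreads();                  // reached by ALL threads in the block
    out[threadIdx.x] = v;
}
```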
4
Which atomic operation can be used to implement ANY other atomic operation (e.g., atomicAdd for doubles)?
A) atomicExch
B) atomicAdd
C) atomicCAS (Compare-And-Swap)
D) atomicMin
atomicCAS (Compare-And-Swap) is the universal building block for custom atomic operations. To implement atomicAdd for doubles (which wasn't natively supported before sm_60), you use a CAS loop: read the old value, compute the desired new value, then atomicCAS to swap only if the value hasn't changed. If another thread modified it in between, retry. atomicExch can only unconditionally replace values — it can't conditionally update, which is why CAS is the fundamental primitive.
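The double-precision CAS loop, adapted from the well-known pattern in the CUDA C Programming Guide, looks like this:

```cuda
// atomicAdd for double via a CAS loop (needed on devices before sm_60).
// The bit-reinterpret intrinsics let us CAS on the 64-bit representation.
__device__ double atomicAddDouble(double *address, double val) {
    unsigned long long *addr = (unsigned long long *)address;
    unsigned long long old = *addr, assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + val); succeeds only if *addr
        // still holds the value we read.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);         // another thread won the race: retry
    return __longlong_as_double(old); // same "return old value" convention
}
```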
5
What is the main performance cost of atomic operations?
A) They consume extra registers per thread
B) They serialize access to the same memory location, reducing parallelism
C) They require double the memory bandwidth
D) They force all warps in the grid to synchronize
Atomic operations serialize access to the same memory location — when many threads atomically update the same address, they must take turns one at a time. This destroys parallelism and creates a bottleneck. The severity depends on contention: if 1000 threads all call atomicAdd on the same counter, they effectively execute sequentially. Techniques like privatization (per-thread/per-block partial results) reduce contention. Atomics don't consume extra registers or force grid-wide synchronization.
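Privatization can be sketched with a shared-memory histogram (a hypothetical kernel, assuming 256 bins and byte-valued input):

```cuda
// Privatization: accumulate into a per-block shared-memory histogram,
// then flush once per bin per block instead of once per input element.
#define BINS 256
__global__ void histogram(const unsigned char *in, int n,
                          unsigned int *global_hist) {
    __shared__ unsigned int local[BINS];
    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Grid-stride loop; atomic contention stays within the block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[in[i]], 1u);
    __syncthreads();

    // Only BINS global atomics per block, regardless of input size.
    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        atomicAdd(&global_hist[i], local[i]);
}
```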
6
What is the difference between cudaDeviceSynchronize() and __syncthreads()?
A) They are identical — both wait for all GPU threads to finish
B) cudaDeviceSynchronize() is called from host and waits for all device work; __syncthreads() is called from device and synchronizes threads within a block
C) cudaDeviceSynchronize() syncs threads within a block; __syncthreads() syncs the host with the device
D) cudaDeviceSynchronize() is deprecated and replaced by __syncthreads()
cudaDeviceSynchronize() is a HOST-side API call that blocks the CPU thread until ALL previously issued CUDA operations (kernels, memcpys) on the device have completed. __syncthreads() is a DEVICE-side intrinsic called within a kernel that acts as a barrier for threads within the same block. They operate at completely different levels: host-device synchronization vs. intra-block thread synchronization.
7
What does __threadfence() do?
A) It acts as a barrier that blocks all threads in the grid until they reach the fence
B) It ensures that all writes by the calling thread to global and shared memory are visible to all other threads in the device before the thread continues
C) It prevents the compiler from reordering any instructions in the kernel
D) It flushes the L2 cache to guarantee memory coherence
__threadfence() is a memory fence (not a barrier). It ensures that all writes to global and shared memory made by the calling thread before the fence are visible to all other threads on the device before the calling thread's subsequent memory accesses. It does NOT block other threads or act as a synchronization barrier — only the calling thread waits. __threadfence_block() provides the same guarantee but only within the block, and __threadfence_system() extends visibility to the host and other devices.
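A typical use is a publish/subscribe pattern: write the data, fence, then set a flag. This sketch assumes hypothetical `payload`/`flag` globals:

```cuda
// Fence ordering: a reader that observes flag == 1 is guaranteed
// to also observe the payload written before the fence.
__device__ float payload;
__device__ int flag = 0;

__global__ void producer(float v) {
    payload = v;            // 1. write the data
    __threadfence();        // 2. order it before the flag, device-wide
    atomicExch(&flag, 1);   // 3. publish
}
```

Without the fence, the flag write could become visible before the payload write, and a consumer could read stale data.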
8
Which warp-level primitive returns a bitmask where each bit indicates whether the corresponding lane's predicate is true?
A) __shfl_sync()
B) __any_sync()
C) __ballot_sync()
D) __all_sync()
__ballot_sync(mask, predicate) evaluates the predicate for each active thread in the warp (specified by mask) and returns a 32-bit integer where bit i is set if thread i's predicate was non-zero. __any_sync returns 1 if ANY thread's predicate is true. __all_sync returns 1 if ALL threads' predicates are true. __shfl_sync exchanges values between lanes, not predicates. __ballot_sync is the one that gives you the full per-lane bitmask.
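A small sketch of ballot plus popcount, assuming a full 32-thread warp:

```cuda
// Count how many lanes in each warp hold a positive value.
__global__ void count_positive(const float *x, int *warp_counts) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned mask = __ballot_sync(0xFFFFFFFFu, x[tid] > 0.0f);
    // mask: bit i is set iff lane i's predicate was true.
    if ((threadIdx.x & 31) == 0)                  // one writer per warp
        warp_counts[tid / 32] = __popc(mask);     // population count
}
```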
9
When do you NOT need synchronization between threads?
A) When each thread reads and writes only to its own unique memory location
B) When threads write to the same global memory location
C) When threads share data through shared memory in a two-phase pattern
D) When threads in the same warp need to exchange values
If every thread accesses completely independent memory locations (no shared data), there are no data races and no synchronization is needed. This is the ideal parallel pattern. Options B, C, and D all involve shared data access: writing to the same location (B) needs atomics or coordination, shared memory multi-phase access (C) needs __syncthreads() between phases, and warp-level data exchange (D) uses shuffle intrinsics which are themselves synchronization primitives.
10
What does the 'volatile' keyword do to a variable in CUDA device code?
A) It makes the variable atomic — all accesses are thread-safe
B) It forces every read/write to go to memory instead of being cached in registers, ensuring visibility to other threads
C) It prevents the variable from being modified by any thread
D) It allocates the variable in constant memory for faster read access
Declaring a variable as 'volatile' in CUDA tells the compiler that the value may be changed by other threads, so every access must go directly to memory (shared or global) rather than being optimized into a register. This ensures other threads see the latest written value. However, volatile does NOT provide atomicity — two threads can still race. It also doesn't make the variable read-only or move it to constant memory. It's commonly used in warp-level programming patterns to avoid register caching issues.
11
What is a read-after-write (RAW) hazard in CUDA?
A) A thread reads a memory location before another thread's write to that location has completed, getting a stale value
B) Two threads write to the same location at the same time
C) A thread reads from uninitialized memory
D) The host reads GPU memory before the kernel has finished
A RAW hazard occurs when Thread B reads a memory location that Thread A is supposed to write to, but Thread B reads BEFORE Thread A's write is visible. Thread B thus gets a stale/old value instead of the updated one. This is the most common data hazard in parallel programming. Option B describes a write-write conflict, Option C is uninitialized memory access, and Option D is a host-device sync issue (solved by cudaDeviceSynchronize). RAW hazards are resolved with proper barriers or memory fences.
12
What does atomicExch(address, val) return?
A) The new value (val) that was written
B) The old value that was previously stored at the address
C) 0 on success, -1 on failure
D) A boolean indicating whether the swap happened
atomicExch(address, val) atomically stores val at *address and returns the OLD value that was previously stored there. This 'return old value' pattern is consistent across all CUDA atomic operations (atomicAdd, atomicMin, atomicMax, atomicCAS, etc.) — they all return the old value found at the address. This is critical for algorithms like lock-free data structures where you need to know what the previous state was. It doesn't return a success/failure code.
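The returned old value is what makes patterns like "elect exactly one winner" possible. A minimal illustrative sketch (hypothetical kernel, `ticket` initialized to 0):

```cuda
// Exactly one thread observes the old value 0 and becomes the winner;
// every later thread gets back the 1 already stored there.
__global__ void elect(int *ticket, int *winner) {
    int old = atomicExch(ticket, 1);   // returns the PREVIOUS value
    if (old == 0)
        *winner = blockIdx.x * blockDim.x + threadIdx.x;
}
```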
13
What is the difference between __threadfence_block() and __threadfence()?
A) __threadfence_block() is faster because it only guarantees visibility within the same block; __threadfence() guarantees visibility across all blocks on the device
B) __threadfence_block() synchronizes all threads in a block; __threadfence() synchronizes all threads on the device
C) There is no difference — they are aliases for the same operation
D) __threadfence_block() operates on shared memory only; __threadfence() operates on global memory only
__threadfence_block() ensures that all writes by the calling thread are visible to all threads within the same block before the thread continues. __threadfence() provides a stronger guarantee: visibility to ALL threads on the device (across all blocks). Because __threadfence_block() has a smaller scope, it is cheaper/faster. Neither one is a barrier — they don't block other threads, they only ensure ordering for the calling thread's own memory operations. And both order the calling thread's accesses to global and shared memory; the difference between them is visibility scope, not memory space.
14
In cooperative groups (CUDA 9+), how can you synchronize ALL threads across the entire grid?
A) Call __syncthreads() from every block — it automatically extends to grid scope
B) Use cooperative_groups::this_grid().sync() after launching the kernel with cudaLaunchCooperativeKernel
C) Call cudaDeviceSynchronize() inside the kernel
D) Use atomicAdd on a global counter until it reaches gridDim.x * blockDim.x
Cooperative groups introduced in CUDA 9 provide grid-level synchronization. You create a grid group with cooperative_groups::this_grid() and call .sync() on it. The kernel must be launched with cudaLaunchCooperativeKernel (not the <<<>>> syntax) and the GPU must support it. __syncthreads() NEVER extends beyond a single block. cudaDeviceSynchronize() is a host-side function and cannot be called inside a kernel. Using atomicAdd as a manual barrier is fragile, non-portable, and doesn't provide proper memory ordering guarantees.
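In device code the grid barrier looks like this (a sketch; the kernel name and buffer are illustrative, and the launch must use cudaLaunchCooperativeKernel on a device that reports cudaDevAttrCooperativeLaunch):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two grid-wide phases separated by a grid-level barrier.
__global__ void two_phase(float *buf) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] *= 2.0f;        // phase 1: every thread in the grid writes
    grid.sync();           // barrier across ALL blocks in the grid
    // phase 2: safe to read any element written in phase 1
}
```

Note that all blocks must be resident on the GPU simultaneously for grid.sync() to make progress, which caps the grid size.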
15
A kernel uses shared memory in two phases: Phase 1 writes data, Phase 2 reads it. What is the correct synchronization pattern?
A) No synchronization needed — shared memory is coherent within a block
B) Place __syncthreads() between Phase 1 and Phase 2 to ensure all writes complete before any reads begin
C) Use atomicAdd for every shared memory write in Phase 1
D) Use __threadfence() between Phase 1 and Phase 2
The standard pattern for multi-phase shared memory usage is to place __syncthreads() between the write phase and the read phase. This guarantees that ALL threads in the block have finished writing before ANY thread starts reading — preventing RAW hazards. Shared memory is NOT automatically coherent across threads without a barrier; without __syncthreads(), a fast thread could read before a slow thread has written. Atomics would be massive overkill and slow. __threadfence() only ensures ordering for the calling thread — it doesn't wait for other threads to finish their writes.
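The pattern in miniature, using an in-block reversal as the example (assumes blockDim.x == 256):

```cuda
// Two-phase shared-memory use: write, barrier, read.
__global__ void reverse_block(float *data) {
    __shared__ float tile[256];
    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * 256 + t];       // phase 1: every thread writes
    __syncthreads();                            // all writes before any read
    data[blockIdx.x * 256 + t] = tile[255 - t]; // phase 2: read another slot
}
```

Each thread reads a slot written by a different thread, so omitting the barrier would be a textbook RAW hazard.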