CUDA Programming Threads, Blocks & Grids
Exercise 15

🧱 Quiz: Thread Hierarchy — 20 XP · Medium


🏆 Chapter 3 Quiz — Thread Hierarchy Mastery!

15 questions on CUDA's thread organization model. This is one of the most commonly tested areas in GPU programming interviews!

📋 Instructions
Answer all 15 questions about threads, blocks, grids, and warps.
Remember: warp = 32 threads, max 1024 threads/block, dim3 for multi-dimensional.
⚠️ Try solving it yourself first — you'll learn more!
🧠 Quiz Time
1
What is the correct hierarchy of CUDA thread organization, from smallest to largest?
A) Block → Thread → Grid → Warp
B) Thread → Warp → Block → Grid
C) Warp → Thread → Block → Grid
D) Thread → Block → Warp → Grid
The CUDA thread hierarchy from smallest to largest is: Thread → Warp (32 threads) → Block (up to 1024 threads) → Grid (all blocks in a kernel launch). A warp is a hardware scheduling unit of 32 consecutive threads within a block, and a grid is the collection of all blocks launched by a single kernel invocation.
2
How many threads are in a single warp on all current NVIDIA GPU architectures?
A) 16
B) 32
C) 64
D) 128
A warp consists of exactly 32 threads on every NVIDIA GPU architecture to date (from Fermi through Hopper and beyond). These 32 threads are scheduled together on a Streaming Multiprocessor (SM) and, in the classic SIMT model, execute the same instruction in lockstep (since Volta, independent thread scheduling relaxes strict lockstep, but the warp remains the scheduling unit). The warp is the fundamental unit of execution scheduling on NVIDIA GPUs.
3
What is the maximum number of threads allowed per block on modern NVIDIA GPUs (Compute Capability ≥ 2.0)?
A) 256
B) 512
C) 1024
D) 2048
Since Compute Capability 2.0 (Fermi architecture and later), the maximum number of threads per block is 1024. This means a single block can contain up to 1024 / 32 = 32 warps. Earlier architectures (CC 1.x) had a limit of 512 threads per block. This is a very commonly tested fact in CUDA interviews.
4
For a 1D array of 1000 elements processed with 256 threads per block, how many blocks must be launched?
A) 3
B) 4
C) 3.90625
D) 5
The number of blocks needed is ceil(1000 / 256) = ceil(3.90625) = 4. You must round up: 3 blocks would cover only 768 elements, leaving 232 unprocessed. The last block still launches a full 256 threads, but only 1000 - 3×256 = 232 of them do useful work; the remaining 24 are idle (you guard them against out-of-bounds access with an if-check). The common idiom is: int blocks = (N + blockSize - 1) / blockSize;
5
What is the correct formula to compute the global thread index in a 2D grid for the row direction?
A) row = blockIdx.x * blockDim.x + threadIdx.x
B) row = blockIdx.y * blockDim.y + threadIdx.y
C) row = gridDim.y * blockDim.y + threadIdx.y
D) row = blockIdx.y * gridDim.y + threadIdx.y
In a 2D grid, the row index is computed along the y-dimension: row = blockIdx.y * blockDim.y + threadIdx.y. Similarly, the column index uses the x-dimension: col = blockIdx.x * blockDim.x + threadIdx.x. This convention maps naturally to matrix operations where rows go vertically (y) and columns go horizontally (x). This is one of the most frequently asked indexing questions in interviews.
6
What is the purpose of the dim3 struct in CUDA?
A) It defines the size of shared memory
B) It specifies 3D dimensions for grid and block configurations
C) It allocates 3-dimensional arrays in device memory
D) It sets the number of registers per thread
The dim3 struct is a CUDA built-in type used to specify multi-dimensional grid and block sizes in a kernel launch: kernel<<<gridDim, blockDim>>>(...). It has three fields: x, y, and z, with unspecified dimensions defaulting to 1. For example, dim3 block(16, 16) creates a 2D block of 16×16 = 256 threads, with z implicitly set to 1. This allows natural mapping of threads to 2D/3D data structures.
7
A kernel is launched with dim3 block(8, 8, 4). How many threads are in each block?
A) 64
B) 128
C) 256
D) 512
The total number of threads per block is blockDim.x × blockDim.y × blockDim.z = 8 × 8 × 4 = 256. When using 3D block dimensions, all three components multiply together to give the total thread count. This must not exceed 1024 (the maximum threads per block on modern GPUs). Here 256 ≤ 1024, so this is a valid configuration.
8
Which of the following statements about threads across different blocks is TRUE?
A) Threads in different blocks can synchronize using __syncthreads()
B) Threads in different blocks can share data via shared memory
C) Threads in different blocks cannot synchronize within a single kernel launch (without cooperative groups)
D) Threads in different blocks always execute on the same SM
Threads in different blocks cannot synchronize with each other during a kernel execution using standard CUDA primitives. __syncthreads() only synchronizes threads within the same block, and shared memory is only visible within a block. Cross-block synchronization traditionally requires launching a new kernel (kernel boundary acts as an implicit global barrier). CUDA 9+ introduced Cooperative Groups with grid-level sync, but this is a special opt-in feature, not the default behavior.
9
A kernel is launched with <<<4, 256>>>. How many warps are created in total?
A) 4
B) 8
C) 16
D) 32
Total threads = 4 blocks × 256 threads/block = 1024 threads. Each warp contains 32 threads, so total warps = 1024 / 32 = 32 warps. Each block of 256 threads contains 256/32 = 8 warps, and there are 4 blocks, giving 4 × 8 = 32 warps total. Understanding warp counts is critical for reasoning about occupancy and performance.
10
What happens to extra threads in the last block when the data size is not a multiple of the block size?
A) They cause a runtime error
B) CUDA automatically disables them
C) They execute but the programmer must guard against out-of-bounds access
D) The kernel refuses to launch
CUDA always launches complete blocks — there is no such thing as a partial block. If you have 1000 elements and 256 threads per block, the 4th block will have 256 threads but only 232 elements to process. The extra 24 threads will still execute, so the programmer MUST add a bounds check like: if (idx < N) { ... }. Without this guard, those threads would access invalid memory, causing undefined behavior or crashes.
11
What is warp divergence?
A) When warps from different blocks execute simultaneously
B) When threads within the same warp take different execution paths at a branch
C) When a warp is split across two different SMs
D) When warp size changes dynamically at runtime
Warp divergence occurs when threads within the same warp follow different paths at a conditional branch (e.g., an if-else statement). Since all 32 threads in a warp execute in lockstep (SIMT model), both paths must be executed serially — threads not on the active path are masked/disabled. This effectively reduces parallelism and can significantly hurt performance. Minimizing warp divergence is a key CUDA optimization strategy.
12
How are blocks scheduled onto Streaming Multiprocessors (SMs)?
A) All blocks are assigned to a single SM
B) Blocks are distributed across SMs by the hardware scheduler; multiple blocks can run on one SM
C) Each block must run on its own dedicated SM
D) The programmer explicitly assigns blocks to specific SMs
The GPU's hardware block scheduler distributes blocks across available SMs automatically. Multiple blocks can be resident on a single SM simultaneously (limited by the SM's resources: registers, shared memory, max threads, max blocks). The programmer has no direct control over which SM a block runs on. This automatic scheduling is what gives CUDA its scalability — the same code runs on GPUs with different SM counts.
13
What is occupancy in the context of CUDA programming?
A) The percentage of global memory being used
B) The ratio of active warps to the maximum number of warps an SM can support
C) The number of kernels running simultaneously
D) The percentage of threads that perform useful work
Occupancy is defined as the ratio of active warps per SM to the maximum number of warps the SM can support. For example, if an SM supports up to 64 warps and your kernel configuration results in 32 active warps on that SM, the occupancy is 32/64 = 50%. Higher occupancy generally helps hide memory latency through warp switching, though maximum occupancy doesn't always yield maximum performance. Factors limiting occupancy include registers per thread, shared memory per block, and threads per block.
14
A 2D grid is launched with dim3 grid(10, 8) and dim3 block(16, 16). What is the total number of threads?
A) 10,240
B) 20,480
C) 40,960
D) 80,000
Total threads = (gridDim.x × gridDim.y) × (blockDim.x × blockDim.y) = (10 × 8) × (16 × 16) = 80 blocks × 256 threads/block = 20,480 threads. In a 2D configuration, the total block count is the product of the grid's x and y dimensions, and the total threads per block is the product of the block's x and y dimensions. This 2D layout is commonly used for image processing and matrix operations.
15
Which built-in variable gives the dimensions (number of blocks) of the grid?
A) blockDim
B) gridDim
C) threadIdx
D) blockIdx
gridDim is the built-in dim3 variable that contains the dimensions of the grid (i.e., how many blocks in each dimension). blockDim gives the dimensions of each block (threads per block in each dimension). threadIdx gives the thread's index within its block, and blockIdx gives the block's index within the grid. These four built-in variables — threadIdx, blockIdx, blockDim, gridDim — are the foundation of all CUDA thread indexing and are critical to master for interviews.