15 carefully researched questions covering GPU architecture, CUDA fundamentals, and parallel computing concepts. These are the exact topics asked in NVIDIA, Google, Meta, and AI company interviews. Score 60% or higher to pass!
Instructions
Answer all 15 questions. Review the explanations after submitting to solidify your understanding! Each question tests a key concept that interview panels look for.
Focus on GPU vs CPU differences, SIMT execution model, and CUDA toolkit components.
Try solving it yourself first; you'll learn more!
Quiz Time
1
What execution model does an NVIDIA GPU use?
A. MIMD (Multiple Instruction, Multiple Data)
B. SISD (Single Instruction, Single Data)
C. SIMT (Single Instruction, Multiple Threads)
D. MISD (Multiple Instruction, Single Data)
NVIDIA GPUs use the SIMT (Single Instruction, Multiple Threads) execution model. Groups of 32 threads called 'warps' execute the same instruction simultaneously on different data. SIMT is NVIDIA's evolution of SIMD: it adds independent thread scheduling and divergence handling. CPUs typically use MIMD. This is one of the most common NVIDIA interview questions!
2
A modern NVIDIA GPU (e.g., H100) has ~18,000 CUDA cores, while a high-end CPU (e.g., Intel i9) has ~24 cores. Why can't the GPU simply replace the CPU?
A. GPU cores are fully general-purpose and can do everything CPU cores do
B. GPU cores are simple in-order processors optimized for throughput, not complex branching or serial logic that CPUs excel at
C. GPUs lack memory so they cannot run programs
D. CPUs are always faster than GPUs at everything
GPU cores (CUDA cores / Streaming Processors) are lightweight, in-order ALUs optimized for massive parallelism and throughput. CPU cores are heavyweight out-of-order processors with deep pipelines, branch prediction, and speculative execution, ideal for complex serial workloads. Each is designed for a different type of computation. This GPU vs CPU tradeoff is fundamental to heterogeneous computing.
3
What does CUDA stand for?
A. Compute Unified Device Architecture
B. Common Universal Data Accelerator
C. Central Unified Data Architecture
D. Compute Universal Device Accelerator
CUDA stands for Compute Unified Device Architecture. It's NVIDIA's parallel computing platform and programming model introduced in 2006 with the GeForce 8800 GTX (Tesla architecture). CUDA provides C/C++ extensions that let developers write code for the GPU. It 'unified' vertex and pixel shaders into general-purpose compute cores.
4
What is a Streaming Multiprocessor (SM) in NVIDIA GPU architecture?
A. A type of GPU memory that streams data to the CPU
B. A processing cluster containing multiple CUDA cores, shared memory, registers, warp schedulers, and L1 cache
C. A single CUDA core that processes one thread
D. The bus that connects the GPU to the PCIe slot
A Streaming Multiprocessor (SM) is the fundamental compute unit of an NVIDIA GPU. Each SM contains multiple CUDA cores (e.g., 128 in Ampere), warp schedulers, a register file, shared memory, L1 cache, and special function units (SFUs). An H100 has 132 SMs. Thread blocks are assigned to SMs, and multiple blocks can run concurrently on one SM if resources allow.
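Since thread blocks run concurrently on an SM only "if resources allow," the number of resident blocks can be estimated from per-SM limits. A minimal Python sketch: the function name and the default limits (65,536 registers, 48 KB shared memory, 16 blocks per SM) are illustrative assumptions, as the real limits vary by architecture.

```python
def blocks_per_sm(regs_per_thread, threads_per_block, smem_per_block,
                  sm_regs=65536, sm_smem=49152, max_blocks=16):
    """Estimate how many blocks fit on one SM at once.

    Occupancy is capped by whichever resource runs out first:
    registers, shared memory, or the hardware block limit.
    """
    by_regs = sm_regs // (regs_per_thread * threads_per_block)
    by_smem = sm_smem // smem_per_block if smem_per_block else max_blocks
    return min(by_regs, by_smem, max_blocks)

# 256-thread blocks, 32 registers/thread, 8 KB shared memory each:
# registers allow 8 blocks, shared memory allows only 6 -> 6 resident blocks.
print(blocks_per_sm(32, 256, 8192))  # 6
```

The same reasoning is what NVIDIA's occupancy calculator automates; the limiting resource tells you what to trim when occupancy is too low.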
5
According to Amdahl's Law, if 90% of a program is parallelizable, what is the maximum theoretical speedup with infinite processors?
A. 90×
B. 100×
C. 10×
D. 9×
Amdahl's Law: Speedup = 1 / (S + P/N), where S is the serial fraction, P is the parallel fraction, and N is the number of processors. With N→∞: Speedup = 1 / S = 1 / 0.10 = 10×. Even with infinite cores, the 10% serial portion limits you to 10× speedup. This is why optimizing the serial bottleneck matters more than adding more cores. NVIDIA interviewers LOVE this question!
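The limit is easy to verify numerically. A minimal sketch (the helper name is illustrative):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's Law: speedup = 1 / (S + P/N), with S = 1 - P."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# 90% parallel: the ceiling is 1/S = 1/0.10 = 10x.
print(amdahl_speedup(0.9, 10**9))  # approaches 10.0
print(amdahl_speedup(0.9, 1000))   # ~9.91 -- already near the ceiling
```

Note how quickly diminishing returns set in: 1,000 processors already reach ~99% of the theoretical maximum.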
6
What is the warp size on all current NVIDIA GPU architectures?
A. 16 threads
B. 64 threads
C. 32 threads
D. 128 threads
The warp size has been 32 threads on every NVIDIA architecture from Tesla (2006) through Hopper (2022) and Blackwell (2024). A warp is the fundamental scheduling unit: all 32 threads in a warp execute the same instruction in lockstep. AMD GPUs use 'wavefronts' of 64 threads (or 32 in RDNA). Warp size = 32 is one of the most critical constants in CUDA optimization.
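Because the hardware always schedules whole warps, a block's thread count is effectively rounded up to a multiple of 32; leftover lanes in the last warp sit idle. A quick sketch of that round-up:

```python
WARP_SIZE = 32  # constant on every NVIDIA architecture to date

def warps_per_block(threads_per_block):
    """Warps the scheduler allocates for a block (rounds up)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE

print(warps_per_block(256))  # 8 full warps
print(warps_per_block(100))  # 4 warps -- the last has only 4 active lanes
```

This is one reason block sizes are conventionally chosen as multiples of 32 (128 and 256 are common defaults).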
7
Which CUDA toolkit component is the CUDA C/C++ compiler?
A. cuda-gdb
B. nvcc
C. nsight
D. cuBLAS
nvcc (NVIDIA CUDA Compiler) is the compiler driver that separates host (CPU) code from device (GPU) code. Device code is compiled to PTX (intermediate) and then to SASS (GPU machine code). Host code is passed to the system C++ compiler (gcc/MSVC). cuda-gdb is the debugger, Nsight is the profiling suite, and cuBLAS is a linear algebra library.
8
A GPU has 900 GB/s memory bandwidth and 1.5 GHz core clock. A CPU has 50 GB/s memory bandwidth and 5 GHz clock. Which statement is TRUE?
A. The CPU is faster at everything because it has a higher clock speed
B. The GPU excels at bandwidth-bound workloads (e.g., matrix ops), while the CPU excels at latency-sensitive serial tasks
C. Memory bandwidth doesn't affect compute performance
D. The GPU's lower clock speed means it computes slower per-core and overall
GPUs trade single-thread latency for massive throughput and memory bandwidth. The GPU's 900 GB/s bandwidth is 18× the CPU's 50 GB/s, making it dominant for data-parallel workloads like matrix multiplication, convolutions, and reductions. The CPU's 5 GHz clock and out-of-order execution make it better for serial, branch-heavy code. Understanding bandwidth vs. latency tradeoffs is essential for CUDA optimization.
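A back-of-envelope check using the figures from the question: the minimum time to stream a 1 GiB array once on each device, assuming the operation is purely bandwidth-bound (no compute or transfer overhead; an idealized sketch).

```python
def stream_time_ms(bytes_moved, bandwidth_gb_s):
    """Lower bound on runtime for a purely bandwidth-bound operation."""
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3  # seconds -> ms

one_gib = 1 << 30  # 1 GiB in bytes
print(f"GPU @ 900 GB/s: {stream_time_ms(one_gib, 900):.2f} ms")  # ~1.19 ms
print(f"CPU @  50 GB/s: {stream_time_ms(one_gib, 50):.2f} ms")   # ~21.47 ms
```

The 18× bandwidth ratio translates directly into an 18× gap for such workloads, regardless of the clock-speed difference.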
9
What is 'compute capability' in the CUDA ecosystem?
A. The total FLOPS a GPU can deliver
B. A version number (e.g., 8.0, 9.0) that defines the hardware features and instruction set supported by an NVIDIA GPU
C. The maximum number of threads a GPU can run
D. The CUDA driver version installed on the system
Compute capability is a major.minor version (e.g., Ampere A100 = 8.0, Hopper H100 = 9.0) that specifies which hardware features the GPU supports: max threads per block, shared memory size, FP64 throughput, tensor core support, etc. You compile CUDA code targeting a specific compute capability with nvcc flags like -arch=sm_80. It's NOT the same as CUDA toolkit version.
10
Which of the following is NOT a component of the CUDA Toolkit?
A. cuDNN ā Deep Neural Network library
B. cuBLAS ā GPU-accelerated BLAS library
C. TensorRT ā inference optimization engine
D. OpenCL ā cross-platform parallel framework
OpenCL is an open standard by the Khronos Group; it's NOT part of the CUDA Toolkit and works across multiple GPU vendors (NVIDIA, AMD, Intel). The CUDA Toolkit includes: nvcc compiler, cuBLAS (linear algebra), cuFFT (FFT), cuDNN (deep learning), cuRAND (random numbers), Thrust (C++ parallel algorithms), Nsight (profiling), and CUDA runtime/driver APIs. TensorRT is distributed separately but is part of the NVIDIA CUDA ecosystem.
11
In which scenario would a CPU likely OUTPERFORM a GPU?
A. Multiplying two 4096×4096 matrices
B. Training a deep learning model on a large dataset
C. Executing a complex recursive algorithm with heavy branching and small data
D. Applying a convolution filter across a 4K image
CPUs outperform GPUs on workloads that are serial, branch-heavy, have irregular memory access patterns, or operate on small datasets where the overhead of transferring data to the GPU exceeds the compute benefit. Recursive algorithms with heavy branching cause warp divergence on GPUs and don't expose enough parallelism. Matrix multiplication, deep learning, and image processing are classic GPU-friendly workloads.
12
Match the NVIDIA GPU architecture to its generation: The architecture that introduced Tensor Cores for AI acceleration.
A. Pascal (2016)
B. Volta (2017)
C. Turing (2018)
D. Fermi (2010)
Volta (2017, V100) was the first architecture to introduce Tensor Cores: specialized hardware for mixed-precision matrix multiply-accumulate (MMA) operations that dramatically accelerated deep learning. Pascal had no tensor cores. Turing added RT cores for ray tracing alongside 2nd-gen Tensor Cores. Ampere (3rd-gen), Hopper (4th-gen with FP8), and Blackwell continue the evolution.
13
What is the primary purpose of the L2 cache on an NVIDIA GPU?
A. It stores the operating system kernel code
B. It acts as a shared cache across all SMs to reduce global memory access latency and bandwidth pressure
C. It replaces shared memory in modern architectures
D. It is only used for texture data
The L2 cache sits between the SMs and global (HBM/GDDR) memory. It's shared across ALL SMs and caches frequently accessed global memory data to reduce off-chip memory traffic. The H100 has 50 MB of L2 cache. Each SM also has its own L1 cache (shared with shared memory in configurable partitions). L2 does NOT replace shared memory; they serve different roles in the memory hierarchy.
14
You have a task that processes 10 million independent pixels. Each pixel operation takes 100 nanoseconds on a CPU core. On a GPU with 10,000 CUDA cores (with overhead), each pixel takes 500ns but all pixels run in parallel batches. Which approach is faster overall?
A. CPU ā because each pixel is processed faster (100ns vs 500ns per pixel)
B. GPU ā despite higher per-element latency, massive parallelism gives far greater throughput than a single CPU core
C. They're about the same
D. Cannot be determined without knowing the memory bandwidth
CPU (1 core): 10,000,000 × 100ns = 1 second. GPU (10,000 cores): 10,000,000 / 10,000 batches × 500ns = 1,000 × 500ns = 0.5 ms. The GPU is ~2000× faster despite each core being 5× slower per element. This illustrates the throughput vs. latency tradeoff: GPUs hide latency with massive parallelism. This is why GPUs dominate image processing, deep learning, and scientific computing.
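The same arithmetic written out as a script (variable names are illustrative; the model ignores transfer overhead, as the question does):

```python
PIXELS = 10_000_000
CPU_NS_PER_PIXEL = 100
GPU_NS_PER_PIXEL = 500
GPU_CORES = 10_000

# CPU, single core: every pixel is processed serially.
cpu_total_s = PIXELS * CPU_NS_PER_PIXEL * 1e-9      # 1.0 s
# GPU: pixels run in batches of GPU_CORES at a time.
batches = PIXELS // GPU_CORES                       # 1,000 batches
gpu_total_s = batches * GPU_NS_PER_PIXEL * 1e-9     # 0.0005 s (0.5 ms)

print(f"speedup: {cpu_total_s / gpu_total_s:.0f}x") # ~2000x
```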
15
Which of the following correctly describes the CUDA execution hierarchy from smallest to largest unit?
A. Grid → Warp → Thread Block → Thread
B. Thread → Warp → Thread Block → Grid
C. Warp → Thread → Thread Block → Grid
D. Thread → Thread Block → Warp → Grid
The CUDA execution hierarchy from smallest to largest: Thread (runs on one CUDA core) → Warp (32 threads executing in lockstep) → Thread Block (up to 1024 threads, assigned to one SM, shares shared memory) → Grid (all thread blocks launched by one kernel call). A CUDA core executes one thread's instruction per clock cycle. Understanding this hierarchy is fundamental to writing efficient CUDA code.
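In CUDA C++, each thread locates itself in this hierarchy with the built-in variables blockIdx, blockDim, and threadIdx, most commonly via the idiom blockIdx.x * blockDim.x + threadIdx.x. A Python sketch of that index math and the usual round-up grid sizing:

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """Mirrors the CUDA idiom: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

n = 10_000        # elements the kernel must cover
block_dim = 256   # threads per block (a common choice, multiple of 32)

# Round up so the grid covers all n elements; surplus threads would
# be masked off in the kernel with an `if (i < n)` guard.
grid_dim = (n + block_dim - 1) // block_dim
print(grid_dim)                            # 40 blocks
print(global_thread_id(3, block_dim, 17))  # 785
```

Thread 17 of block 3 is therefore the 786th thread of the grid (index 785), which is the element it would process in a typical one-thread-per-element kernel.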