15 carefully researched questions on kernel launches, memory management, and thread indexing. These are real NVIDIA interview topics. Score 60% to pass!
📋 Instructions
Answer all 15 questions. The explanations reveal key interview insights!
Focus on the global thread ID formula, memory transfer directions, and kernel launch syntax.
⚠️ Try solving it yourself first — you'll learn more!
🧠 Quiz Time
1
What does the __global__ qualifier indicate in CUDA?
A. The function runs on the CPU and is callable from the GPU
B. The function runs on the GPU and is callable from the host (CPU) code
C. The function runs on the CPU and is callable from other CPU functions only
D. The function is a global variable accessible from both host and device
The __global__ qualifier declares a kernel function that executes on the GPU (device) but is callable from host (CPU) code. It must return void and cannot be called from other __global__ functions (unless you use CUDA Dynamic Parallelism, available from Compute Capability 3.5 onward). This is the most fundamental CUDA concept — every CUDA interview starts here.
2
What does the __device__ function qualifier do?
A. Declares a function that runs on the host and is callable from the device
B. Declares a function that runs on the device and is callable only from other device or __global__ functions
C. Declares a function that runs on both host and device simultaneously
D. Declares a variable stored in device global memory
__device__ functions execute on the GPU and can only be called from __global__ (kernel) functions or other __device__ functions. They cannot be called directly from host code. Think of them as GPU-side helper functions. The compiler inlines them aggressively for performance. In contrast, __host__ functions run on the CPU and are the default for any function without a qualifier.
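A minimal sketch of how the two qualifiers fit together (the helper name square and kernel name squareAll are illustrative):

```cuda
// __device__ helper: callable only from GPU code
__device__ float square(float x) {
    return x * x;
}

// __global__ kernel: runs on the GPU, launched from the host
__global__ void squareAll(float *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        data[tid] = square(data[tid]);  // legal: device-to-device call
    }
}
```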
3
What is the correct syntax to launch a CUDA kernel named 'vecAdd' with 256 blocks of 512 threads each?
A. vecAdd<<<256, 512>>>(args);
B. vecAdd(256, 512, args);
C. cuda_launch(vecAdd, 256, 512, args);
D. vecAdd<<<512, 256>>>(args);
The kernel launch syntax is: kernel<<<numBlocks, threadsPerBlock>>>(arguments). So vecAdd<<<256, 512>>>(args) launches 256 blocks with 512 threads each, for a total of 256 × 512 = 131,072 threads. The first parameter inside <<< >>> is the grid dimension (number of blocks), and the second is the block dimension (threads per block). Option D reverses the order — a common interview trap!
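A sketch of the launch in context (the kernel body and argument names are illustrative):

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host-side launch: grid size (blocks) first, block size (threads) second.
void launchVecAdd(const float *d_a, const float *d_b, float *d_c, int n) {
    vecAdd<<<256, 512>>>(d_a, d_b, d_c, n);  // 256 blocks x 512 threads

    // Equivalent form with dim3, which extends naturally to 2D/3D grids:
    // dim3 grid(256), block(512);
    // vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);
}
```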
4
Given a kernel launched with <<<4, 256>>>, what is the correct formula to compute a unique global thread ID for a 1D grid?
A. threadIdx.x + blockIdx.x
B. blockIdx.x * blockDim.x + threadIdx.x
C. threadIdx.x * blockDim.x + blockIdx.x
D. blockIdx.x * gridDim.x + threadIdx.x
The global thread ID formula is: globalId = blockIdx.x * blockDim.x + threadIdx.x. For block 0: IDs are 0–255. For block 1: 1*256+0=256 to 1*256+255=511. And so on. blockIdx.x is the block index, blockDim.x is the number of threads per block (256 here), and threadIdx.x is the thread's local index within its block. This is the single most important formula in CUDA — you'll be asked this in every CUDA interview.
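The formula in kernel code, with the <<<4, 256>>> launch from the question (the kernel name is illustrative):

```cuda
__global__ void writeGlobalId(int *out, int n) {
    // Unique 1D global index: block offset + local thread index
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId < n) out[globalId] = globalId;
}
// With <<<4, 256>>>: block 0 covers IDs 0..255, block 1 covers 256..511,
// block 2 covers 512..767, block 3 covers 768..1023.
```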
5
A kernel is launched with <<<8, 128>>>. How many total threads are created?
A. 8
B. 128
C. 136
D. 1024
Total threads = gridDim (number of blocks) × blockDim (threads per block) = 8 × 128 = 1,024 threads. This is a common warm-up question in CUDA interviews. Remember: the first parameter is the number of blocks, the second is threads per block. The total thread count determines how many elements you can process in parallel (before needing grid-stride loops).
6
What is the correct way to allocate 1000 floats in GPU global memory?
A. float *d_arr; cudaMalloc(&d_arr, 1000);
B. float *d_arr; cudaMalloc((void**)&d_arr, 1000 * sizeof(float));
C. float *d_arr = new float[1000]; // on GPU
D. float *d_arr; malloc(d_arr, 1000 * sizeof(float));
cudaMalloc((void**)&d_arr, size_in_bytes) allocates memory on the GPU. You must pass the address of the pointer (&d_arr) cast to void**, and specify the size in bytes (not number of elements). So 1000 floats = 1000 * sizeof(float) = 4000 bytes. Option A forgets sizeof(float) and would only allocate 1000 bytes. cudaMalloc returns cudaError_t for error checking.
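A sketch of the full allocation pattern, including error checking and the matching free:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main(void) {
    float *d_arr = NULL;
    size_t bytes = 1000 * sizeof(float);  // size in BYTES, not elements

    // cudaMalloc takes the ADDRESS of the pointer, cast to void**
    cudaError_t err = cudaMalloc((void**)&d_arr, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_arr);  // always pair cudaMalloc with cudaFree
    return 0;
}
```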
7
Which cudaMemcpyKind value should be used to copy data FROM the host TO the device?
A. cudaMemcpyDeviceToHost
B. cudaMemcpyHostToDevice
C. cudaMemcpyDeviceToDevice
D. cudaMemcpyHostToHost
cudaMemcpyHostToDevice copies data from CPU (host) memory to GPU (device) memory. This is used before a kernel launch to transfer input data to the GPU. After the kernel finishes, you use cudaMemcpyDeviceToHost to copy results back. The enum values are: cudaMemcpyHostToHost (0), cudaMemcpyHostToDevice (1), cudaMemcpyDeviceToHost (2), cudaMemcpyDeviceToDevice (3). Interviewers often test if you know the correct direction.
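Both directions in a typical sketch (h_in, h_out, and d_buf are assumed to be a host input array, a host output array, and a device allocation of n floats):

```cuda
size_t bytes = n * sizeof(float);

// Input: host -> device, before the kernel launch.
// Note the argument order: destination first, source second (like memcpy).
cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);

// ... kernel runs on d_buf here ...

// Results: device -> host, after the kernel finishes
cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);
```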
8
What happens if you try to dereference a device pointer (allocated with cudaMalloc) directly in host code?
A. It works fine — CUDA automatically handles the memory transfer
B. It returns zero for all values
C. It causes a segmentation fault or access violation because the pointer addresses GPU memory, not CPU memory
D. It silently copies the data to host memory
Device pointers point to GPU memory addresses which are not accessible from CPU code (without Unified Memory). Attempting to dereference a device pointer on the host causes a segmentation fault / access violation. You MUST use cudaMemcpy to transfer data between host and device. This is a classic debugging question — many beginners make this mistake. With CUDA Unified Memory (cudaMallocManaged), the runtime handles page migration, but standard cudaMalloc pointers are device-only.
9
Why is cudaDeviceSynchronize() often needed after a kernel launch?
A. It allocates memory for the kernel on the device
B. Kernel launches are asynchronous — the host continues executing without waiting for the kernel to finish, so synchronization is needed before reading results
C. It copies the kernel code from host to device
D. It is required to compile the kernel at runtime
CUDA kernel launches are asynchronous with respect to the host. When you call kernel<<<...>>>(), control returns immediately to the CPU while the GPU executes in the background. cudaDeviceSynchronize() blocks the host thread until all previously launched kernels complete. You need this before reading results back with cudaMemcpy (though cudaMemcpy itself is synchronous and implicitly waits). It's also essential for accurate timing with CPU timers.
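A sketch of where the synchronization fits (someKernel and the surrounding names are illustrative):

```cuda
someKernel<<<grid, block>>>(d_data, n);  // returns immediately to the CPU

// Block the host until all prior GPU work completes, e.g. before
// stopping a CPU timer or inspecting managed memory on the host:
cudaDeviceSynchronize();

// A cudaMemcpy on the default stream also waits for prior kernels,
// so an explicit synchronize before it is not strictly required:
cudaMemcpy(h_result, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
```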
10
What does cudaGetLastError() return?
A. The number of errors that occurred during the last kernel launch
B. A string describing the last error
C. A cudaError_t value representing the last error, and resets the error state to cudaSuccess
D. A boolean indicating whether an error occurred
cudaGetLastError() returns the last error as a cudaError_t enum value and resets the internal error variable to cudaSuccess. Use cudaGetErrorString(err) to convert it to a human-readable string. The common pattern is: kernel<<<...>>>(); cudaError_t err = cudaGetLastError(); if (err != cudaSuccess) { printf("%s\n", cudaGetErrorString(err)); }. There's also cudaPeekAtLastError() which returns the error WITHOUT resetting it.
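That pattern is often wrapped in a macro; a sketch (the macro name CUDA_CHECK_LAUNCH is illustrative, not a CUDA API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define CUDA_CHECK_LAUNCH()                                        \
    do {                                                           \
        cudaError_t err = cudaGetLastError();  /* resets state */  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "Kernel launch failed: %s\n",          \
                    cudaGetErrorString(err));                      \
        }                                                          \
    } while (0)

// Usage:
//   myKernel<<<grid, block>>>(args);
//   CUDA_CHECK_LAUNCH();      // catches launch configuration errors
//   cudaPeekAtLastError();    // same check, but does NOT reset the state
```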
11
What is the maximum number of threads per block on most modern NVIDIA GPUs (Compute Capability 2.0+)?
A. 256
B. 512
C. 1024
D. 2048
The maximum threads per block is 1024 for Compute Capability 2.0 and above (Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Hopper, Blackwell). For CC 1.x (Tesla), it was 512. If you launch a kernel with more than 1024 threads per block, the kernel will fail to launch and cudaGetLastError() will return cudaErrorInvalidConfiguration. This is a very common interview question!
12
If you need to process an array of N = 1000 elements and launch blocks of 256 threads, how many blocks should you launch to ensure all elements are covered?
A. 3 blocks (3 × 256 = 768 — not enough but closest)
B. 4 blocks (using the ceiling division formula: (N + blockSize - 1) / blockSize)
C. 1000 blocks (one block per element)
D. 2 blocks (one for each half of the array)
The ceiling division formula is: numBlocks = (N + blockSize - 1) / blockSize = (1000 + 255) / 256 = 1255 / 256 = 4 (integer division). This gives 4 × 256 = 1024 threads, which covers all 1000 elements. The extra 24 threads must be guarded with: if (tid < N) { ... }. This is the standard pattern for handling arrays that aren't exact multiples of the block size. Every CUDA developer must know this formula!
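The formula and the guard together, as a sketch (kernel and wrapper names are illustrative):

```cuda
__global__ void scaleKernel(float *arr, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {                // guard: the last block has 24 idle threads
        arr[tid] *= 2.0f;
    }
}

void launchScale(float *d_arr, int N) {  // N = 1000 in the question
    int blockSize = 256;
    // Ceiling division: rounds up using only integer arithmetic
    int numBlocks = (N + blockSize - 1) / blockSize;  // (1000+255)/256 = 4
    scaleKernel<<<numBlocks, blockSize>>>(d_arr, N);  // 4 x 256 = 1024 threads
}
```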
13
What is the correct sequence of CUDA operations for a typical GPU computation?
A. Launch kernel → Allocate device memory → Copy results to host → Free memory
B. Allocate device memory → Copy input to device → Launch kernel → Copy results to host → Free device memory
C. Copy input to device → Launch kernel → Allocate device memory → Free memory
D. Allocate device memory → Launch kernel → Copy input to device → Copy results to host
The standard CUDA workflow is: (1) cudaMalloc — allocate GPU memory, (2) cudaMemcpy Host→Device — transfer input data, (3) kernel<<<...>>>() — launch computation, (4) cudaMemcpy Device→Host — retrieve results, (5) cudaFree — release GPU memory. This pattern is sometimes called the 'CUDA lifecycle'. Forgetting any step is a common source of bugs. Some interviewers call this the '5-step CUDA dance'.
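All five steps in one sketch, using vector addition as the computation (function names are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void addOnGpu(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    cudaMalloc((void**)&d_a, bytes);                      // (1) allocate
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // (2) copy in
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);            // (3) launch

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // (4) copy out

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);          // (5) free
}
```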
14
Which combination of qualifiers allows a function to be compiled for BOTH host and device execution?
A. __global__ only
B. __host__ __device__
C. __device__ only
D. __shared__
Using __host__ __device__ together tells the CUDA compiler (nvcc) to generate two versions of the function — one for the CPU and one for the GPU. This is useful for utility functions (e.g., math helpers) that you want to call from both host code and kernel code. __global__ functions are device-only (callable from host but execute on device). __shared__ is a memory qualifier, not a function qualifier.
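A sketch of one such dual-compiled helper (the function name lerp is illustrative):

```cuda
// One definition, two compiled versions: one for CPU, one for GPU
__host__ __device__ inline float lerp(float a, float b, float t) {
    return a + t * (b - a);
}

__global__ void rampKernel(float *out, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = lerp(a, b, i / (float)n);  // device version
}

// Host code can call the very same function directly:
// float mid = lerp(0.0f, 10.0f, 0.5f);  // CPU version
```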
15
What happens if a kernel is launched with an invalid configuration, such as <<<0, 256>>> or <<<1, 2048>>>?
A. The kernel runs with default configuration values
B. The program crashes immediately with a runtime exception
C. The kernel silently fails to launch — no GPU code executes, and the error can only be detected by checking cudaGetLastError()
D. The CUDA driver automatically adjusts the configuration to valid values
Invalid kernel launch configurations (like 0 blocks, or >1024 threads per block) cause a silent launch failure. The host code continues executing as if nothing happened, but no kernel code runs on the GPU. The error is only detectable by calling cudaGetLastError(), which returns cudaErrorInvalidConfiguration. This is why error checking after every kernel launch is critical! Many CUDA bugs stem from silent launch failures that go unchecked. Interviewers test this to see if you understand CUDA's asynchronous error model.