15 questions on concurrent GPU execution — a senior-level interview topic that shows you understand production CUDA programming!
📋 Instructions
Answer all 15 questions on CUDA streams, async operations, and concurrency.
Streams = ordered queues. Pinned memory = required for async. Events = GPU timestamps. Overlap needs different streams!
⚠️ Try solving it yourself first — you'll learn more!
🧠 Quiz Time
1
What is a CUDA stream?
A) A memory buffer allocated on the GPU for storing intermediate results
B) A sequence of operations that execute in order on the GPU
C) A hardware unit inside the GPU that processes floating-point instructions
D) A CPU thread that manages GPU kernel launches
A CUDA stream is a sequence of operations (kernel launches, memory copies, etc.) that execute in the order they are issued — but operations in different streams can run concurrently. Streams are the fundamental abstraction for expressing concurrency in CUDA.
2
What is the behavior of the default stream (stream 0) under the legacy default?
A) It runs concurrently with all other streams without any synchronization
B) It synchronizes with all other streams — no other stream can run while stream 0 has pending work
C) It only synchronizes with streams that share the same GPU context
D) It has the highest priority and preempts all other streams
Under the legacy default, the default stream (stream 0) is an implicit synchronization barrier. An operation in the default stream will not begin until all previously issued operations in any stream have completed, and no subsequent operation in any stream will begin until the default stream operation finishes. This is why using the default stream can serialize your GPU work.
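A minimal sketch of this serialization, using hypothetical kernel names (`kernelA` through `kernelD`) and launch parameters:

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernelA<<<grid, block, 0, s1>>>(d_a);  // s1 and s2 could run concurrently...
kernelB<<<grid, block, 0, s2>>>(d_b);

// Issued to the legacy default stream: waits for s1 AND s2 to drain,
// and blocks any later work in s1/s2 until it finishes.
kernelC<<<grid, block>>>(d_c);

kernelD<<<grid, block, 0, s1>>>(d_d);  // cannot start before kernelC completes
```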
3
Which pair of API calls correctly creates and destroys a CUDA stream?
A) cudaStreamAlloc() / cudaStreamFree()
B) cudaStreamCreate() / cudaStreamDestroy()
C) cudaCreateStream() / cudaDeleteStream()
D) cudaStreamInit() / cudaStreamRelease()
The correct CUDA Runtime API calls are cudaStreamCreate(&stream) to create a stream and cudaStreamDestroy(stream) to destroy it. These manage the lifetime of non-default streams used for concurrent operations.
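A typical lifecycle looks like the sketch below (kernel and buffer names are illustrative):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);                         // create a non-default stream

myKernel<<<blocks, threads, 0, stream>>>(d_data);  // 4th launch param = stream
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);                     // wait for its work to finish
cudaStreamDestroy(stream);                         // then release it
```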
4
What is the PRIMARY requirement for overlapping a host-to-device memory transfer with kernel execution?
A) Both operations must be issued to the same stream
B) The kernel must use shared memory exclusively
C) The transfer and kernel must be in different streams, and the transfer must use pinned (page-locked) host memory
D) The GPU must have at least two SMs available
To overlap data transfer with computation you need two things: (1) the transfer and kernel must be in different streams so they can execute concurrently, and (2) the host memory must be pinned (page-locked) because asynchronous DMA transfers require pinned memory. Without both conditions, the operations will serialize.
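A sketch of both conditions together — pinned host memory plus separate streams (buffer and kernel names are hypothetical):

```cuda
float *h_in;                          // host buffer must be pinned for async DMA
cudaMallocHost(&h_in, bytes);

cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);

// Different streams + pinned memory => the copy and the kernel can overlap
cudaMemcpyAsync(d_next, h_in, bytes, cudaMemcpyHostToDevice, copyStream);
process<<<blocks, threads, 0, computeStream>>>(d_current);
```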
5
Why does asynchronous memory transfer (cudaMemcpyAsync) require pinned (page-locked) host memory?
A) Pinned memory is faster to allocate than pageable memory
B) The GPU's DMA engine needs a guaranteed physical address that won't be paged out by the OS
C) Pinned memory is automatically cached on the GPU L2 cache
D) The CUDA driver cannot track pageable memory allocations
DMA (Direct Memory Access) transfers between host and device bypass the CPU. The DMA engine operates on physical addresses and needs a guarantee that the memory will not be paged out (moved to disk) by the OS during the transfer. Pinned memory provides this guarantee because it is locked in physical RAM, giving the DMA engine a stable physical address to read from or write to.
6
Which function allocates pinned (page-locked) host memory in CUDA?
A) cudaMalloc()
B) cudaMallocManaged()
C) cudaMallocHost()
D) cudaHostRegister() only
cudaMallocHost() (and equivalently cudaHostAlloc()) allocates pinned host memory. cudaMalloc() allocates device memory. cudaMallocManaged() allocates unified memory. cudaHostRegister() pins already-allocated pageable memory but does not allocate new memory itself.
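Both allocation paths side by side, as a sketch:

```cuda
// Allocate new pinned memory directly
float *h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));    // pinned (page-locked)
// ... use with cudaMemcpyAsync ...
cudaFreeHost(h_pinned);                          // pair with cudaFreeHost, not free()

// Alternative: pin an existing pageable allocation in place
float *h_buf = (float *)malloc(N * sizeof(float));
cudaHostRegister(h_buf, N * sizeof(float), cudaHostRegisterDefault);
// ... use with cudaMemcpyAsync ...
cudaHostUnregister(h_buf);
free(h_buf);
```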
7
What is the key difference between cudaMemcpy and cudaMemcpyAsync?
A) cudaMemcpy copies data between GPUs, while cudaMemcpyAsync copies between host and device
B) cudaMemcpy is synchronous and blocks the host thread; cudaMemcpyAsync is non-blocking and returns control immediately (when used with pinned memory and a non-default stream)
C) cudaMemcpyAsync is always faster because it uses compression
D) cudaMemcpy uses the PCIe bus while cudaMemcpyAsync uses NVLink
cudaMemcpy is a synchronous call — the host thread blocks until the copy completes. cudaMemcpyAsync is asynchronous — it returns control to the host immediately after queuing the transfer, allowing the CPU to continue with other work. However, the host memory must be pinned for truly asynchronous behavior; with pageable memory, cudaMemcpyAsync may fall back to synchronous behavior.
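The contrast in a minimal sketch (`doCpuWork` is a hypothetical placeholder for any host-side computation):

```cuda
// Synchronous: host blocks here until the copy completes
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

// Asynchronous: returns immediately (h_b must be pinned, s non-default)
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s);
doCpuWork();                       // CPU overlaps with the in-flight transfer
cudaStreamSynchronize(s);          // wait only when the data is actually needed
```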
8
Which API call records a CUDA event into a stream?
A) cudaEventCreate(stream, event)
B) cudaEventRecord(event, stream)
C) cudaEventInsert(event, stream)
D) cudaStreamRecord(stream, event)
cudaEventRecord(event, stream) records (places) an event into the specified stream. When all operations previously enqueued in that stream have completed, the event is 'recorded' and its timestamp is captured. The event must first be created with cudaEventCreate().
9
How do you measure elapsed GPU time between two CUDA events?
A) float ms = cudaEventDiff(start, stop);
B) cudaEventElapsedTime(&ms, start, stop); — returns time in milliseconds
C) cudaEventQuery(start) - cudaEventQuery(stop);
D) cudaGetElapsedTime(start, stop, &ms); — returns time in microseconds
cudaEventElapsedTime(&milliseconds, startEvent, stopEvent) computes the elapsed time in milliseconds between two recorded events. Both events must have been recorded and completed. This is the standard and most accurate way to time GPU operations because it measures actual GPU execution time, not wall-clock time.
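The full timing pattern, combining event creation, recording, and elapsed-time query (kernel name is illustrative):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);                 // timestamp before the work
myKernel<<<blocks, threads, 0, stream>>>(d_data);
cudaEventRecord(stop, stream);                  // timestamp after the work

cudaEventSynchronize(stop);                     // wait until stop has completed
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);         // elapsed GPU time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);
```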
10
What is the difference between cudaStreamSynchronize() and cudaDeviceSynchronize()?
A) cudaStreamSynchronize waits for one specific stream to finish; cudaDeviceSynchronize waits for ALL streams on the device to finish
B) They are identical — both wait for all GPU work to complete
C) cudaDeviceSynchronize only waits for the default stream; cudaStreamSynchronize waits for any stream
D) cudaStreamSynchronize is non-blocking; cudaDeviceSynchronize is blocking
cudaStreamSynchronize(stream) blocks the host thread until all operations in the specified stream have completed — other streams may still be running. cudaDeviceSynchronize() blocks until ALL previously issued work across ALL streams on the current device has completed. Use cudaStreamSynchronize for fine-grained control; use cudaDeviceSynchronize when you need a full device barrier.
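The two granularities in a two-line sketch:

```cuda
cudaStreamSynchronize(s1);   // wait for s1 only; s2 may still be running
cudaDeviceSynchronize();     // full barrier: waits for every stream on this GPU
```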
11
What does the cudaStreamNonBlocking flag do when creating a stream?
A) It makes the stream ignore all synchronization calls including cudaDeviceSynchronize
B) It prevents the stream from doing implicit synchronization with the default stream (stream 0)
C) It allows the stream to execute kernels without occupying any SMs
D) It disables error checking for all operations in the stream
When you create a stream with cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking), that stream will NOT implicitly synchronize with the default stream. Normally (legacy behavior), work issued to the default stream acts as a barrier for all other streams. The NonBlocking flag lets the stream operate independently from the default stream, enabling more concurrency.
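A sketch of the flag in use (kernel names are hypothetical):

```cuda
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

// Work in s no longer serializes against legacy default-stream launches
myKernel<<<blocks, threads>>>(d_a);           // legacy default stream
otherKernel<<<blocks, threads, 0, s>>>(d_b);  // can run concurrently with it
```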
12
In a classic pipeline pattern for overlapping H2D transfer, kernel execution, and D2H transfer, how should work be organized?
A) Issue all H2D copies first, then all kernels, then all D2H copies — each in one stream
B) Divide data into chunks; for each chunk use a separate stream and issue H2D, kernel, D2H in sequence within that stream — interleaving chunks across streams
C) Use one stream and call cudaMemcpyAsync for all operations — the driver handles overlap automatically
D) Launch all kernels first, then copy results back — transfers cannot overlap with compute
The pipeline pattern divides data into chunks and assigns each chunk to its own stream. Within each stream, operations execute in order (H2D → kernel → D2H), but across streams they can overlap: stream 1's kernel can run while stream 2's H2D transfer is in flight and stream 3's D2H transfer is finishing. This keeps both the copy engines and the compute units busy, maximizing GPU utilization.
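A sketch of the chunked pipeline, assuming pinned host buffers `h_in`/`h_out`, device buffers `d_in`/`d_out`, a hypothetical kernel `process`, and `N` divisible by the stream count and block size:

```cuda
const int nStreams = 4;
const size_t chunk = N / nStreams;
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

for (int i = 0; i < nStreams; ++i) {
    size_t off = i * chunk;
    // H2D -> kernel -> D2H are ordered within streams[i],
    // but chunks in different streams overlap with each other
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<chunk / 256, 256, 0, streams[i]>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < nStreams; ++i) cudaStreamSynchronize(streams[i]);
```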
13
What is Hyper-Q in NVIDIA GPUs?
A) A memory compression technique that doubles effective bandwidth
B) A technology providing multiple hardware work queues (up to 32) allowing streams to map to independent queues and achieve true concurrency
C) A scheduling algorithm that prioritizes small kernels over large ones
D) A feature that automatically doubles the number of CUDA cores available
Hyper-Q (introduced with Kepler / compute capability 3.5) provides multiple hardware work queues per GPU (up to 32). Before Hyper-Q, all streams funneled into a single hardware queue, creating false dependencies. With Hyper-Q, different streams can map to different hardware queues, enabling true concurrent execution of kernels from different streams without false serialization.
14
Which function is used to select a specific GPU in a multi-GPU system before issuing CUDA operations?
A) cudaChooseDevice()
B) cudaSetDevice()
C) cudaSelectGPU()
D) cudaDeviceEnable()
cudaSetDevice(int deviceId) sets the current GPU for the calling host thread. All subsequent CUDA calls (memory allocations, kernel launches, stream creation, etc.) will target that device until cudaSetDevice is called again. cudaChooseDevice() exists but selects a device based on desired properties — it's not used to switch the active device.
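A sketch of per-device dispatch (the `d_bufs` array and kernel name are hypothetical):

```cuda
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);                  // subsequent calls target this GPU
    cudaMalloc(&d_bufs[dev], bytes);     // allocated on device `dev`
    myKernel<<<blocks, threads>>>(d_bufs[dev]);
}
```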
15
What happens if you call cudaMemcpyAsync with pageable (non-pinned) host memory?
A) The call fails immediately and returns cudaErrorInvalidValue
B) The transfer still occurs but behaves synchronously — the host thread blocks until the copy is done, negating the async benefit
C) The data is silently corrupted because DMA cannot access pageable memory
D) The CUDA runtime automatically pins the memory permanently for future transfers
If pageable memory is passed to cudaMemcpyAsync, the CUDA runtime falls back to a synchronous copy path internally. It must first copy the data to an internal pinned staging buffer, then DMA from there. This blocks the host thread, effectively making it behave like cudaMemcpy. No error is returned, but you lose the concurrency benefits of async transfers.
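The contrast in a sketch — same API call, very different behavior depending on the allocation:

```cuda
float *h_pageable = (float *)malloc(bytes);   // ordinary pageable memory
float *h_pinned;
cudaMallocHost(&h_pinned, bytes);             // page-locked memory

// Falls back to a staged, effectively synchronous copy: host blocks
cudaMemcpyAsync(d_a, h_pageable, bytes, cudaMemcpyHostToDevice, s);

// Truly asynchronous: returns immediately, DMA proceeds in the background
cudaMemcpyAsync(d_b, h_pinned, bytes, cudaMemcpyHostToDevice, s);
```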