CUDA Programming: Your First CUDA Kernel
Exercise 8

Launching a Kernel


🚀 Chapter 2, Part 3: The Magic Triple Angle Brackets

💡 Story: Imagine you're a battle commander. You shout orders: "Delta Squadron, 32 soldiers, attack!" In CUDA, this is myKernel<<<1, 32>>>(). Those triple angle brackets are your war cry!

The kernel launch syntax:

```c
kernelName<<<gridDim, blockDim>>>(arguments);

// Example:
myKernel<<<4, 256>>>(array, n);
//         ^    ^-- 256 threads per block
//         +------- 4 blocks in the grid
// Total threads = 4 × 256 = 1024 threads launched!
```

The two key parameters:

  • 📦 gridDim (1st param) — How many blocks to create. Can be an int or dim3 for multi-dimensional
  • 🧵 blockDim (2nd param) — How many threads per block. Max is 1024 per block on modern GPUs
  • 💡 Total threads = gridDim × blockDim
  • All threads start simultaneously (well, in warps of 32, but conceptually simultaneously)
```c
#include <stdio.h>

__global__ void squareKernel(int* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global thread ID
    if (i < n) {
        arr[i] = arr[i] * arr[i];  // Each thread squares its element
    }
}

int main() {
    int n = 1000;
    int* d_arr;  // d_ prefix = 'device' (GPU) array (common convention!)

    // Allocate GPU memory
    cudaMalloc(&d_arr, n * sizeof(int));

    // Launch: enough blocks to cover n elements (rounded up)
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = ceil(n/256)
    squareKernel<<<blocks, threadsPerBlock>>>(d_arr, n);
    cudaDeviceSynchronize();

    cudaFree(d_arr);
    return 0;
}
```
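The launch parameters can also be dim3 values when the problem is naturally 2D or 3D, as mentioned in the gridDim bullet above. Here is a minimal sketch of a 2D launch; the kernel name scaleImage and the image dimensions are made up for illustration:

```c
#include <cuda_runtime.h>

// Each thread handles one pixel of a width × height image.
__global__ void scaleImage(float* img, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column index
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row index
    if (x < width && y < height) {                  // 2D bounds check!
        img[y * width + x] *= 2.0f;
    }
}

int main() {
    int width = 1920, height = 1080;
    float* d_img;
    cudaMalloc(&d_img, width * height * sizeof(float));

    dim3 block(16, 16);                             // 16 × 16 = 256 threads/block
    dim3 grid((width  + block.x - 1) / block.x,     // ceiling division in x...
              (height + block.y - 1) / block.y);    // ...and in y
    scaleImage<<<grid, block>>>(d_img, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    return 0;
}
```

The same ceiling-division trick applies per dimension, and the bounds check now guards both x and y.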

🧮 The ceiling division trick — a classic CUDA pattern:

```c
// How many blocks do we need?
// E.g., 1000 elements, 256 threads/block:
// 1000 / 256 = 3.9... but we need a whole number → 4 blocks
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// = (1000 + 255) / 256 = 1255 / 256 = 4 ✓ (integer division rounds down)

// ALWAYS add the bounds check in your kernel!
if (i < n) {  // Some threads in the last block may be 'out of bounds'!
    arr[i] = ...;
}
```

📌 CUDA Memory 101 (quick intro):

  • cudaMalloc(&ptr, size) — Allocate GPU memory
  • cudaFree(ptr) — Free GPU memory
  • cudaMemcpy(dst, src, size, direction) — Copy between CPU and GPU
  • cudaMemcpyHostToDevice — CPU → GPU direction
  • cudaMemcpyDeviceToHost — GPU → CPU direction
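These five calls combine into the standard round-trip pattern: copy input to the GPU, run the kernel, copy results back. A sketch of that pattern using the squareKernel from the example above:

```c
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void squareKernel(int* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) arr[i] = arr[i] * arr[i];
}

int main() {
    int n = 1000;
    int h_arr[1000];                   // h_ prefix = 'host' (CPU) array
    for (int i = 0; i < n; i++) h_arr[i] = i;

    int* d_arr;
    cudaMalloc(&d_arr, n * sizeof(int));

    // CPU → GPU: kernels can only touch device memory
    cudaMemcpy(d_arr, h_arr, n * sizeof(int), cudaMemcpyHostToDevice);

    squareKernel<<<(n + 255) / 256, 256>>>(d_arr, n);

    // GPU → CPU: this copy waits for the kernel to finish first
    cudaMemcpy(h_arr, d_arr, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_arr[5] = %d\n", h_arr[5]);  // 5 squared = 25
    cudaFree(d_arr);
    return 0;
}
```

Note there's no explicit cudaDeviceSynchronize() here: a cudaMemcpy on the default stream implicitly waits for the preceding kernel, so the results are complete before they reach the host.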
📋 Instructions
Calculate and print the correct launch configurations for these scenarios:

```
=== Kernel Launch Configurations ===
n=1000, threads=256: blocks=4
n=500, threads=32: blocks=16
n=10000, threads=128: blocks=79
n=1, threads=256: blocks=1
n=256, threads=256: blocks=1
```

Write a C function `calcBlocks(int n, int threadsPerBlock)` that returns the number of blocks needed, then call it for each case.
The ceiling formula: return (n + threadsPerBlock - 1) / threadsPerBlock; — This works because adding (T-1) before dividing forces the integer division to round up.
⚠️ Try solving it yourself first — you'll learn more!
#include <stdio.h>

int calcBlocks(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    printf("=== Kernel Launch Configurations ===\n");
    printf("n=1000, threads=256: blocks=%d\n", calcBlocks(1000, 256));
    printf("n=500, threads=32: blocks=%d\n",   calcBlocks(500, 32));
    printf("n=10000, threads=128: blocks=%d\n", calcBlocks(10000, 128));
    printf("n=1, threads=256: blocks=%d\n",     calcBlocks(1, 256));
    printf("n=256, threads=256: blocks=%d\n",   calcBlocks(256, 256));
    return 0;
}
🧪 Test Cases

  • calcBlocks(1000, 256) → 4 (1000/256 rounded up)
  • calcBlocks(500, 32) → 16 (500/32 = 15.6 → 16)
  • calcBlocks(10000, 128) → 79 (10000/128 = 78.1 → 79)