CUDA Programming: Your First CUDA Kernel
Exercise 8

Launching a Kernel


🚀 Chapter 2, Part 3: The Magic Triple Angle Brackets

💡 Story: Imagine you're a battle commander. You shout orders: "Delta Squadron, 32 soldiers, attack!" In CUDA, this is myKernel<<<1, 32>>>(). Those triple angle brackets are your war cry!

The kernel launch syntax:

```c
kernelName<<<gridDim, blockDim>>>(arguments);

// Example:
myKernel<<<4, 256>>>(array, n);
//         ^    ^-- 256 threads per block
//         +------- 4 blocks in the grid
// Total threads = 4 × 256 = 1024 threads launched!
```

The two key parameters:

  • 📦 gridDim (1st param) — How many blocks to create. Can be an int or dim3 for multi-dimensional
  • 🧵 blockDim (2nd param) — How many threads per block. Max is 1024 per block on modern GPUs
  • 💡 Total threads = gridDim × blockDim
  • All threads start simultaneously (well, in warps of 32, but conceptually simultaneously)
```c
#include <stdio.h>

__global__ void squareKernel(int* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global thread ID
    if (i < n) {
        arr[i] = arr[i] * arr[i];  // Each thread squares its element
    }
}

int main() {
    int n = 1000;
    int* d_arr;  // d_ prefix = 'device' (GPU) array (common convention!)

    // Allocate GPU memory
    cudaMalloc(&d_arr, n * sizeof(int));

    // Launch: enough blocks to cover n elements (rounded up)
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = ceil(n/256)
    squareKernel<<<blocks, threadsPerBlock>>>(d_arr, n);
    cudaDeviceSynchronize();

    cudaFree(d_arr);
    return 0;
}
```
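The launch parameters can also be dim3 values when the problem is naturally 2D or 3D, as mentioned in the gridDim bullet above. Here is a minimal sketch of a 2D launch; the kernel name scaleImage and the image dimensions are made up for illustration:

```c
#include <cuda_runtime.h>

// Each thread handles one pixel of a width × height image.
__global__ void scaleImage(float* img, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column index
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row index
    if (x < width && y < height) {                  // 2D bounds check!
        img[y * width + x] *= 2.0f;
    }
}

int main() {
    int width = 1920, height = 1080;
    float* d_img;
    cudaMalloc(&d_img, width * height * sizeof(float));

    dim3 block(16, 16);                             // 16 × 16 = 256 threads/block
    dim3 grid((width  + block.x - 1) / block.x,     // ceiling division in x...
              (height + block.y - 1) / block.y);    // ...and in y
    scaleImage<<<grid, block>>>(d_img, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    return 0;
}
```

The same ceiling-division trick applies per dimension, and the bounds check now guards both x and y.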

🧮 The ceiling division trick — a classic CUDA pattern:

```c
// How many blocks do we need?
// E.g., 1000 elements, 256 threads/block:
// 1000 / 256 = 3.9... but we need a whole number → 4 blocks
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// = (1000 + 255) / 256 = 1255 / 256 = 4 ✓ (integer division rounds down)

// ALWAYS add the bounds check in your kernel!
if (i < n) {  // Some threads in the last block may be 'out of bounds'!
    arr[i] = ...;
}
```

📌 CUDA Memory 101 (quick intro):

  • cudaMalloc(&ptr, size) — Allocate GPU memory
  • cudaFree(ptr) — Free GPU memory
  • cudaMemcpy(dst, src, size, direction) — Copy between CPU and GPU
  • cudaMemcpyHostToDevice — CPU → GPU direction
  • cudaMemcpyDeviceToHost — GPU → CPU direction
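These five calls combine into the standard round-trip pattern: copy input to the GPU, run the kernel, copy results back. A sketch of that pattern using the squareKernel from the example above:

```c
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void squareKernel(int* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) arr[i] = arr[i] * arr[i];
}

int main() {
    int n = 1000;
    int h_arr[1000];                   // h_ prefix = 'host' (CPU) array
    for (int i = 0; i < n; i++) h_arr[i] = i;

    int* d_arr;
    cudaMalloc(&d_arr, n * sizeof(int));

    // CPU → GPU: kernels can only touch device memory
    cudaMemcpy(d_arr, h_arr, n * sizeof(int), cudaMemcpyHostToDevice);

    squareKernel<<<(n + 255) / 256, 256>>>(d_arr, n);

    // GPU → CPU: this copy waits for the kernel to finish first
    cudaMemcpy(h_arr, d_arr, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_arr[5] = %d\n", h_arr[5]);  // 5 squared = 25
    cudaFree(d_arr);
    return 0;
}
```

Note there's no explicit cudaDeviceSynchronize() here: a cudaMemcpy on the default stream implicitly waits for the preceding kernel, so the results are complete before they reach the host.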
📋 Instructions
Calculate and print the correct launch configurations for these scenarios:

```
=== Kernel Launch Configurations ===
n=1000, threads=256: blocks=4
n=500, threads=32: blocks=16
n=10000, threads=128: blocks=79
n=1, threads=256: blocks=1
n=256, threads=256: blocks=1
```

Write a C function `calcBlocks(int n, int threadsPerBlock)` that returns the number of blocks needed, then call it for each case.
The ceiling formula: return (n + threadsPerBlock - 1) / threadsPerBlock; — This works because adding (T-1) before dividing forces the integer division to round up.
⚠️ Try solving it yourself first — you'll learn more!
#include <stdio.h>

int calcBlocks(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    printf("=== Kernel Launch Configurations ===\n");
    printf("n=1000, threads=256: blocks=%d\n", calcBlocks(1000, 256));
    printf("n=500, threads=32: blocks=%d\n",   calcBlocks(500, 32));
    printf("n=10000, threads=128: blocks=%d\n", calcBlocks(10000, 128));
    printf("n=1, threads=256: blocks=%d\n",     calcBlocks(1, 256));
    printf("n=256, threads=256: blocks=%d\n",   calcBlocks(256, 256));
    return 0;
}
🧪 Test Cases

  • calcBlocks(1000, 256) → 4 (1000/256 rounded up)
  • calcBlocks(500, 32) → 16 (500/32 = 15.6 → 16)
  • calcBlocks(10000, 128) → 79 (10000/128 = 78.1 → 79)