🚀 Chapter 2, Part 3: The Magic Triple Angle Brackets
💡 Story: Imagine you're a battle commander. You shout orders: "Delta Squadron, 32 soldiers, attack!" In CUDA, this is myKernel<<<gridDim, blockDim>>>(). Those triple angle brackets are your war cry!
The kernel launch syntax:
kernelName<<<gridDim, blockDim>>>(arguments);
// Example:
myKernel<<<4, 256>>>(array, n);
//         ^  ^-- 256 threads per block
//         +----- 4 blocks in the grid
// Total threads = 4 × 256 = 1024 threads launched!
#include <stdio.h>

__global__ void squareKernel(int* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global thread ID
    if (i < n) {
        arr[i] = arr[i] * arr[i];  // Each thread squares its element
    }
}

int main() {
    int n = 1000;
    int* d_arr;  // d_ prefix = 'device' (GPU) array, a common convention

    // Allocate GPU memory
    cudaMalloc(&d_arr, n * sizeof(int));
    // Zero the buffer so the kernel reads defined values
    // (a real program would copy its input here with cudaMemcpy)
    cudaMemset(d_arr, 0, n * sizeof(int));

    // Launch: enough blocks to cover n elements (rounded up)
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = ceil(n/256)
    squareKernel<<<blocks, threadsPerBlock>>>(d_arr, n);

    cudaDeviceSynchronize();  // Wait for the kernel to finish
    cudaFree(d_arr);          // Release the GPU memory
    return 0;
}
🧮 The ceiling division trick — a classic CUDA pattern:
// How many blocks do we need?
// E.g., 1000 elements, 256 threads/block:
// 1000 / 256 = 3.9... but we need a whole number → 4 blocks
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// = (1000 + 255) / 256 = 1255 / 256 = 4 ✓ (integer division rounds down)
// ALWAYS add the bounds check in your kernel!
if (i < n) {  // Some threads in the last block may be 'out of bounds'!
    arr[i] = ...;
}
📌 CUDA Memory 101 (quick intro):
📋 Instructions
Calculate and print the correct launch configurations for these scenarios:
```
=== Kernel Launch Configurations ===
n=1000, threads=256: blocks=4
n=500, threads=32: blocks=16
n=10000, threads=128: blocks=79
n=1, threads=256: blocks=1
n=256, threads=256: blocks=1
```
Write a C function `calcBlocks(int n, int threadsPerBlock)` that returns the number of blocks needed, then call it for each case.
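The ceiling formula: `return (n + threadsPerBlock - 1) / threadsPerBlock;` This works because adding T - 1 (where T is threadsPerBlock) before dividing pushes any nonzero remainder past the next multiple of T, so the truncating integer division lands on the rounded-up value. When n is already a multiple of T, the extra T - 1 is not enough to reach the next multiple, so the result is unchanged.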
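For readers who want the reasoning spelled out, here is a short derivation. It assumes n ≥ 1 and T ≥ 1 (T being threadsPerBlock) and writes n as a quotient and remainder; the symbols q and r are introduced here for illustration and do not appear elsewhere in the lesson.

```latex
\[
n = qT + r, \qquad 0 \le r < T
\]
\[
\left\lfloor \frac{n + T - 1}{T} \right\rfloor
= q + \left\lfloor \frac{r + T - 1}{T} \right\rfloor
=
\begin{cases}
q,     & r = 0 \quad (\text{since } T - 1 < T) \\
q + 1, & 1 \le r \le T - 1 \quad (\text{since } T \le r + T - 1 < 2T)
\end{cases}
\]
```

Both cases match the rounded-up quotient: q when n divides evenly, q + 1 when there is a remainder.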
⚠️ Try solving it yourself first — you'll learn more!
#include <stdio.h>

int calcBlocks(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    printf("=== Kernel Launch Configurations ===\n");
    printf("n=1000, threads=256: blocks=%d\n", calcBlocks(1000, 256));
    printf("n=500, threads=32: blocks=%d\n", calcBlocks(500, 32));
    printf("n=10000, threads=128: blocks=%d\n", calcBlocks(10000, 128));
    printf("n=1, threads=256: blocks=%d\n", calcBlocks(1, 256));
    printf("n=256, threads=256: blocks=%d\n", calcBlocks(256, 256));
    return 0;
}