🌐 Chapter 3, Part 2: The Grid — Command the Whole Army
💡 A Grid is the entire deployment. When you launch myKernel<<<16, 256>>>(), you deploy a grid of 16 blocks, each running 256 threads. The grid is the top-level organization of your parallel computation.
Grids can be 1D, 2D, or 3D! You use a special type called dim3:
```
#include <cuda_runtime.h>

// 1D grid (most common for arrays)
dim3 grid1D(num_blocks);          // same as <<<num_blocks, threads>>>

// 2D grid (great for image processing)
dim3 grid2D(blocks_x, blocks_y);  // Arranges blocks in a 2D pattern
dim3 block2D(16, 16);             // 16x16 = 256 threads/block
// Launch: kernel<<<grid2D, block2D>>>()

// Inside a 2D kernel:
__global__ void process2D(float* img, int width, int height) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;  // x coordinate
    int row = threadIdx.y + blockIdx.y * blockDim.y;  // y coordinate
    if (col < width && row < height) {
        int idx = row * width + col;  // 2D → 1D index conversion
        img[idx] *= 1.5f;             // Brighten!
    }
}
```
dim3 — The dimension struct:
```
// 2D image processing: the classic CUDA pattern
// For an 800x600 image:
dim3 blockSize(16, 16);                  // 256 threads/block
dim3 gridSizeWrong(800/16, 600/16);      // WRONG: 600/16 = 37.5, but integer
                                         // division truncates to 37 → bottom rows missed
// Correct way: round up with ceiling division
dim3 gridSize((800+15)/16, (600+15)/16); // = (50, 38)
// Total blocks  = 50 × 38   = 1900
// Total threads = 1900 × 256 = 486,400
// Compare to 800×600 = 480,000 pixels → close enough, with bounds check!
```
📋 Instructions
Print grid configurations for different problem sizes. For each image size, compute the required grid dimensions using 16×16 thread blocks:
```
=== 2D Grid Configurations (16x16 blocks) ===
Image 800x600: grid 50x38, total blocks=1900
Image 1920x1080: grid 120x68, total blocks=8160
Image 256x256: grid 16x16, total blocks=256
Image 100x100: grid 7x7, total blocks=49
```
The starter code is already nearly complete! Just run it: the formula (width + blockSize - 1) / blockSize performs the ceiling division that rounds grid dimensions up, so partial blocks at the edges are never lost.