🖼️ Chapter 3, Part 4: 2D Grids — Perfect for Images & Matrices
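💡 Story: You're now the AI engineer at Netflix. Your job: add a cinematic filter to every frame of a 4K movie. Each frame is 3840×2160 pixels. With a 2D grid, each thread handles one pixel, so all 8,294,400 pixels of a frame are processed in parallel rather than one at a time. The example below uses a 1080p frame (1920×1080) to keep the numbers small; the same kernel works unchanged at 4K.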
```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Convert an interleaved 8-bit RGB image to grayscale, one thread per pixel
__global__ void grayscale(const unsigned char* rgb, unsigned char* gray, int width, int height) {
    // Each thread maps to one pixel
    int col = threadIdx.x + blockIdx.x * blockDim.x;  // x coordinate (column)
    int row = threadIdx.y + blockIdx.y * blockDim.y;  // y coordinate (row)

    // The grid is rounded up, so threads in edge blocks can fall outside the image
    if (col < width && row < height) {
        int pixelIdx = row * width + col;  // row-major pixel index
        int rgbIdx = pixelIdx * 3;         // RGB has 3 channels
        unsigned char r = rgb[rgbIdx + 0];
        unsigned char g = rgb[rgbIdx + 1];
        unsigned char b = rgb[rgbIdx + 2];
        // Luma weights (ITU-R BT.601)
        gray[pixelIdx] = (unsigned char)(0.299f*r + 0.587f*g + 0.114f*b);
    }
}

int main() {
    int width = 1920, height = 1080;

    // Device buffers, declared here so the snippet compiles
    unsigned char *d_rgb = NULL, *d_gray = NULL;
    cudaMalloc(&d_rgb, (size_t)width * height * 3);
    cudaMalloc(&d_gray, (size_t)width * height);
    // ... (host allocation, copy to device, etc.) ...

    // 2D block: 16x16 = 256 threads per block
    dim3 blockSize(16, 16);
    dim3 gridSize((width + blockSize.x - 1) / blockSize.x,
                  (height + blockSize.y - 1) / blockSize.y);
    // gridSize = (120, 68) → 120×68 = 8160 blocks
    // Total threads = 8160 × 256 = 2,088,960 (covers all 2,073,600 pixels)

    grayscale<<<gridSize, blockSize>>>(d_rgb, d_gray, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_rgb);
    cudaFree(d_gray);
    return 0;
}
```
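To check the kernel's output, it can help to have a plain CPU version of the same arithmetic. The sketch below is one way to write it, assuming the same interleaved RGB layout as the kernel. The name `grayscaleCpu` is hypothetical, and the two nested loops stand in for the 2D thread grid: `row` and `col` take the values that each GPU thread would compute from its block and thread indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: same indexing and weights as the grayscale kernel.
std::vector<unsigned char> grayscaleCpu(const std::vector<unsigned char>& rgb,
                                        int width, int height) {
    // Input must hold 3 bytes (R, G, B) per pixel
    assert(rgb.size() == static_cast<std::size_t>(width) * height * 3);
    std::vector<unsigned char> gray(static_cast<std::size_t>(width) * height);

    for (int row = 0; row < height; ++row) {      // plays the role of the y index
        for (int col = 0; col < width; ++col) {   // plays the role of the x index
            std::size_t pixelIdx = static_cast<std::size_t>(row) * width + col;
            std::size_t rgbIdx = pixelIdx * 3;
            float r = rgb[rgbIdx + 0];
            float g = rgb[rgbIdx + 1];
            float b = rgb[rgbIdx + 2];
            // BT.601 luma weights, truncated to 8 bits like the kernel's cast
            gray[pixelIdx] = static_cast<unsigned char>(0.299f * r + 0.587f * g + 0.114f * b);
        }
    }
    return gray;
}
```

When comparing against the GPU result, allowing a difference of 1 per pixel is a sensible precaution, since the device compiler may fuse the multiplies and adds and round slightly differently.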
Why 16×16 blocks (256 threads)?
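Two checks commonly applied when picking a block shape can be written in a few lines of host code. The sketch below assumes a warp size of 32 and a limit of 1024 threads per block. Both hold on current NVIDIA GPUs, but they are hard-coded assumptions here; on a real device they can be read from the `warpSize` and `maxThreadsPerBlock` fields returned by `cudaGetDeviceProperties`. The helper names are hypothetical.

```cpp
#include <cassert>

// Assumed limits (see lead-in); query the device instead of hard-coding in real code.
constexpr int kWarpSize = 32;
constexpr int kMaxThreadsPerBlock = 1024;

// True if the block fits the per-block limit and fills whole warps.
bool isWarpAlignedBlock(int blockX, int blockY) {
    int threads = blockX * blockY;
    return threads > 0 && threads <= kMaxThreadsPerBlock && threads % kWarpSize == 0;
}

// Threads launched when a width x height image is covered by blockX x blockY tiles.
long long launchedThreads(int width, int height, int blockX, int blockY) {
    long long gridX = (width + blockX - 1) / blockX;   // ceiling division
    long long gridY = (height + blockY - 1) / blockY;
    return gridX * gridY * blockX * blockY;
}
```

Under these assumptions, 16×16 passes both checks (256 threads, 8 full warps), while 10×10 does not fill whole warps and 33×32 exceeds the per-block limit. For a 1920×1080 frame, `launchedThreads` gives 2,088,960 for 16×16, so only about 0.7% of the launched threads fall outside the image and exit at the bounds check.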
🔑 Common 2D CUDA applications:
📋 Instructions
Calculate and print the 2D grid configurations for common image processing workloads:
```
=== 2D CUDA Grid Calculator ===
Block size: 16x16 (256 threads/block)
720p (1280x720): grid=80x45, blocks=3600, threads=921600
1080p (1920x1080): grid=120x68, blocks=8160, threads=2088960
4K (3840x2160): grid=240x135, blocks=32400, threads=8294400
```
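This code is almost complete! The printImageConfig function already implements the grid calculation, so just run it to see the configurations. The key formula is gridX = ceil(width / blockSize) and gridY = ceil(height / blockSize). In integer arithmetic, plain division truncates, so the ceiling is written as (width + blockSize - 1) / blockSize. For example, 1080 / 16 = 67.5, which rounds up to 68 blocks.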
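For reference, the calculation could look something like the sketch below. The exercise's real `printImageConfig` is not shown here, so the signature is an assumption, and `ceilDiv` and `formatImageConfig` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: ceil(a / b) using integer arithmetic only.
int ceilDiv(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Hypothetical formatter: builds one line of the calculator output.
std::string formatImageConfig(const char* name, int width, int height, int blockSize) {
    int gridX = ceilDiv(width, blockSize);
    int gridY = ceilDiv(height, blockSize);
    long long blocks = static_cast<long long>(gridX) * gridY;
    long long threads = blocks * blockSize * blockSize;  // includes out-of-image threads
    char line[128];
    std::snprintf(line, sizeof(line), "%s (%dx%d): grid=%dx%d, blocks=%lld, threads=%lld",
                  name, width, height, gridX, gridY, blocks, threads);
    return std::string(line);
}

// Assumed signature: prints the configuration for one named resolution.
void printImageConfig(const char* name, int width, int height, int blockSize) {
    std::printf("%s\n", formatImageConfig(name, width, height, blockSize).c_str());
}
```

Calling `printImageConfig("1080p", 1920, 1080, 16)` prints the 1080p line. Note that the thread count is blocks × 256, so it exceeds the pixel count whenever a dimension is not a multiple of 16, as with 1080 rows.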