CUDA Programming Threads, Blocks & Grids
Exercise 14

2D Grids & Blocks 20 XP Medium


🖼️ Chapter 3, Part 4: 2D Grids — Perfect for Images & Matrices

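💡 Story: You're now the AI engineer at Netflix. Your job: add a cinematic filter to every frame of a 4K movie. Each frame is 3840×2160 pixels. With a 2D grid, each thread handles one pixel, so all 8,294,400 pixels are processed in parallel (the GPU schedules the blocks in batches rather than literally all at once). The example below uses a smaller 1080p frame (1920×1080) to keep the numbers easy to follow.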

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Apply grayscale to an RGB image
    __global__ void grayscale(unsigned char* rgb, unsigned char* gray,
                              int width, int height) {
        // Thread maps to one pixel
        int col = threadIdx.x + blockIdx.x * blockDim.x;  // x pixel
        int row = threadIdx.y + blockIdx.y * blockDim.y;  // y pixel

        if (col < width && row < height) {
            int pixelIdx = row * width + col;
            int rgbIdx = pixelIdx * 3;  // RGB has 3 channels

            unsigned char r = rgb[rgbIdx + 0];
            unsigned char g = rgb[rgbIdx + 1];
            unsigned char b = rgb[rgbIdx + 2];

            // Luminosity formula (ITU-R BT.601)
            gray[pixelIdx] = (unsigned char)(0.299f*r + 0.587f*g + 0.114f*b);
        }
    }

    int main() {
        int width = 1920, height = 1080;

        // 2D block: 16x16 = 256 threads per block
        dim3 blockSize(16, 16);
        dim3 gridSize((width + 15) / 16, (height + 15) / 16);
        // gridSize = (120, 68) → 120×68 = 8160 blocks
        // Total threads = 8160 × 256 = 2,088,960 (covers all 2,073,600 pixels)

        // ... (memory allocation, copy, etc.) ...

        grayscale<<<gridSize, blockSize>>>(d_rgb, d_gray, width, height);
        cudaDeviceSynchronize();
        return 0;
    }
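To see why the kernel needs the `col < width && row < height` guard, it can help to replay the same index math on the CPU. The following is a minimal host-only sketch, not part of the exercise: `CoverageReport`, `checkCoverage`, and `blockSide` are hypothetical names, and the loops only imitate how a square-block 2D grid enumerates threads.

```cpp
#include <cstddef>
#include <vector>

// Host-only emulation of the kernel's index math (no GPU required).
struct CoverageReport {
    long long launchedThreads;  // every thread in the grid, including idle ones
    long long activeThreads;    // threads that pass the bounds guard
    bool everyPixelOnce;        // true if each pixel was visited exactly once
};

// blockSide is the side length of a square block (16 in the lesson).
CoverageReport checkCoverage(int width, int height, int blockSide) {
    const int gridX = (width + blockSide - 1) / blockSide;   // ceil(width / blockSide)
    const int gridY = (height + blockSide - 1) / blockSide;  // ceil(height / blockSide)

    std::vector<int> hits(static_cast<std::size_t>(width) * height, 0);
    CoverageReport report{0, 0, true};

    // Outer loops stand in for blockIdx, inner loops for threadIdx.
    for (int by = 0; by < gridY; ++by) {
        for (int bx = 0; bx < gridX; ++bx) {
            for (int ty = 0; ty < blockSide; ++ty) {
                for (int tx = 0; tx < blockSide; ++tx) {
                    ++report.launchedThreads;
                    const int col = tx + bx * blockSide;
                    const int row = ty + by * blockSide;
                    if (col < width && row < height) {  // same guard as the kernel
                        ++hits[static_cast<std::size_t>(row) * width + col];
                        ++report.activeThreads;
                    }
                }
            }
        }
    }

    for (int h : hits) {
        if (h != 1) report.everyPixelOnce = false;
    }
    return report;
}
```

Calling `checkCoverage(1920, 1080, 16)` counts 2,088,960 launched threads but only 2,073,600 active ones: 1080 is not a multiple of 16, so the last row of blocks reaches down to row 1087 and 15,360 threads fall outside the image. Without the guard, those threads would read and write past the end of the buffers.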

Why 16×16 blocks (256 threads)?

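  • 🔲 256 = 8 warps of 32 threads each, so no warp is left partially filled
  • 📊 A 16×16 tile is small enough for shared memory (a 16×16 float tile is 1 KB), which suits tiling optimizations
  • ⚙️ Good occupancy: each SM can run several 256-thread blocks at once
  • 🏆 Common default: 16×16 is a widely used starting point for 2D CUDA kernels; other shapes such as 32×8 can be faster for some workloads, so measure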
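The warp and shared-memory arithmetic behind these points can be sketched in a few lines of host code. This is an illustrative sketch only: `kWarpSize`, `warpsPerBlock`, `warpOfThread`, and `tileBytes` are hypothetical helpers, and it assumes the usual NVIDIA warp size of 32 with threads in a block linearized x-first (`threadIdx.x` varies fastest).

```cpp
// Warp size on current NVIDIA GPUs (assumed here to be 32).
constexpr int kWarpSize = 32;

// Number of warps a blockX × blockY block occupies, rounded up.
constexpr int warpsPerBlock(int blockX, int blockY) {
    return (blockX * blockY + kWarpSize - 1) / kWarpSize;
}

// Warp index of thread (tx, ty) inside its block.
// The linear thread id is ty * blockX + tx, and each warp takes 32 consecutive ids.
constexpr int warpOfThread(int tx, int ty, int blockX) {
    return (ty * blockX + tx) / kWarpSize;
}

// Shared-memory bytes for a tile holding one element per thread.
constexpr int tileBytes(int blockX, int blockY, int bytesPerElement) {
    return blockX * blockY * bytesPerElement;
}
```

For a 16×16 block, `warpsPerBlock(16, 16)` is 8 and each warp covers two full rows of the tile (rows 0 and 1 form warp 0, rows 2 and 3 form warp 1, and so on). By contrast, a 10×10 block has 100 threads and still occupies 4 warps, leaving 28 lanes of the last warp idle.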

🔑 Common 2D CUDA applications:

  • 🖼️ Image filtering, convolution (CNNs!), color space conversion
  • 🧮 Matrix operations (multiply, transpose, addition)
  • 🌊 2D simulations (fluid dynamics, heat diffusion)
  • 🎮 Game physics, collision detection
📋 Instructions
Calculate and print the 2D grid configurations for common image processing workloads:

```
=== 2D CUDA Grid Calculator ===
Block size: 16x16 (256 threads/block)
720p (1280x720): grid=80x45, blocks=3600, threads=921600
1080p (1920x1080): grid=120x68, blocks=8160, threads=2088960
4K (3840x2160): grid=240x135, blocks=32400, threads=8294400
```
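This code is almost complete! The printImageConfig function already implements the grid calculation. Just run it to see the configurations. The key formula is gridX = ceil(width / blockSize) and gridY = ceil(height / blockSize). In integer arithmetic, a plain width / blockSize would round down, so the ceiling is written as (width + blockSize - 1) / blockSize, which is where the (width + 15) / 16 in the example comes from.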
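Since the exercise's own printImageConfig is not shown here, the following is a rough sketch of what such a calculation might look like, assuming square blocks. `GridConfig`, `computeGridConfig`, and `formatConfig` are hypothetical names chosen for illustration, not the exercise's actual code.

```cpp
#include <cstdio>
#include <string>

// Result of the grid calculation for one image size.
struct GridConfig {
    int gridX;          // blocks along x
    int gridY;          // blocks along y
    int blocks;         // gridX * gridY
    long long threads;  // blocks * threads per block
};

// Ceiling division written with integers only: (n + d - 1) / d.
GridConfig computeGridConfig(int width, int height, int blockSize) {
    GridConfig c;
    c.gridX = (width + blockSize - 1) / blockSize;
    c.gridY = (height + blockSize - 1) / blockSize;
    c.blocks = c.gridX * c.gridY;
    c.threads = static_cast<long long>(c.blocks) * blockSize * blockSize;
    return c;
}

// Builds one report line in the style of the expected output.
std::string formatConfig(const char* name, int width, int height, int blockSize) {
    const GridConfig c = computeGridConfig(width, height, blockSize);
    char line[128];
    std::snprintf(line, sizeof line,
                  "%s (%dx%d): grid=%dx%d, blocks=%d, threads=%lld",
                  name, width, height, c.gridX, c.gridY, c.blocks, c.threads);
    return std::string(line);
}
```

Printing `formatConfig("720p", 1280, 720, 16)`, `formatConfig("1080p", 1920, 1080, 16)`, and `formatConfig("4K", 3840, 2160, 16)` reproduces the three lines of the report. Note that 720p and 4K divide evenly by 16, so their thread counts equal their pixel counts; only 1080p launches surplus threads.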