CUDA Programming: Threads, Blocks & Grids
Exercise 11: Threads & Blocks

🏗️ Chapter 3: The Structure of Your GPU Army

💡 Story: An army isn't just thousands of random soldiers. It's organized! Soldiers form squads, squads form platoons, and platoons form a division. In CUDA terms: soldiers are Threads, squads are Blocks, and the whole division is the Grid. This hierarchy makes commanding 10,000+ threads manageable!

The 3-Level CUDA Hierarchy:

  • 🧵 Thread — The smallest unit. Executes one instance of the kernel. Has its own registers.
  • 📦 Block — A group of up to 1024 threads. Threads in the same block can share memory and synchronize.
  • 🌐 Grid — The collection of ALL blocks. All blocks run the same kernel but independently.
```
// Visualizing the hierarchy:
//
// GRID (entire launch: <<<4, 256>>>)
// ├── Block 0 (256 threads)
// │   ├── Thread 0
// │   ├── Thread 1
// │   ├── ...
// │   └── Thread 255
// ├── Block 1 (256 threads)
// │   ├── Thread 0
// │   └── ...
// ├── Block 2
// └── Block 3
//
// KEY: Thread numbering RESETS in each block!
//   Block 0 has Thread 0..255
//   Block 1 ALSO has Thread 0..255
//   That's why we need the Global ID formula!
```

Why have blocks at all? Why not just one giant block?

  • 💾 Shared memory is per-block — Only threads in the SAME block can share memory
  • 🔄 Synchronization is per-block — You can only sync threads within the same block
  • ⚙️ SM mapping — Blocks are assigned to Streaming Multiprocessors (SMs). A block stays on one SM for its entire lifetime
  • 📏 Hardware limit — A block can hold at most 1024 threads
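The first two points above can be sketched in one kernel. This is a hypothetical per-block sum (the kernel name, buffer size, and variable names are illustrative, not from the exercise): each block reduces its own 256 elements through a `__shared__` buffer, and `__syncthreads()` only waits for the threads of that block — which is exactly why sharing and syncing stop at the block boundary.

```cuda
#include <stdio.h>

// Sketch: each block sums its OWN 256 input elements.
__global__ void blockSum(const float *in, float *blockResults) {
    __shared__ float partial[256];   // one separate copy PER BLOCK

    int t = threadIdx.x;
    partial[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                 // barrier for THIS block only

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }
    if (t == 0) blockResults[blockIdx.x] = partial[0];
}
// Launch: blockSum<<<numBlocks, 256>>>(d_in, d_out);
```

Threads in Block 0 can never read Block 1's `partial[]`, and there is no barrier that spans two blocks — combining the per-block results requires a second kernel launch (or atomics).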
```
__global__ void hierarchyDemo() {
    // Every thread knows its place in the hierarchy:
    printf("Grid: %d blocks | Block: %d | Thread: %d | Global: %d\n",
           gridDim.x,                              // Total blocks in grid
           blockIdx.x,                             // Which block am I in?
           threadIdx.x,                            // Where am I within my block?
           threadIdx.x + blockIdx.x * blockDim.x   // My unique global ID
    );
}

// Launch: hierarchyDemo<<<3, 4>>>();
```
📋 Instructions

Simulate the thread hierarchy by printing a visual tree. For a launch of 2 blocks, 3 threads each:

```
=== CUDA Thread Hierarchy ===
GRID: 2 blocks x 3 threads = 6 total threads
  Block 0:
    Thread 0 (Global ID: 0)
    Thread 1 (Global ID: 1)
    Thread 2 (Global ID: 2)
  Block 1:
    Thread 0 (Global ID: 3)
    Thread 1 (Global ID: 4)
    Thread 2 (Global ID: 5)
```
Global ID = t + b * threadsPerBlock — same formula as threadIdx.x + blockIdx.x * blockDim.x on the actual GPU.
⚠️ Try solving it yourself first — you'll learn more!
#include <stdio.h>

int main() {
    int numBlocks = 2, threadsPerBlock = 3;
    int total = numBlocks * threadsPerBlock;
    printf("=== CUDA Thread Hierarchy ===\n");
    printf("GRID: %d blocks x %d threads = %d total threads\n", numBlocks, threadsPerBlock, total);
    for (int b = 0; b < numBlocks; b++) {
        printf("  Block %d:\n", b);
        for (int t = 0; t < threadsPerBlock; t++) {
            int globalId = t + b * threadsPerBlock;
            printf("    Thread %d (Global ID: %d)\n", t, globalId);
        }
    }
    return 0;
}