CUDA Programming: Threads, Blocks & Grids
Exercise 11: Threads & Blocks

🏗️ Chapter 3: The Structure of Your GPU Army

💡 Story: An army isn't just thousands of random soldiers. It's organized! Soldiers form squads, squads form platoons, and platoons form a division. In CUDA terms: soldiers are Threads, squads are Blocks, and the whole division is the Grid. This hierarchy makes commanding 10,000+ threads manageable!

The 3-Level CUDA Hierarchy:

  • 🧵 Thread — The smallest unit. Executes one instance of the kernel. Has its own registers.
  • 📦 Block — A group of up to 1024 threads. Threads in the same block can share memory and synchronize.
  • 🌐 Grid — The collection of ALL blocks. All blocks run the same kernel but independently.
```
// Visualizing the hierarchy:
//
// GRID (entire launch: <<<4, 256>>>)
// ├── Block 0 (256 threads)
// │   ├── Thread 0
// │   ├── Thread 1
// │   ├── ...
// │   └── Thread 255
// ├── Block 1 (256 threads)
// │   ├── Thread 0
// │   └── ...
// ├── Block 2
// └── Block 3
//
// KEY: Thread numbering RESETS in each block!
//   Block 0 has Thread 0..255
//   Block 1 ALSO has Thread 0..255
//   That's why we need the Global ID formula!
```

Why have blocks at all? Why not just one giant block?

  • 💾 Shared memory is per-block — Only threads in the SAME block can share memory
  • 🔄 Synchronization is per-block — You can only sync threads within the same block
  • ⚙️ SM mapping — Blocks are assigned to Streaming Multiprocessors (SMs). A block stays on one SM for its entire lifetime
  • 📏 Hardware limit — A block can hold at most 1024 threads
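The first two points above can be sketched in one kernel. This is a hypothetical per-block sum (the kernel name, buffer size, and variable names are illustrative, not from the exercise): each block reduces its own 256 elements through a `__shared__` buffer, and `__syncthreads()` only waits for the threads of that block — which is exactly why sharing and syncing stop at the block boundary.

```cuda
#include <stdio.h>

// Sketch: each block sums its OWN 256 input elements.
__global__ void blockSum(const float *in, float *blockResults) {
    __shared__ float partial[256];   // one separate copy PER BLOCK

    int t = threadIdx.x;
    partial[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                 // barrier for THIS block only

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }
    if (t == 0) blockResults[blockIdx.x] = partial[0];
}
// Launch: blockSum<<<numBlocks, 256>>>(d_in, d_out);
```

Threads in Block 0 can never read Block 1's `partial[]`, and there is no barrier that spans two blocks — combining the per-block results requires a second kernel launch (or atomics).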
```
__global__ void hierarchyDemo() {
    // Every thread knows its place in the hierarchy:
    printf("Grid: %d blocks | Block: %d | Thread: %d | Global: %d\n",
           gridDim.x,                              // Total blocks in grid
           blockIdx.x,                             // Which block am I in?
           threadIdx.x,                            // Where am I within my block?
           threadIdx.x + blockIdx.x * blockDim.x   // My unique global ID
    );
}

// Launch: hierarchyDemo<<<3, 4>>>();
```
📋 Instructions

Simulate the thread hierarchy by printing a visual tree. For a launch of 2 blocks, 3 threads each:

```
=== CUDA Thread Hierarchy ===
GRID: 2 blocks x 3 threads = 6 total threads
  Block 0:
    Thread 0 (Global ID: 0)
    Thread 1 (Global ID: 1)
    Thread 2 (Global ID: 2)
  Block 1:
    Thread 0 (Global ID: 3)
    Thread 1 (Global ID: 4)
    Thread 2 (Global ID: 5)
```
Global ID = t + b * threadsPerBlock — same formula as threadIdx.x + blockIdx.x * blockDim.x on the actual GPU.
⚠️ Try solving it yourself first — you'll learn more!
#include <stdio.h>

int main() {
    int numBlocks = 2, threadsPerBlock = 3;
    int total = numBlocks * threadsPerBlock;
    printf("=== CUDA Thread Hierarchy ===\n");
    printf("GRID: %d blocks x %d threads = %d total threads\n", numBlocks, threadsPerBlock, total);
    for (int b = 0; b < numBlocks; b++) {
        printf("  Block %d:\n", b);
        for (int t = 0; t < threadsPerBlock; t++) {
            int globalId = t + b * threadsPerBlock;
            printf("    Thread %d (Global ID: %d)\n", t, globalId);
        }
    }
    return 0;
}