CUDA Programming: Your First CUDA Kernel
Exercise 9: Knowing Your Thread ID (15 XP, Easy)


🆔 Chapter 2, Part 4: Every Thread Knows Who It Is

💡 Story: Your 1,024 GPU soldiers are deployed. How does each soldier know which part of the battlefield to cover? Simple — each soldier has a unique ID badge! In CUDA, that badge is the global thread index.

CUDA provides built-in variables available inside every kernel:

```c
// Available inside any __global__ kernel:
threadIdx.x  // thread's position within its BLOCK (0-based)
blockIdx.x   // which BLOCK this thread belongs to (0-based)
blockDim.x   // number of threads per block
gridDim.x    // total number of blocks in the grid

// Computing the GLOBAL thread ID (the most important formula in CUDA!):
int globalId = threadIdx.x + blockIdx.x * blockDim.x;
//             ^ position    ^ which      ^ how big
//               in block      block?       is each block?
```

📖 Visualizing it:

```c
// Launch: <<<3, 4>>> = 3 blocks, 4 threads each = 12 threads total
//
// Block 0: Thread 0, 1, 2, 3 → Global IDs: 0, 1, 2, 3
// Block 1: Thread 0, 1, 2, 3 → Global IDs: 4, 5, 6, 7
// Block 2: Thread 0, 1, 2, 3 → Global IDs: 8, 9, 10, 11
//
// Formula: globalId = threadIdx.x + blockIdx.x * blockDim.x
//
// Block 1, Thread 2: globalId = 2 + 1 * 4 = 6  ✓
// Block 2, Thread 3: globalId = 3 + 2 * 4 = 11 ✓
```
```c
#include <stdio.h>

__global__ void showThreadInfo() {
    int globalId = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Block %d, Thread %d -> Global ID: %d\n",
           blockIdx.x,   // which block
           threadIdx.x,  // thread within block
           globalId);    // globally unique ID
}

int main() {
    showThreadInfo<<<3, 4>>>();  // 3 blocks, 4 threads each
    cudaDeviceSynchronize();     // wait for the kernel's printf output
    return 0;
}
```

🔑 Why this formula matters so much: when you have an array of n elements and launch n threads, each thread uses its globalId as an array index: thread 0 processes element 0, thread 1 processes element 1, and so on. This is the foundation of virtually every CUDA program!

```c
// Pattern used in EVERY CUDA program:
__global__ void processArray(float* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // MEMORIZE this!
    if (i < n) {                 // safety check (last block may have extra threads)
        arr[i] = arr[i] * 2.0f;  // thread i processes element i
    }
}
```
📋 Instructions
Given a kernel launch configuration, compute the global thread IDs. Write a C program that simulates the CUDA thread ID computation (no actual GPU needed). For a launch of **2 blocks, 4 threads per block**, print what each thread's global ID would be:

```
=== Thread ID Simulation ===
Block 0, Thread 0: Global ID = 0
Block 0, Thread 1: Global ID = 1
Block 0, Thread 2: Global ID = 2
Block 0, Thread 3: Global ID = 3
Block 1, Thread 0: Global ID = 4
Block 1, Thread 1: Global ID = 5
Block 1, Thread 2: Global ID = 6
Block 1, Thread 3: Global ID = 7
```

Use nested loops to simulate it!
The formula is `int globalId = threadId + blockId * threadsPerBlock;`. This is exactly what CUDA's `threadIdx.x + blockIdx.x * blockDim.x` computes on the GPU!
⚠️ Try solving it yourself first — you'll learn more!
```c
#include <stdio.h>

int main() {
    int numBlocks = 2;
    int threadsPerBlock = 4;
    printf("=== Thread ID Simulation ===\n");
    for (int blockId = 0; blockId < numBlocks; blockId++) {
        for (int threadId = 0; threadId < threadsPerBlock; threadId++) {
            int globalId = threadId + blockId * threadsPerBlock;
            printf("Block %d, Thread %d: Global ID = %d\n", blockId, threadId, globalId);
        }
    }
    return 0;
}
```
🧪 Test Cases
| Input | Expected | Note |
|---|---|---|
| Block 0, Thread 0 | Global ID = 0 | First thread |
| Block 1, Thread 0 | Global ID = 4 | First thread of block 1 |
| Block 1, Thread 3 | Global ID = 7 | Last thread |