🆔 Chapter 2, Part 4: Every Thread Knows Who It Is
💡 Story: Your 1,024 GPU soldiers are deployed. How does each soldier know which part of the battlefield to cover? Simple — each soldier has a unique ID badge! In CUDA, that badge is the global thread index.
CUDA provides built-in variables available inside every kernel:
```c
// Available inside any __global__ kernel:
threadIdx.x  // thread's position within its BLOCK (0-based)
blockIdx.x   // which BLOCK this thread belongs to (0-based)
blockDim.x   // number of threads per block
gridDim.x    // number of blocks in the grid

// Computing the GLOBAL thread ID (the most important formula in CUDA!):
int globalId = threadIdx.x + blockIdx.x * blockDim.x;
//             ^ position    ^ which      ^ threads
//               in block      block?       per block
```
```c
// Launch: <<<3, 4>>> = 3 blocks, 4 threads each = 12 threads total
//
// Block 0: Thread 0, 1, 2, 3 → Global IDs: 0, 1, 2, 3
// Block 1: Thread 0, 1, 2, 3 → Global IDs: 4, 5, 6, 7
// Block 2: Thread 0, 1, 2, 3 → Global IDs: 8, 9, 10, 11
//
// Formula: globalId = threadIdx.x + blockIdx.x * blockDim.x
//
// Block 1, Thread 2: globalId = 2 + 1 * 4 = 6 ✓
// Block 2, Thread 3: globalId = 3 + 2 * 4 = 11 ✓
```
```c
#include <stdio.h>

__global__ void showThreadInfo() {
    int globalId = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Block %d, Thread %d -> Global ID: %d\n",
           blockIdx.x,   // which block
           threadIdx.x,  // thread within block
           globalId);    // global unique ID
}

int main() {
    showThreadInfo<<<3, 4>>>();  // 3 blocks, 4 threads each
    cudaDeviceSynchronize();     // wait for the GPU to finish printing
    return 0;
}
```

Note: blocks execute in no guaranteed order, so the printed lines may appear out of order from run to run.
🔑 Why this formula matters so much: When you have an array of `n` elements and you launch `n` threads, each thread uses its `globalId` as an array index: thread 0 processes element 0, thread 1 processes element 1, and so on. This is the foundation of ALL CUDA programs!
```c
// Pattern used in EVERY CUDA program:
__global__ void processArray(float* arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // MEMORIZE this!
    if (i < n) {  // safety check: the last block may have extra threads
        arr[i] = arr[i] * 2.0f;  // thread i processes element i
    }
}
```
📋 Instructions
Given a kernel launch configuration, compute the global thread IDs.
Write a C program that simulates the CUDA thread ID computation (no actual GPU needed). For a launch of **2 blocks, 4 threads per block**, print what each thread's global ID would be:
```
=== Thread ID Simulation ===
Block 0, Thread 0: Global ID = 0
Block 0, Thread 1: Global ID = 1
Block 0, Thread 2: Global ID = 2
Block 0, Thread 3: Global ID = 3
Block 1, Thread 0: Global ID = 4
Block 1, Thread 1: Global ID = 5
Block 1, Thread 2: Global ID = 6
Block 1, Thread 3: Global ID = 7
```
Use nested loops to simulate it!
The formula is `int globalId = threadId + blockId * threadsPerBlock;`, which is exactly what CUDA's `threadIdx.x + blockIdx.x * blockDim.x` computes on the GPU!
⚠️ Try solving it yourself first — you'll learn more!
```c
#include <stdio.h>

int main() {
    int numBlocks = 2;
    int threadsPerBlock = 4;

    printf("=== Thread ID Simulation ===\n");
    for (int blockId = 0; blockId < numBlocks; blockId++) {
        for (int threadId = 0; threadId < threadsPerBlock; threadId++) {
            int globalId = threadId + blockId * threadsPerBlock;
            printf("Block %d, Thread %d: Global ID = %d\n", blockId, threadId, globalId);
        }
    }
    return 0;
}
```