Exercise 16: Global Memory (10 XP, Easy)


💾 Chapter 4: GPU Memory — The Treasure Map

💡 Story: Your GPU has different types of memory — like a kingdom with different storage systems! There's the Capital's main warehouse (Global Memory), a neighborhood pantry (Shared Memory), individual soldiers' pockets (Registers), and a royal broadcast (Constant Memory). Using the right storage at the right time is the difference between a fast GPU and a slow one!

🌐 Global Memory — The Main Warehouse

  • 📦 Size — Largest (8-80 GB on modern GPUs like H100)
  • 🐌 Speed — Slowest of all GPU memories (~600-2000 GB/s bandwidth)
  • 🌍 Scope — Accessible by ALL threads in ALL blocks
  • 📝 Persistence — Exists for the duration of the program
  • 🔧 How to use — cudaMalloc, cudaFree, cudaMemcpy
```c
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void doubleArray(int* d_arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        d_arr[i] *= 2;  // Reading & writing global memory
    }
}

int main() {
    int n = 5;
    int h_arr[] = {1, 2, 3, 4, 5};  // Host (CPU) array
    int* d_arr;                     // Device (GPU) pointer

    // Step 1: Allocate GPU memory
    cudaMalloc(&d_arr, n * sizeof(int));  // Like malloc() but on GPU

    // Step 2: Copy CPU → GPU
    cudaMemcpy(d_arr, h_arr, n * sizeof(int), cudaMemcpyHostToDevice);

    // Step 3: Launch kernel
    doubleArray<<<1, n>>>(d_arr, n);
    cudaDeviceSynchronize();

    // Step 4: Copy GPU → CPU
    cudaMemcpy(h_arr, d_arr, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Step 5: Free GPU memory
    cudaFree(d_arr);

    // Print results
    for (int i = 0; i < n; i++) printf("%d ", h_arr[i]);
    printf("\n");  // Output: 2 4 6 8 10

    return 0;
}
```

📋 The CUDA Memory Workflow (ALWAYS in this order):

  • 1️⃣ cudaMalloc(&d_ptr, size) — Allocate GPU memory
  • 2️⃣ cudaMemcpy(d_ptr, h_ptr, size, H2D) — Upload data CPU→GPU
  • 3️⃣ kernel<<<blocks, threadsPerBlock>>>(d_ptr) — Process on GPU
  • 4️⃣ cudaMemcpy(h_ptr, d_ptr, size, D2H) — Download results GPU→CPU
  • 5️⃣ cudaFree(d_ptr) — Release GPU memory

Performance tip: Global memory accesses are expensive! A cache miss can cost 600-800 clock cycles. This is why shared memory (next exercise) is so valuable — it's 100× faster!

📋 Instructions
Write a program that simulates the full CUDA memory workflow. Since we're running in a regular C environment, simulate it using host arrays and print at each step:

```
=== CUDA Global Memory Workflow ===
Step 1: Allocate GPU memory (5 ints)
Step 2: Copy to GPU: 1 2 3 4 5
Step 3: Kernel runs: doubles each element
Step 4: Copy from GPU: 2 4 6 8 10
Step 5: GPU memory freed
Done!
```
This program is already complete! Just look at the code structure — it shows the exact CUDA workflow: Allocate → Upload → Process → Download → Free. Run it to see the output.