Exercise 16: Global Memory (10 XP, Easy)


💾 Chapter 4: GPU Memory — The Treasure Map

💡 Story: Your GPU has different types of memory — like a kingdom with different storage systems! There's the Capital's main warehouse (Global Memory), a neighborhood pantry (Shared Memory), individual soldiers' pockets (Registers), and a royal broadcast (Constant Memory). Using the right storage at the right time is the difference between a fast GPU and a slow one!

🌐 Global Memory — The Main Warehouse

  • 📦 Size — Largest (8-80 GB on modern GPUs like H100)
  • 🐌 Speed — Slowest of all GPU memories (~600-2000 GB/s bandwidth)
  • 🌍 Scope — Accessible by ALL threads in ALL blocks
  • 📝 Persistence — Exists for the duration of the program
  • 🔧 How to use — cudaMalloc, cudaFree, cudaMemcpy
```c
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void doubleArray(int* d_arr, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        d_arr[i] *= 2;  // Reading & writing global memory
    }
}

int main() {
    int n = 5;
    int h_arr[] = {1, 2, 3, 4, 5};  // Host (CPU) array
    int* d_arr;                     // Device (GPU) pointer

    // Step 1: Allocate GPU memory
    cudaMalloc(&d_arr, n * sizeof(int));  // Like malloc() but on GPU

    // Step 2: Copy CPU → GPU
    cudaMemcpy(d_arr, h_arr, n * sizeof(int), cudaMemcpyHostToDevice);

    // Step 3: Launch kernel
    doubleArray<<<1, n>>>(d_arr, n);
    cudaDeviceSynchronize();

    // Step 4: Copy GPU → CPU
    cudaMemcpy(h_arr, d_arr, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Step 5: Free GPU memory
    cudaFree(d_arr);

    // Print results
    for (int i = 0; i < n; i++) printf("%d ", h_arr[i]);
    printf("\n");  // Output: 2 4 6 8 10

    return 0;
}
```

📋 The CUDA Memory Workflow (ALWAYS in this order):

  • 1️⃣ cudaMalloc(&d_ptr, size) — Allocate GPU memory
  • 2️⃣ cudaMemcpy(d_ptr, h_ptr, size, H2D) — Upload data CPU→GPU
  • 3️⃣ kernel<<<blocks, threadsPerBlock>>>(d_ptr) — Process on GPU
  • 4️⃣ cudaMemcpy(h_ptr, d_ptr, size, D2H) — Download results GPU→CPU
  • 5️⃣ cudaFree(d_ptr) — Release GPU memory

Performance tip: Global memory accesses are expensive! A cache miss can cost 600-800 clock cycles. This is why shared memory (next exercise) is so valuable — it's 100× faster!

📋 Instructions
Write a program that simulates the full CUDA memory workflow. Since we're running in a regular C environment, simulate it using host arrays and print at each step:

```
=== CUDA Global Memory Workflow ===
Step 1: Allocate GPU memory (5 ints)
Step 2: Copy to GPU: 1 2 3 4 5
Step 3: Kernel runs: doubles each element
Step 4: Copy from GPU: 2 4 6 8 10
Step 5: GPU memory freed
Done!
```
This program is already complete! Just look at the code structure — it shows the exact CUDA workflow: Allocate → Upload → Process → Download → Free. Run it to see the output.