Intermediate High Performance Computing (HPC) training material
This intermediate training is designed for users who are already comfortable with the Linux command line and basic job submission but need to optimize their workflows for performance, efficiency, and scalability. It is intended for users whose background is equivalent to the material covered in the Introduction to High Performance Computing course.
By the end of this session, students will be able to:
- Install and manage GPU-specific Python packages using Conda on HPC
- Request and use interactive GPU nodes correctly
- Run multiple tasks within a single SLURM job to work within QOS limits
- Explain Amdahl's Law and perform a basic scaling study
- Identify the "sweet spot" for core count in their own workflows
Installing Packages Using Conda
Why Use Conda on HPC?
On HPC clusters, you do not have administrator privileges. This means you cannot use sudo apt install or system-wide pip install to add packages. Conda solves this by giving you a fully self-contained environment in your home directory that you control entirely.
Benefits of Conda on HPC:
- Install any Python package without admin rights
- Create isolated environments per project to avoid dependency conflicts
- Easily reproduce environments across different machines
- Install GPU-specific packages that link against the correct CUDA version
We will show you an example of how to install a GPU-specific package using Conda.
Understanding GPU Nodes on HPC
Before installing GPU packages, you need to understand that:
- Login nodes do not have GPUs; they are for editing files and submitting jobs only
- GPU nodes are special compute nodes with NVIDIA GPUs attached
- GPU packages must be installed on a GPU node so Conda can detect the correct CUDA version
Installing a GPU package on a login node may result in a CPU-only build being installed, even if you request GPU support. Always install GPU packages interactively on a GPU node.
Step 1: Check Available GPU Partitions
Before requesting a GPU node, find out which partitions have GPUs:
# List all partitions and look for GPU-related ones
sinfo -o "%P %G %N" | grep -i gpu
Example output:
test gpu:nvidia_h200_nvl:2 gnode[026-029]
test gpu:l40s:1 gnode025
test gpu:a100:2 gnode[001-010]
gpu gpu:a100:2 gnode[001-008]
cenvalarc.gpu gpu:nvidia_h200_nvl:2 gnode[026-029]
You can also check what GPU resources are available on the gpu partition:
# Show GPU node details
sinfo -p gpu --Format=NodeList,Gres,GresUsed,StateCompact
Example output:
NODELIST GRES GRES_USED STATE
gnode[002,006] gpu:a100:2 gpu:a100:1(IDX:0) mix
gnode[003-004] gpu:a100:2 gpu:a100:2(IDX:0-1) alloc
gnode[001,005,007-008] gpu:a100:2 gpu:a100:0(IDX:N/A) idle
To make it easier to check GPU availability at a glance, you can use the helper script below. It filters for idle or mixed nodes and shows how many GPUs are free on each.
⬇️ Download check_gpu.sh
Transfer the file to Pinnacles using scp:
scp check_gpu.sh <username>@login.rc.ucmerced.edu:/your/path
After transferring, make it executable and run it on Pinnacles:
chmod +x check_gpu.sh
./check_gpu.sh
Example output:
=== Available GPUs on Pinnacles ===
Partition: test
Node: gnode016
Type: l40s
Free: 4 / 4 GPUs
State: idle
Partition: test
Node: gnode014
Type: h100
Free: 1 / 1 GPUs
State: mix
Partition: gpu
Node: gnode[001,005]
Type: a100
Free: 2 / 2 GPUs
State: idle
Step 2: Request an Interactive GPU Node
To install GPU packages correctly, you need to start an interactive session on a GPU node. This gives you a live terminal on the node where GPUs are physically present.
srun --partition=gpu \
--gres=gpu:1 \
--ntasks=1 \
--cpus-per-task=4 \
--mem=8G \
--time=01:00:00 \
--pty bash
If you want to target a specific node from the check_gpu.sh output above, add --nodelist:
srun --partition=gpu \
--nodelist=gnode016 \
--gres=gpu:1 \
--ntasks=1 \
--cpus-per-task=4 \
--mem=8G \
--time=01:00:00 \
--pty bash
Replace gnode016 with the node name shown under Node in the script output, and replace gpu with the corresponding Partition value.
What each flag means
| Flag | Description |
|---|---|
--partition=gpu | The partition name |
--nodelist=gnode016 | (Optional) Pin to a specific node with available GPUs |
--gres=gpu:1 | Request 1 GPU device |
--ntasks=1 | Run 1 task |
--cpus-per-task=4 | Request 4 CPU cores to pair with the GPU |
--mem=8G | Request 8GB of RAM |
--time=01:00:00 | Reserve the node for 1 hour (enough time to install packages) |
--pty bash | Open an interactive bash terminal |
Once the session starts, your prompt will change to show you are on a compute node:
[yyu49@gnode002 ~]$
Verify the GPU is visible
Once on the GPU node, confirm you can see the GPU:
nvidia-smi
Expected output
Tue Mar 17 14:29:33 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:17:00.0 Off | 0 |
| N/A 25C P0 32W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Note the CUDA Version shown; you will need it later when installing CUDA-dependent packages.
Step 3: Load the Conda Module
On our HPC clusters, Conda is available as a module. Load it before creating any environments:
module load anaconda3
Verify Conda is loaded:
conda --version
Step 4: Create a New Conda Environment
Always create a dedicated environment for GPU work. This keeps your GPU packages isolated from other projects and avoids version conflicts.
conda create -n gpu_training python=3.10 -y
What this does
| Part | Explanation |
|---|---|
conda create | Creates a new environment |
-n gpu_training | Names the environment gpu_training |
python=3.10 | Installs Python 3.10 into the environment |
-y | Skips the confirmation prompt |
Activate the new environment:
source activate gpu_training
Your prompt will update to show the active environment:
(gpu_training) [yyu49@gnode001 ~]$
Step 5: Install a GPU Package — Numba
For this training, we will install Numba, a lightweight GPU-accelerated Python package. Numba is an excellent choice for demonstration purposes:
- It is small (~400 MB including the CUDA toolkit) compared to PyTorch (~3+ GB) or TensorFlow (~2+ GB)
- It uses simple Python decorators, so there is no new language to learn
- It clearly demonstrates the concept of moving computation to the GPU
- It is widely used in scientific computing
Install Numba with CUDA support
conda install -c conda-forge numba cudatoolkit -y
What is being installed
| Package | Purpose |
|---|---|
numba | The GPU-accelerated JIT compiler |
cudatoolkit | NVIDIA CUDA runtime libraries that Numba needs to talk to the GPU |
Tip: The -c conda-forge flag tells Conda to use the conda-forge channel, which has the most up-to-date and compatible builds of these packages.
NumPy version conflict: numba may pull in a version of NumPy that is too new and incompatible. If you see errors like numpy.ufunc has no attribute... or similar, pin NumPy to a compatible version:
conda install -c conda-forge numba cudatoolkit "numpy<2.0" -y
Step 6: Verify the GPU Installation
Once installed, verify that Numba can see the GPU:
python -c "from numba import cuda; print(cuda.gpus)"
Expected output:
<Managed Device 0>
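To go one step further and see Numba actually moving work to the GPU, you can run a small test like the sketch below. This is an illustrative example rather than part of the installation: the kernel, array size, and file name (gpu_test.py) are arbitrary choices. Save it and run it inside the gpu_training environment on the GPU node with `python gpu_test.py`.

```python
# gpu_test.py - minimal sketch of offloading an element-wise computation to the GPU with Numba.
import numpy as np
from numba import cuda

# A trivial CUDA kernel: each GPU thread adds one pair of elements
@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)          # global thread index
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)

# Copy the inputs to GPU memory and allocate the output there
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(a)

# Launch enough blocks of 256 threads to cover all n elements
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](d_a, d_b, d_out)

out = d_out.copy_to_host()
print(out[:5])   # expected: [2. 2. 2. 2. 2.]
```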
Managing Your Conda Environment
List all your environments
conda env list
List packages in active environment
conda list
Deactivate an environment
conda deactivate
Remove an environment (to free storage)
conda env remove -n gpu_training
Conda environments can use significant disk space. Check your quota regularly with quota -s and remove environments you no longer need.
Check how much space an environment uses
du -sh ~/.conda/envs/gpu_training
Check overall storage quota
quota -s
This shows your current usage and limits for HOME, data, and scratch in a human-readable format.
Common Issues and Solutions
| Problem | Likely Cause | Solution |
|---|---|---|
No CUDA-capable device found | Installed on login node, not GPU node | Reinstall inside an interactive GPU session |
conda: command not found | Module not loaded | Run module load anaconda3 first |
PackagesNotFoundError | Wrong channel | Add -c conda-forge to your install command |
Disk quota exceeded | Environment too large | Remove unused environments with conda env remove |
GPU visible in nvidia-smi but not Numba | CUDA version mismatch | Specify cudatoolkit=XX.X matching your driver |
Running Multiple Jobs Without Job Arrays
Overview
On the Pinnacles cluster, Quality of Service (QOS) limits restrict the number of jobs a user can have in the queue or running at one time.
Because of this policy, large Slurm job arrays may not be allowed. Instead, users should bundle multiple tasks into a single Slurm job and run them inside the job allocation.
This approach is often called task farming.
A job array is a Slurm feature that lets you submit many similar jobs with a single script using the #SBATCH --array directive. Each task in the array gets its own SLURM_ARRAY_TASK_ID, which you use to vary inputs (e.g. different files, parameters, or seeds). It is the standard way to run many independent jobs on HPC.
However, on Pinnacles the QOS policy limits how many jobs can be in the queue at once, so large arrays may be rejected or held. The task farming approach below is a workaround that achieves the same result within a single job allocation.
See the Job Arrays section in the Running Jobs guide for full documentation on how job arrays work.
Benefits:
- avoids hitting queue limits
- reduces scheduler load
- improves job throughput for small tasks
When to Use This Method?
This technique works well when running:
- parameter sweeps
- many short simulations
- repeated data analysis jobs
Example scenario:
You want to run 20 small serial Python simulations, but the cluster allows only 6 jobs per user. Instead of submitting 20 jobs, submit one job that runs multiple tasks internally. Download the Python file: ⬇️ Download run_simulation.py
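If you just want to see the general shape of such a script, here is an illustrative sketch. The downloadable file above is the authoritative version; the workload below is only a placeholder that stands in for the real simulation.

```python
# run_simulation.py (illustrative sketch only - the real file is the download above)
# Runs one small serial "simulation" selected by an integer task ID.
import sys
import time

def simulate(task_id):
    # Placeholder workload: a simple sum standing in for the real computation
    return sum(i * task_id for i in range(10_000_000))

if __name__ == "__main__":
    task_id = int(sys.argv[1])   # e.g. "python run_simulation.py 5"
    start = time.time()
    result = simulate(task_id)
    print(f"Task {task_id} finished: result={result}, time={time.time() - start:.2f}s")
```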
Method 1: Background Tasks in a Single Job
This approach launches several tasks in parallel using background processes.
#!/bin/bash
#SBATCH --job-name=multi_task
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --output=multi_task_%j.out
module load python
for i in {1..19}; do
python run_simulation.py $i &
done
python run_simulation.py 20
wait
This is a simple way to execute multiple Python jobs within a single job submission. However, this approach may lead to CPU oversubscription, where multiple Python processes run on the same core. This can reduce job efficiency and negatively impact overall performance.
How This Works
| Step | Explanation |
|---|---|
| loop through tasks | runs simulation with different parameters |
& | runs each command in background |
wait | ensures all tasks complete before exiting |
Method 2: Using srun for Better Resource Control
A more scheduler-friendly method uses srun to launch the tasks. Example script:
#!/bin/bash
#SBATCH --job-name=task_farm
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --output=taskfarm_%j.out
module load python
TASKFILE=tasks.txt
srun bash -c '
TASK_ID=$SLURM_PROCID
TOTAL_TASKS=$SLURM_NTASKS
# sed selects every Nth line from the task file.
# Each worker receives a different subset of tasks.
#
# Example with 4 workers with 10 jobs total:
# worker0 → lines 1,5,9
# worker1 → lines 2,6,10
# worker2 → lines 3,7
# worker3 → lines 4,8
# IFS= disables word splitting so spaces in commands are preserved
# -r disables backslash interpretation so \n in commands stays literal
sed -n "$((TASK_ID+1))~$TOTAL_TASKS p" '"$TASKFILE"' | while IFS= read -r cmd
do
echo "Task ${TASK_ID} running: $cmd"
eval "$cmd"
done
'
| Part | Meaning |
|---|---|
IFS= | Disable word splitting (preserves spaces in commands) |
read -r | Read one line at a time into variable cmd |
eval "$cmd" | Execute whatever command is in that line |
So if cmd = "python run_simulation.py 5", then eval "$cmd" runs exactly that.
Create a separate text file called tasks.txt with one command per line:
python run_simulation.py 1
python run_simulation.py 2
python run_simulation.py 3
python run_simulation.py 4
python run_simulation.py 5
python run_simulation.py 6
python run_simulation.py 7
python run_simulation.py 8
python run_simulation.py 9
python run_simulation.py 10
python run_simulation.py 11
python run_simulation.py 12
python run_simulation.py 13
python run_simulation.py 14
python run_simulation.py 15
python run_simulation.py 16
python run_simulation.py 17
python run_simulation.py 18
python run_simulation.py 19
python run_simulation.py 20
What this does
- #SBATCH --ntasks=20 requests 20 Slurm tasks, which usually means 20 CPU cores
- srun ... is called once
- Slurm launches 20 worker processes
- each worker gets a different SLURM_PROCID from 0 to 19
- sed splits tasks.txt across those 20 workers
- each worker runs its assigned commands sequentially
Example: 4 parallel workers, 10 CPUs each, running 20 jobs total
In this example, you have 20 jobs to run and want to use 4 workers in parallel, each using 10 CPU cores. The 4 workers run simultaneously and divide the 20 jobs evenly — each worker handles 5 jobs sequentially. All 4 workers finish at roughly the same time.
SLURM script:
#!/bin/bash
#SBATCH --job-name=task_farm_4workers
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=10
#SBATCH --time=02:00:00
#SBATCH --mem=40G
#SBATCH --output=taskfarm_%j.out
module load python
TASKFILE=tasks.txt
srun bash -c '
TASK_ID=$SLURM_PROCID
TOTAL_TASKS=$SLURM_NTASKS
sed -n "$((TASK_ID+1))~$TOTAL_TASKS p" '"$TASKFILE"' | while IFS= read -r cmd
do
echo "Worker ${TASK_ID} running: $cmd"
eval "$cmd"
done
'
tasks.txt (20 jobs total):
python run_parallel.py 1
python run_parallel.py 2
python run_parallel.py 3
python run_parallel.py 4
python run_parallel.py 5
python run_parallel.py 6
python run_parallel.py 7
python run_parallel.py 8
python run_parallel.py 9
python run_parallel.py 10
python run_parallel.py 11
python run_parallel.py 12
python run_parallel.py 13
python run_parallel.py 14
python run_parallel.py 15
python run_parallel.py 16
python run_parallel.py 17
python run_parallel.py 18
python run_parallel.py 19
python run_parallel.py 20
You don't need to pass the number of CPUs explicitly: run_parallel.py reads $SLURM_CPUS_PER_TASK automatically, so it will use all 10 CPUs allocated per task. If you want to override that, pass the CPU count as a second argument: python run_parallel.py 1 10.
How the 20 jobs are divided across 4 workers:
| Worker | SLURM_PROCID | Jobs assigned |
|---|---|---|
| Worker 0 | 0 | job 1, 5, 9, 13, 17 |
| Worker 1 | 1 | job 2, 6, 10, 14, 18 |
| Worker 2 | 2 | job 3, 7, 11, 15, 19 |
| Worker 3 | 3 | job 4, 8, 12, 16, 20 |
Each worker runs its 5 jobs sequentially, but all 4 workers run in parallel — so all 20 jobs complete in the time it takes to run 5 jobs. To scale this to more jobs, simply add more lines to tasks.txt. To add more parallel workers, increase --ntasks.
The simulation script used in tasks.txt above is a parallel version that uses all 10 CPUs assigned to each worker via Python's multiprocessing module. Download it here:
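For orientation only (the downloadable file above is the authoritative version), a parallel worker script along these lines would read the CPU count from Slurm and fan a placeholder workload out with multiprocessing:

```python
# run_parallel.py (illustrative sketch only - the real file is the download above)
# Uses all CPUs assigned to this Slurm task via Python's multiprocessing module.
import os
import sys
from multiprocessing import Pool

def chunk_work(seed):
    # Placeholder workload standing in for one slice of the real simulation
    return sum(i * seed for i in range(5_000_000))

if __name__ == "__main__":
    job_id = int(sys.argv[1])
    # Use the CPU count Slurm assigned to this task, unless overridden by a second argument
    ncpus = int(sys.argv[2]) if len(sys.argv) > 2 else int(os.environ.get("SLURM_CPUS_PER_TASK", 1))

    with Pool(processes=ncpus) as pool:
        results = pool.map(chunk_work, range(job_id, job_id + ncpus))

    print(f"Job {job_id} used {ncpus} CPUs, partial results: {results[:2]}...")
```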
Checking Available Resources on a Mixed-State Node
A node in mix state means it is partially allocated — some CPUs and memory are in use but others are still free. Before submitting a job, you can inspect exactly how many CPUs and how much memory remain available on a specific node using scontrol:
scontrol show node <nodename>
For example:
scontrol show node gnode009
Look for these fields in the output:
| Field | Meaning |
|---|---|
CPUTot | Total CPUs on the node |
CPUAlloc | CPUs currently allocated to running jobs |
CPULoad | Actual CPU load (can differ from allocated) |
RealMemory | Total memory (MB) on the node |
AllocMem | Memory currently allocated |
FreeMem | Memory currently free |
CfgTRES | Total resources configured on the node |
AllocTRES | Resources currently allocated |
The number of free CPUs available for your job is CPUTot - CPUAlloc, and the free memory is RealMemory - AllocMem (in MB). Make sure your --ntasks, --cpus-per-task, and --mem combination does not exceed what is free on the node.
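If you find yourself doing this arithmetic often, a small helper along the lines of the sketch below (a hypothetical script, not something provided on the cluster) can parse the scontrol output and report the free CPUs and memory for you:

```python
# free_on_node.py (hypothetical helper) - report unallocated CPUs and memory on a node.
# Usage: python free_on_node.py gnode009
import re
import subprocess
import sys

node = sys.argv[1]
out = subprocess.run(["scontrol", "show", "node", node],
                     capture_output=True, text=True, check=True).stdout

def field(name):
    # scontrol prints space-separated Key=Value pairs; grab the integer value
    match = re.search(rf"{name}=(\d+)", out)
    return int(match.group(1)) if match else 0

free_cpus = field("CPUTot") - field("CPUAlloc")
free_mem_mb = field("RealMemory") - field("AllocMem")
print(f"{node}: {free_cpus} CPUs free, {free_mem_mb} MB memory unallocated")
```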
Finding Your "Sweet Spot": A Guide to HPC Scaling
One of the most common misconceptions in HPC is that doubling the number of cores will halve the time it takes to finish a job. In reality, efficiency often drops as you scale up. This guide will help you understand how to find the "Sweet Spot": the point where you get the most science done without wasting cluster resources.
Why "More" is not Always "Faster"
Every parallel program has two parts:
- The Parallel Part: Tasks that can be split up (e.g., math on different parts of a matrix).
- The Serial Part: Tasks that must happen one after another (e.g., reading an input file, starting the MPI environment, or gathering results).
As you add more cores, the Parallel Part gets faster, but the Serial Part stays the same. Eventually, the time spent on communication between cores (the "overhead") becomes larger than the time spent on actual calculation.
Amdahl's Law
This law defines the maximum speedup possible. If 10% of your code is serial, your job can never be more than 10x faster, no matter whether you use 100 or 1000 cores.
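In symbols, if s is the serial fraction of the runtime and N is the number of cores, Amdahl's Law gives the speedup:

Speedup(N) = 1 / (s + (1 - s) / N)

As N grows, the (1 - s) / N term vanishes and the speedup approaches 1 / s. With s = 0.1 (10% serial), that limit is 1 / 0.1 = 10, which is where the 10x ceiling above comes from.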
How to Find the "Sweet Spot" (Step-by-Step)
To find the perfect scaling, you should run a Scaling Study before launching a massive production run.
- The test run: Run the same small test case on different numbers of cores, for example 1, 2, 4, 8, 16, and 32. Record the "wallclock time" for each.
- Calculate efficiency: Use the formula below to see how much efficiency you actually get:

  Efficiency (%) = (T_serial / (N × T_N)) × 100

  - T_serial: Time taken on 1 core
  - T_N: Time taken on N cores
The Rule of Thumb: If your efficiency drops below 70%, you have passed the sweet spot. You are now wasting cluster "Service Units" for very little gain.
Script Example: Automating a scaling study
#!/bin/bash
# scaling_study.sh
# Runs the same job at increasing core counts and prints efficiency for each.
# Efficiency = (T_serial / (N * T_N)) * 100%
RESULTS_FILE="scaling_results.txt"
echo "Cores | Time (s) | Efficiency (%)" > $RESULTS_FILE
echo "------|----------|----------------" >> $RESULTS_FILE
T_SERIAL=""
for CORES in 1 2 4 8 16 32
do
echo "Running with $CORES cores..."
START=$(date +%s%N)
srun --ntasks=1 --nodes=1 --cpus-per-task=$CORES python run_parallel.py 1 $CORES
END=$(date +%s%N)
# Elapsed time in seconds (floating point)
T_N=$(echo "scale=3; ($END - $START) / 1000000000" | bc)
# Store serial time on first iteration
if [ -z "$T_SERIAL" ]; then
T_SERIAL=$T_N
fi
# Efficiency = (T_serial / (N * T_N)) * 100
EFFICIENCY=$(echo "scale=1; ($T_SERIAL / ($CORES * $T_N)) * 100" | bc)
echo "$CORES | $T_N | $EFFICIENCY%" | tee -a $RESULTS_FILE
done
echo ""
echo "=== Scaling Study Complete ==="
cat $RESULTS_FILE