
Intermediate High Performance Computing (HPC) training material

This intermediate training is designed for users who are already comfortable with the Linux command line and basic job submission but need to optimize their workflows for performance, efficiency, and scalability. It assumes knowledge equivalent to the Introduction to High Performance Computing course.

Learning Objectives

By the end of this session, students will be able to:

  • Install and manage GPU-specific Python packages using Conda on HPC
  • Request and use interactive GPU nodes correctly
  • Run multiple tasks within a single SLURM job to work within QOS limits
  • Explain Amdahl's Law and perform a basic scaling study
  • Identify the "sweet spot" for core count in their own workflows

Installing packages using conda

Why Use Conda on HPC?

On HPC clusters, you do not have administrator privileges. This means you cannot use sudo apt install or a system-wide pip install to add packages. Conda solves this by giving you a fully self-contained environment in your home directory that you control entirely.

Benefits of Conda on HPC:

  • Install any Python package without admin rights
  • Create isolated environments per project to avoid dependency conflicts
  • Easily reproduce environments across different machines
  • Install GPU-specific packages that link against the correct CUDA version

We will walk through an example of how to install a GPU-specific package using Conda.

Understanding GPU Nodes on HPC

Before installing GPU packages, you need to understand that:

  • Login nodes do not have GPUs; they are for editing files and submitting jobs only
  • GPU nodes are special compute nodes with NVIDIA GPUs attached
  • GPU packages must be installed on a GPU node so Conda can detect the correct CUDA version
note

Installing a GPU package on a login node may result in a CPU-only build being installed, even if you request GPU support. Always install GPU packages interactively on a GPU node.


Step 1: Check Available GPU Partitions

Before requesting a GPU node, find out which partitions have GPUs:

# List all partitions and look for GPU-related ones
sinfo -o "%P %G %N" | grep -i gpu

Example output:

test gpu:nvidia_h200_nvl:2 gnode[026-029]
test gpu:l40s:1 gnode025
test gpu:a100:2 gnode[001-010]
gpu gpu:a100:2 gnode[001-008]
cenvalarc.gpu gpu:nvidia_h200_nvl:2 gnode[026-029]

You can also check what GPU resources are available on the gpu partition:

# Show GPU node details
sinfo -p gpu --Format=NodeList,Gres,GresUsed,StateCompact

Example output:

NODELIST                GRES        GRES_USED            STATE
gnode[002,006]          gpu:a100:2  gpu:a100:1(IDX:0)    mix
gnode[003-004]          gpu:a100:2  gpu:a100:2(IDX:0-1)  alloc
gnode[001,005,007-008]  gpu:a100:2  gpu:a100:0(IDX:N/A)  idle

To make it easier to check GPU availability at a glance, you can use the helper script below. It filters for idle or mixed nodes and shows how many GPUs are free on each.

⬇️ Download check_gpu.sh
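If you are curious how such a helper works, here is a minimal sketch (the downloadable check_gpu.sh may differ). It assumes the Gres/GresUsed formats shown above, i.e. `gpu:<type>:<count>` and `gpu:<type>:<used>(IDX:…)`:

```shell
#!/bin/bash
# Sketch of a GPU-availability check; parses one "node gres gres_used state"
# record per line, as produced by:
#   sinfo -N -p gpu --noheader --Format=NodeList,Gres,GresUsed,StateCompact
parse_gpu_free() {
  while read -r node gres used state; do
    case "$state" in idle|mix) ;; *) continue ;; esac   # only nodes with free GPUs
    total=${gres##*:}        # gpu:a100:2 -> 2
    inuse=${used%%(*}        # strip the "(IDX:...)" suffix
    inuse=${inuse##*:}       # gpu:a100:1 -> 1
    gputype=$(echo "$gres" | cut -d: -f2)
    echo "$node $gputype free=$((total - inuse))/$total state=$state"
  done
}

# Guarded so the sketch also runs on machines without Slurm:
if command -v sinfo >/dev/null 2>&1; then
  sinfo -N -p gpu --noheader --Format=NodeList,Gres,GresUsed,StateCompact | parse_gpu_free
fi
```

The real script also loops over partitions and groups node ranges; this sketch only shows the parsing idea.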

Transfer the file to Pinnacles using scp:

scp check_gpu.sh <username>@login.rc.ucmerced.edu:/your/path

After transferring, make it executable and run it on Pinnacles:

chmod +x check_gpu.sh
./check_gpu.sh

Example output:

=== Available GPUs on Pinnacles ===

Partition: test
Node: gnode016
Type: l40s
Free: 4 / 4 GPUs
State: idle

Partition: test
Node: gnode014
Type: h100
Free: 1 / 1 GPUs
State: mix

Partition: gpu
Node: gnode[001,005]
Type: a100
Free: 2 / 2 GPUs
State: idle

Step 2: Request an Interactive GPU Node

To install GPU packages correctly, you need to start an interactive session on a GPU node. This gives you a live terminal on the node where GPUs are physically present.

srun --partition=gpu \
     --gres=gpu:1 \
     --ntasks=1 \
     --cpus-per-task=4 \
     --mem=8G \
     --time=01:00:00 \
     --pty bash

If you want to target a specific node from the check_gpu.sh output above, add --nodelist:

srun --partition=gpu \
     --nodelist=gnode016 \
     --gres=gpu:1 \
     --ntasks=1 \
     --cpus-per-task=4 \
     --mem=8G \
     --time=01:00:00 \
     --pty bash

Replace gnode016 with the node name shown under Node in the script output, and replace gpu with the corresponding Partition value.

What each flag means

| Flag | Description |
| --- | --- |
| `--partition=gpu` | The partition name |
| `--nodelist=gnode016` | (Optional) Pin to a specific node with available GPUs |
| `--gres=gpu:1` | Request 1 GPU device |
| `--ntasks=1` | Run 1 task |
| `--cpus-per-task=4` | Request 4 CPU cores to pair with the GPU |
| `--mem=8G` | Request 8 GB of RAM |
| `--time=01:00:00` | Reserve the node for 1 hour (enough time to install packages) |
| `--pty bash` | Open an interactive bash terminal |

Once the session starts, your prompt will change to show you are on a compute node:

[yyu49@gnode002 ~]$

Verify the GPU is visible

Once on the GPU node, confirm you can see the GPU:

nvidia-smi

Expected output

Tue Mar 17 14:29:33 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:17:00.0 Off | 0 |
| N/A 25C P0 32W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
note

Note the CUDA Version shown; you will need it in the next step.


Step 3: Load the Conda Module

On our HPC clusters, Conda is available as a module. Load it before creating any environments:

module load anaconda3

Verify Conda is loaded:

conda --version

Step 4: Create a New Conda Environment

Always create a dedicated environment for GPU work. This keeps your GPU packages isolated from other projects and avoids version conflicts.

conda create -n gpu_training python=3.10 -y

What this does

| Part | Explanation |
| --- | --- |
| `conda create` | Creates a new environment |
| `-n gpu_training` | Names the environment `gpu_training` |
| `python=3.10` | Installs Python 3.10 into the environment |
| `-y` | Skips the confirmation prompt |

Activate the new environment:

source activate gpu_training

Your prompt will update to show the active environment:

(gpu_training) [yyu49@gnode001 ~]$

Step 5: Install a GPU Package — Numba

For this training, we will install Numba, a lightweight GPU-accelerated Python package. Numba is an excellent choice for this example:

  • It is small (~400 MB including the CUDA toolkit) compared to PyTorch (~3+ GB) or TensorFlow (~2+ GB)
  • It uses simple Python decorators, no new language to learn
  • It clearly demonstrates the concept of moving computation to the GPU
  • It is widely used in scientific computing

Install Numba with CUDA support

conda install -c conda-forge numba cudatoolkit -y

What is being installed

| Package | Purpose |
| --- | --- |
| `numba` | The GPU-accelerated JIT compiler |
| `cudatoolkit` | NVIDIA CUDA runtime libraries that Numba needs to talk to the GPU |
tip

The -c conda-forge flag tells Conda to use the conda-forge channel, which has the most up-to-date and compatible builds of these packages.

warning

NumPy version conflict: numba may pull in a version of NumPy that is too new and incompatible. If you see errors like numpy.ufunc has no attribute... or similar, pin NumPy to a compatible version:

conda install -c conda-forge numba cudatoolkit "numpy<2.0" -y

Step 6: Verify the GPU Installation

Once installed, verify that Numba can see the GPU:

python -c "from numba import cuda; print(cuda.gpus)"

Expected output:

<Managed Device 0>

Managing Your Conda Environment

List all your environments

conda env list

List packages in active environment

conda list

Deactivate an environment

conda deactivate

Remove an environment (to free storage)

conda env remove -n gpu_training
warning

Conda environments can use significant disk space. Check your quota regularly with quota -s and remove environments you no longer need.

Check how much space an environment uses

du -sh ~/.conda/envs/gpu_training

Check overall storage quota

quota -s

This shows your current usage and limits for HOME, data, and scratch in a human-readable format.

Common Issues and Solutions

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| `No CUDA-capable device found` | Installed on login node, not GPU node | Reinstall inside an interactive GPU session |
| `conda: command not found` | Module not loaded | Run `module load anaconda3` first |
| `PackagesNotFoundError` | Wrong channel | Add `-c conda-forge` to your install command |
| `Disk quota exceeded` | Environment too large | Remove unused environments with `conda env remove` |
| GPU visible in `nvidia-smi` but not Numba | CUDA version mismatch | Specify `cudatoolkit=XX.X` matching your driver |

Running Multiple Jobs Without Job Arrays

Overview

On the Pinnacles cluster, Quality of Service (QOS) limits restrict the number of jobs a user can have in the queue or running at one time.

Because of this policy, large Slurm job arrays may not be allowed. Instead, users should bundle multiple tasks into a single Slurm job and run them inside the job allocation.

This approach is often called task farming.

What is a Job Array?

A job array is a Slurm feature that lets you submit many similar jobs with a single script using the #SBATCH --array directive. Each task in the array gets its own SLURM_ARRAY_TASK_ID, which you use to vary inputs (e.g. different files, parameters, or seeds). It is the standard way to run many independent jobs on HPC.
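For reference, a minimal array submission script looks like this (a sketch; the array range, partition, and time limit are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --partition=test
#SBATCH --array=1-20          # 20 independent tasks, IDs 1..20
#SBATCH --time=00:10:00
#SBATCH --output=array_%A_%a.out

module load python
# Each array task sees its own ID and uses it to pick its parameter:
python run_simulation.py "$SLURM_ARRAY_TASK_ID"
```

Slurm treats each array task as a separate job for queue accounting, which is exactly why large arrays run into the QOS limits described below.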

However, on Pinnacles the QOS policy limits how many jobs can be in the queue at once, so large arrays may be rejected or held. The task farming approach below is a workaround that achieves the same result within a single job allocation.

See the Job Arrays section in the Running Jobs guide for full documentation on how job arrays work.

Benefits:

  • avoids hitting queue limits
  • reduces scheduler load
  • improves job throughput for small tasks

When to Use This Method?

This technique works well when running:

  • parameter sweeps
  • many short simulations
  • repeated data analysis jobs

Example scenario:

You want to run 20 small serial Python simulations, but the cluster allows only 6 jobs per user. Instead of submitting 20 jobs, submit one job that runs multiple tasks internally. Download the Python file: ⬇️ Download run_simulation.py

Method 1: Background Tasks in a Single Job

This approach launches several tasks in parallel using background processes.

#!/bin/bash
#SBATCH --job-name=multi_task
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --output=multi_task_%j.out

module load python

for i in {1..20}; do
  python run_simulation.py "$i" &
done

wait   # block until all 20 background tasks finish

This is a simple way to execute multiple Python jobs within a single job submission. However, this approach may lead to CPU oversubscription, where multiple Python processes run on the same core. This can reduce job efficiency and negatively impact overall performance.
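One way to mitigate oversubscription (a sketch, not the only approach) is to throttle how many tasks run in the background at once, matching the limit to the cores you requested. The batch size of 4 and the `echo` stand-in here are illustrative:

```shell
#!/bin/bash
# Run tasks in the background, but never more than MAX_CONCURRENT at a time.
MAX_CONCURRENT=4

run_throttled() {
  local n=$1 i
  for i in $(seq 1 "$n"); do
    echo "task $i" &            # stand-in for: python run_simulation.py "$i" &
    # after every MAX_CONCURRENT launches, wait for that batch to finish
    if (( i % MAX_CONCURRENT == 0 )); then
      wait
    fi
  done
  wait                          # catch the final partial batch
}

run_throttled 10
```

Batch-wise `wait` is coarse (one slow task stalls its whole batch); on bash 4.3+, `wait -n` resumes as soon as any single task finishes, and tools like GNU parallel offer finer control.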

How This Works

| Step | Explanation |
| --- | --- |
| loop through tasks | runs the simulation with different parameters |
| `&` | runs each command in the background |
| `wait` | ensures all tasks complete before exiting |

Method 2: Using srun for Better Resource Control

A more scheduler-friendly method uses srun to launch the tasks.

Example script:

#!/bin/bash
#SBATCH --job-name=task_farm
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --output=taskfarm_%j.out

module load python

TASKFILE=tasks.txt

srun bash -c '
  TASK_ID=$SLURM_PROCID
  TOTAL_TASKS=$SLURM_NTASKS

  # sed selects every Nth line from the task file,
  # so each worker receives a different subset of tasks.
  #
  # Example with 4 workers and 10 jobs total:
  #   worker0 → lines 1,5,9
  #   worker1 → lines 2,6,10
  #   worker2 → lines 3,7
  #   worker3 → lines 4,8
  #
  # IFS= disables word splitting so spaces in commands are preserved;
  # -r disables backslash interpretation so \n in commands stays literal.
  sed -n "$((TASK_ID+1))~$TOTAL_TASKS p" '"$TASKFILE"' | while IFS= read -r cmd
  do
    echo "Task ${TASK_ID} running: $cmd"
    eval "$cmd"
  done
'
| Part | Meaning |
| --- | --- |
| `IFS=` | Disables word splitting (preserves spaces in commands) |
| `read -r` | Reads one line at a time into the variable `cmd` |
| `eval "$cmd"` | Executes whatever command is on that line |

So if cmd = "python run_simulation.py 5", then eval "$cmd" runs exactly that.
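You can try the `start~step` selection by itself, outside Slurm, with a throwaway task file (a hypothetical 4-worker split of 10 jobs; the file name is arbitrary):

```shell
#!/bin/bash
# Build a 10-line task file, then show which lines each of 4 workers would get.
TASKFILE=$(mktemp)
for i in $(seq 1 10); do
  echo "echo job $i" >> "$TASKFILE"
done

TOTAL_TASKS=4                              # stand-in for $SLURM_NTASKS
for TASK_ID in 0 1 2 3; do                 # stand-in for $SLURM_PROCID
  printf 'worker %s:' "$TASK_ID"
  sed -n "$((TASK_ID+1))~$TOTAL_TASKS p" "$TASKFILE" | while IFS= read -r cmd; do
    printf ' [%s]' "$cmd"
  done
  echo
done
# worker 0: [echo job 1] [echo job 5] [echo job 9]
# worker 1: [echo job 2] [echo job 6] [echo job 10]
# worker 2: [echo job 3] [echo job 7]
# worker 3: [echo job 4] [echo job 8]
```

Note that `first~step` addressing is a GNU sed extension, which is what you will find on the cluster's Linux nodes.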

Create a separate task file called tasks.txt:

python run_simulation.py 1
python run_simulation.py 2
python run_simulation.py 3
python run_simulation.py 4
python run_simulation.py 5
python run_simulation.py 6
python run_simulation.py 7
python run_simulation.py 8
python run_simulation.py 9
python run_simulation.py 10
python run_simulation.py 11
python run_simulation.py 12
python run_simulation.py 13
python run_simulation.py 14
python run_simulation.py 15
python run_simulation.py 16
python run_simulation.py 17
python run_simulation.py 18
python run_simulation.py 19
python run_simulation.py 20

What this does

  • #SBATCH --ntasks=20 requests 20 Slurm tasks, which usually means 20 CPU cores
  • srun... is called once
  • Slurm launches 20 worker processes
  • each worker gets a different SLURM_PROCID from 0 to 19
  • sed splits tasks.txt across those 20 workers
  • each worker runs its assigned commands sequentially

Example: 4 parallel workers, 10 CPUs each, running 20 jobs total

In this example, you have 20 jobs to run and want to use 4 workers in parallel, each using 10 CPU cores. The 4 workers run simultaneously and divide the 20 jobs evenly — each worker handles 5 jobs sequentially. All 4 workers finish at roughly the same time.

SLURM script:

#!/bin/bash
#SBATCH --job-name=task_farm_4workers
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=10
#SBATCH --time=02:00:00
#SBATCH --mem=40G
#SBATCH --output=taskfarm_%j.out

module load python

TASKFILE=tasks.txt

srun bash -c '
  TASK_ID=$SLURM_PROCID
  TOTAL_TASKS=$SLURM_NTASKS

  sed -n "$((TASK_ID+1))~$TOTAL_TASKS p" '"$TASKFILE"' | while IFS= read -r cmd
  do
    echo "Worker ${TASK_ID} running: $cmd"
    eval "$cmd"
  done
'

tasks.txt (20 jobs total):

python run_parallel.py 1
python run_parallel.py 2
python run_parallel.py 3
python run_parallel.py 4
python run_parallel.py 5
python run_parallel.py 6
python run_parallel.py 7
python run_parallel.py 8
python run_parallel.py 9
python run_parallel.py 10
python run_parallel.py 11
python run_parallel.py 12
python run_parallel.py 13
python run_parallel.py 14
python run_parallel.py 15
python run_parallel.py 16
python run_parallel.py 17
python run_parallel.py 18
python run_parallel.py 19
python run_parallel.py 20
note

You don't need to pass the number of workers explicitly — run_parallel.py reads $SLURM_CPUS_PER_TASK automatically, so it will use all 10 CPUs allocated per task. If you want to override it, you can pass it as a second argument: python run_parallel.py 1 10.

How the 20 jobs are divided across 4 workers:

| Worker | SLURM_PROCID | Jobs assigned |
| --- | --- | --- |
| Worker 0 | 0 | jobs 1, 5, 9, 13, 17 |
| Worker 1 | 1 | jobs 2, 6, 10, 14, 18 |
| Worker 2 | 2 | jobs 3, 7, 11, 15, 19 |
| Worker 3 | 3 | jobs 4, 8, 12, 16, 20 |

Each worker runs its 5 jobs sequentially, but all 4 workers run in parallel — so all 20 jobs complete in the time it takes to run 5 jobs. To scale this to more jobs, simply add more lines to tasks.txt. To add more parallel workers, increase --ntasks.

The simulation script used in tasks.txt above is a parallel version that uses all 10 CPUs assigned to each worker via Python's multiprocessing module. Download it here:

⬇️ Download run_parallel.py

Checking Available Resources on a Mixed-State Node

A node in mix state means it is partially allocated — some CPUs and memory are in use but others are still free. Before submitting a job, you can inspect exactly how many CPUs and how much memory remain available on a specific node using scontrol:

scontrol show node <nodename>

For example:

scontrol show node gnode009

Look for these fields in the output:

| Field | Meaning |
| --- | --- |
| `CPUTot` | Total CPUs on the node |
| `CPUAlloc` | CPUs currently allocated to running jobs |
| `CPULoad` | Actual CPU load (can differ from allocated) |
| `RealMemory` | Total memory (MB) on the node |
| `AllocMem` | Memory currently allocated |
| `FreeMem` | Memory currently free |
| `CfgTRES` | Total resources configured on the node |
| `AllocTRES` | Resources currently allocated |

The number of free CPUs available for your job is CPUTot - CPUAlloc, and the free memory is RealMemory - AllocMem (in MB). Make sure your --ntasks, --cpus-per-task, and --mem combination does not exceed what is free on the node.
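The arithmetic can be scripted. The sketch below pulls those KEY=VALUE fields out of a sample record (the sample values are made up; on the cluster, capture the real `scontrol show node <nodename>` output instead):

```shell
#!/bin/bash
# Compute free CPUs/memory from scontrol's KEY=VALUE fields.
# The sample record below is made up; on the cluster use:
#   node_info=$(scontrol show node gnode009)
node_info='NodeName=gnode009 CPUAlloc=40 CPUTot=64 RealMemory=250000 AllocMem=100000'

free_resources() {
  local info=$1 cputot cpualloc realmem allocmem
  cputot=$(grep -o 'CPUTot=[0-9]*' <<<"$info" | cut -d= -f2)
  cpualloc=$(grep -o 'CPUAlloc=[0-9]*' <<<"$info" | cut -d= -f2)
  realmem=$(grep -o 'RealMemory=[0-9]*' <<<"$info" | cut -d= -f2)
  allocmem=$(grep -o 'AllocMem=[0-9]*' <<<"$info" | cut -d= -f2)
  echo "free CPUs: $((cputot - cpualloc)), free memory: $((realmem - allocmem)) MB"
}

free_resources "$node_info"
# free CPUs: 24, free memory: 150000 MB
```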

Finding Your "Sweet Spot": A Guide to HPC Scaling

One of the most common misconceptions in HPC is that doubling the number of cores will halve the time it takes to finish a job. In reality, efficiency often drops as you scale up. This guide will help you understand how to find the "Sweet Spot": the point where you get the most science done without wasting cluster resources.

Why "More" is not Always "Faster"

Every parallel program has two parts:

  1. The Parallel Part: Tasks that can be split up (e.g., math on different parts of a matrix).

  2. The Serial Part: Tasks that must happen one after another (e.g., reading an input file, starting the MPI environment, or gathering results).

As you add more cores, the Parallel Part gets faster, but the Serial Part stays the same. Eventually, the time spent on communication between cores (the "overhead") becomes larger than the time spent on actual calculation.

Amdahl's Law

This law defines the maximum speedup possible. If 10% of your code is serial, your job can never be more than 10x faster, whether you use 100 or 1,000 cores.
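Expressed as a quick calculation (using awk for the floating-point math; the 10% serial fraction matches the example above):

```shell
#!/bin/bash
# Amdahl's Law: the maximum speedup with serial fraction s on n cores is
#   1 / (s + (1 - s)/n)
amdahl() {
  awk -v s="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / (s + (1 - s) / n) }'
}

amdahl 0.10 10      # 5.26
amdahl 0.10 100     # 9.17
amdahl 0.10 1000    # 9.91  (creeping up on the hard ceiling of 1/0.10 = 10x)
```

Going from 100 to 1,000 cores buys less than a 10% improvement here, which is exactly the diminishing return the scaling study below is designed to detect.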

How to Find the "Sweet Spot" (Step-by-Step)

To find the perfect scaling, you should run a Scaling Study before launching a massive production run.

  1. The test run: Run the same small test case on different numbers of cores, for example 1, 2, 4, 8, 16, and 32. Record the wallclock time for each.

  2. Calculate Efficiency: Use the formula below to see how much efficiency you actually achieve:

    Efficiency = \frac{T_{serial}}{N \times T_N} \times 100\%

    • T_{serial}: Time taken on 1 core
    • T_N: Time taken on N cores

The Rule of Thumb: If your efficiency drops below 70%, you have passed the sweet spot. You are now wasting cluster "Service Units" for very little gain.

Script Example: Automating a scaling study

#!/bin/bash
# scaling_study.sh
# Runs the same job at increasing core counts and prints efficiency for each.
# Efficiency = (T_serial / (N * T_N)) * 100%

RESULTS_FILE="scaling_results.txt"
echo "Cores | Time (s) | Efficiency (%)" > "$RESULTS_FILE"
echo "------|----------|----------------" >> "$RESULTS_FILE"

T_SERIAL=""

for CORES in 1 2 4 8 16 32
do
  echo "Running with $CORES cores..."

  START=$(date +%s%N)
  srun --ntasks=1 --nodes=1 --cpus-per-task=$CORES python run_parallel.py 1 $CORES
  END=$(date +%s%N)

  # Elapsed time in seconds (floating point)
  T_N=$(echo "scale=3; ($END - $START) / 1000000000" | bc)

  # Store the serial (1-core) time from the first iteration
  if [ -z "$T_SERIAL" ]; then
    T_SERIAL=$T_N
  fi

  # Efficiency = (T_serial / (N * T_N)) * 100
  EFFICIENCY=$(echo "scale=1; ($T_SERIAL / ($CORES * $T_N)) * 100" | bc)

  echo "$CORES | $T_N | $EFFICIENCY%" | tee -a "$RESULTS_FILE"
done

echo ""
echo "=== Scaling Study Complete ==="
cat "$RESULTS_FILE"

echo ""
echo "=== Scaling Study Complete ==="
cat $RESULTS_FILE