Intermediate High Performance Computing (HPC) training material
This intermediate training is designed for users who are already comfortable with the Linux command line and basic job submission but need to optimize their workflows for performance, efficiency, and scalability. It is intended for users whose background is equivalent to the material covered in the Introduction to High Performance Computing course.
By the end of this session, students will be able to:
- Install and manage GPU-specific Python packages using Conda on HPC
- Request and use interactive GPU nodes correctly
- Run multiple tasks within a single SLURM job to work within QOS limits
- Explain Amdahl's Law and perform a basic scaling study
- Identify the "sweet spot" for core count in their own workflows
Installing Packages Using Conda
Why Use Conda on HPC?
On HPC clusters, you do not have administrator privileges. This means you cannot use sudo apt install or system-wide pip install to add packages. Conda solves this by giving you a fully self-contained environment in your home directory that you control entirely.
Benefits of Conda on HPC:
- Install any Python package without admin rights
- Create isolated environments per project to avoid dependency conflicts
- Easily reproduce environments across different machines
- Install GPU-specific packages that link against the correct CUDA version
We will show you an example of how to install a GPU-specific package using Conda.
Understanding GPU Nodes on HPC
Before installing GPU packages, you need to understand that:
- Login nodes do not have GPUs; they are for editing files and submitting jobs only
- GPU nodes are special compute nodes with NVIDIA GPUs attached
- GPU packages must be installed on a GPU node so Conda can detect the correct CUDA version
Installing a GPU package on a login node may result in a CPU-only build being installed, even if you request GPU support. Always install GPU packages interactively on a GPU node.
Step 1: Check Available GPU Partitions
Before requesting a GPU node, find out which partitions have GPUs:
# List all partitions and look for GPU-related ones
sinfo -o "%P %G %N" | grep -i gpu
Example output:
test gpu:nvidia_h200_nvl:2 gnode[026-029]
test gpu:l40s:1 gnode025
test gpu:a100:2 gnode[001-010]
gpu gpu:a100:2 gnode[001-008]
cenvalarc.gpu gpu:nvidia_h200_nvl:2 gnode[026-029]
You can also check what GPU resources are available on the gpu partition:
# Show GPU node details
sinfo -p gpu --Format=NodeList,Gres,GresUsed,StateCompact
Example output:
NODELIST GRES GRES_USED STATE
gnode[002,006] gpu:a100:2 gpu:a100:1(IDX:0) mix
gnode[003-004] gpu:a100:2 gpu:a100:2(IDX:0-1) alloc
gnode[001,005,007-008] gpu:a100:2 gpu:a100:0(IDX:N/A) idle
To make it easier to check GPU availability at a glance, you can use the helper script below. It filters for idle or mixed nodes and shows how many GPUs are free on each.
⬇️ Download check_gpu.sh
Transfer the file to Pinnacles using scp:
scp check_gpu.sh <username>@login.rc.ucmerced.edu:/your/path
After transferring, make it executable and run it on Pinnacles:
chmod +x check_gpu.sh
./check_gpu.sh
Example output:
=== Available GPUs on Pinnacles ===
Partition: test
Node: gnode016
Type: l40s
Free: 4 / 4 GPUs
State: idle
Partition: test
Node: gnode014
Type: h100
Free: 1 / 1 GPUs
State: mix
Partition: gpu
Node: gnode[001,005]
Type: a100
Free: 2 / 2 GPUs
State: idle
Step 2: Request an Interactive GPU Node
To install GPU packages correctly, you need to start an interactive session on a GPU node. This gives you a live terminal on the node where GPUs are physically present.
srun --partition=gpu \
--gres=gpu:1 \
--ntasks=1 \
--cpus-per-task=4 \
--mem=8G \
--time=01:00:00 \
--pty bash
If you want to target a specific node from the check_gpu.sh output above, add --nodelist:
srun --partition=gpu \
--nodelist=gnode016 \
--gres=gpu:1 \
--ntasks=1 \
--cpus-per-task=4 \
--mem=8G \
--time=01:00:00 \
--pty bash
Replace gnode016 with the node name shown under Node in the script output, and replace gpu with the corresponding Partition value.
What each flag means
| Flag | Description |
|---|---|
--partition=gpu | The partition name |
--nodelist=gnode016 | (Optional) Pin to a specific node with available GPUs |
--gres=gpu:1 | Request 1 GPU device |
--ntasks=1 | Run 1 task |
--cpus-per-task=4 | Request 4 CPU cores to pair with the GPU |
--mem=8G | Request 8GB of RAM |
--time=01:00:00 | Reserve the node for 1 hour (enough time to install packages) |
--pty bash | Open an interactive bash terminal |
Once the session starts, your prompt will change to show you are on a compute node:
[yyu49@gnode002 ~]$
Verify the GPU is visible
Once on the GPU node, confirm you can see the GPU:
nvidia-smi
Expected output
Tue Mar 17 14:29:33 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:17:00.0 Off | 0 |
| N/A 25C P0 32W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Note the CUDA Version shown; you will need it later when installing CUDA-dependent packages.
Step 3: Load the Conda Module
On our HPC clusters, Conda is available as a module. Load it before creating any environments:
module load anaconda3
Verify Conda is loaded:
conda --version
Step 4: Create a New Conda Environment
Always create a dedicated environment for GPU work. This keeps your GPU packages isolated from other projects and avoids version conflicts.
conda create -n gpu_training python=3.10 -y
What this does
| Part | Explanation |
|---|---|
conda create | Creates a new environment |
-n gpu_training | Names the environment gpu_training |
python=3.10 | Installs Python 3.10 into the environment |
-y | Skips the confirmation prompt |
Activate the new environment:
source activate gpu_training
Your prompt will update to show the active environment:
(gpu_training) [yyu49@gnode001 ~]$
Step 5: Install a GPU Package — Numba
For this training, we will install Numba, a lightweight GPU-accelerated Python package. Numba is an excellent choice for demonstration purposes:
- It is small (~400 MB including the CUDA toolkit) compared to PyTorch (~3+ GB) or TensorFlow (~2+ GB)
- It uses simple Python decorators, so there is no new language to learn
- It clearly demonstrates the concept of moving computation to the GPU
- It is widely used in scientific computing
Install Numba with CUDA support
conda install -c conda-forge numba cudatoolkit -y
What is being installed
| Package | Purpose |
|---|---|
numba | The GPU-accelerated JIT compiler |
cudatoolkit | NVIDIA CUDA runtime libraries that Numba needs to talk to the GPU |
Tip: The -c conda-forge flag tells Conda to use the conda-forge channel, which has the most up-to-date and compatible builds of these packages.
NumPy version conflict: numba may pull in a version of NumPy that is too new and incompatible. If you see errors like numpy.ufunc has no attribute... or similar, pin NumPy to a compatible version:
conda install -c conda-forge numba cudatoolkit "numpy<2.0" -y
Step 6: Verify the GPU Installation
Once installed, verify that Numba can see the GPU:
python -c "from numba import cuda; print(cuda.gpus)"
Expected output:
<Managed Device 0>
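To go one step further and see Numba actually moving work to the GPU, you can run a small test like the sketch below. This is an illustrative example rather than part of the installation: the kernel, array size, and file name (gpu_test.py) are arbitrary choices. Save it and run it inside the gpu_training environment on the GPU node with `python gpu_test.py`.

```python
# gpu_test.py - minimal sketch of offloading an element-wise computation to the GPU with Numba.
import numpy as np
from numba import cuda

# A trivial CUDA kernel: each GPU thread adds one pair of elements
@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)          # global thread index
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)

# Copy the inputs to GPU memory and allocate the output there
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(a)

# Launch enough blocks of 256 threads to cover all n elements
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](d_a, d_b, d_out)

out = d_out.copy_to_host()
print(out[:5])   # expected: [2. 2. 2. 2. 2.]
```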
Managing Your Conda Environment
List all your environments
conda env list
List packages in active environment
conda list
Deactivate an environment
conda deactivate
Remove an environment (to free storage)
conda env remove -n gpu_training
Conda environments can use significant disk space. Check your quota regularly with quota -s and remove environments you no longer need.
Check how much space an environment uses
du -sh ~/.conda/envs/gpu_training
Check overall storage quota
quota -s
This shows your current usage and limits for HOME, data, and scratch in a human-readable format.
Common Issues and Solutions
| Problem | Likely Cause | Solution |
|---|---|---|
No CUDA-capable device found | Installed on login node, not GPU node | Reinstall inside an interactive GPU session |
conda: command not found | Module not loaded | Run module load anaconda3 first |
PackagesNotFoundError | Wrong channel | Add -c conda-forge to your install command |
Disk quota exceeded | Environment too large | Remove unused environments with conda env remove |
GPU visible in nvidia-smi but not Numba | CUDA version mismatch | Specify cudatoolkit=XX.X matching your driver |
Running Multiple Jobs Without Job Arrays
Overview
On the Pinnacles cluster, Quality of Service (QOS) limits restrict the number of jobs a user can have in the queue or running at one time.
Because of this policy, large Slurm job arrays may not be allowed. Instead, users should bundle multiple tasks into a single Slurm job and run them inside the job allocation.
This approach is often called task farming.
A job array is a Slurm feature that lets you submit many similar jobs with a single script using the #SBATCH --array directive. Each task in the array gets its own SLURM_ARRAY_TASK_ID, which you use to vary inputs (e.g. different files, parameters, or seeds). It is the standard way to run many independent jobs on HPC.
However, on Pinnacles the QOS policy limits how many jobs can be in the queue at once, so large arrays may be rejected or held. The task farming approach below is a workaround that achieves the same result within a single job allocation.
See the Job Arrays section in the Running Jobs guide for full documentation on how job arrays work.
Benefits:
- avoids hitting queue limits
- reduces scheduler load
- improves job throughput for small tasks
When to Use This Method?
This technique works well when running:
- parameter sweeps
- many short simulations
- repeated data analysis jobs
Example scenario:
You want to run 20 small serial Python simulations, but the cluster allows only 6 jobs per user. Instead of submitting 20 jobs, submit one job that runs multiple tasks internally. Download the Python file: ⬇️ Download run_simulation.py
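If you just want to see the general shape of such a script, here is an illustrative sketch. The downloadable file above is the authoritative version; the workload below is only a placeholder that stands in for the real simulation.

```python
# run_simulation.py (illustrative sketch only - the real file is the download above)
# Runs one small serial "simulation" selected by an integer task ID.
import sys
import time

def simulate(task_id):
    # Placeholder workload: a simple sum standing in for the real computation
    return sum(i * task_id for i in range(10_000_000))

if __name__ == "__main__":
    task_id = int(sys.argv[1])   # e.g. "python run_simulation.py 5"
    start = time.time()
    result = simulate(task_id)
    print(f"Task {task_id} finished: result={result}, time={time.time() - start:.2f}s")
```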
Method 1: Background Tasks in a Single Job
This approach launches several tasks in parallel using background processes.
#!/bin/bash
#SBATCH --job-name=multi_task
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --output=multi_task_%j.out
module load python
for i in {1..19}; do
python run_simulation.py $i &
done
python run_simulation.py 20
wait
This is a simple way to execute multiple Python jobs within a single job submission. However, this approach may lead to CPU oversubscription, where multiple Python processes run on the same core. This can reduce job efficiency and negatively impact overall performance.
How This Works
| Step | Explanation |
|---|---|
| loop through tasks | runs simulation with different parameters |
& | runs each command in background |
wait | ensures all tasks complete before exiting |
Method 2: Using srun for Better Resource Control
A more scheduler-friendly method uses srun to launch the tasks. Example script:
#!/bin/bash
#SBATCH --job-name=task_farm
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --output=taskfarm_%j.out
module load python
TASKFILE=tasks.txt
srun bash -c '
TASK_ID=$SLURM_PROCID
TOTAL_TASKS=$SLURM_NTASKS
# sed selects every Nth line from the task file.
# Each worker receives a different subset of tasks.
#
# Example with 4 workers with 10 jobs total:
# worker0 → lines 1,5,9
# worker1 → lines 2,6,10
# worker2 → lines 3,7
# worker3 → lines 4,8
# IFS= disables word splitting so spaces in commands are preserved
# -r disables backslash interpretation so \n in commands stays literal
sed -n "$((TASK_ID+1))~$TOTAL_TASKS p" '"$TASKFILE"' | while IFS= read -r cmd
do
echo "Task ${TASK_ID} running: $cmd"
eval "$cmd"
done
'
| Part | Meaning |
|---|---|
IFS= | Disable word splitting (preserves spaces in commands) |
read -r | Read one line at a time into variable cmd |
eval "$cmd" | Execute whatever command is in that line |
So if cmd = "python run_simulation.py 5", then eval "$cmd" runs exactly that.
Create a separate text file called tasks.txt with one command per line:
python run_simulation.py 1
python run_simulation.py 2
python run_simulation.py 3
python run_simulation.py 4
python run_simulation.py 5
python run_simulation.py 6
python run_simulation.py 7
python run_simulation.py 8
python run_simulation.py 9
python run_simulation.py 10
python run_simulation.py 11
python run_simulation.py 12
python run_simulation.py 13
python run_simulation.py 14
python run_simulation.py 15
python run_simulation.py 16
python run_simulation.py 17
python run_simulation.py 18
python run_simulation.py 19
python run_simulation.py 20
What this does
- #SBATCH --ntasks=20 requests 20 Slurm tasks, which usually means 20 CPU cores
- srun ... is called once
- Slurm launches 20 worker processes
- each worker gets a different SLURM_PROCID from 0 to 19
- sed splits tasks.txt across those 20 workers
- each worker runs its assigned commands sequentially
Example: 4 parallel workers, 10 CPUs each, running 20 jobs total
In this example, you have 20 jobs to run and want to use 4 workers in parallel, each using 10 CPU cores. The 4 workers run simultaneously and divide the 20 jobs evenly — each worker handles 5 jobs sequentially. All 4 workers finish at roughly the same time.
SLURM script:
#!/bin/bash
#SBATCH --job-name=task_farm_4workers
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=10
#SBATCH --time=02:00:00
#SBATCH --mem=40G
#SBATCH --output=taskfarm_%j.out
module load python
TASKFILE=tasks.txt
srun bash -c '
TASK_ID=$SLURM_PROCID
TOTAL_TASKS=$SLURM_NTASKS
sed -n "$((TASK_ID+1))~$TOTAL_TASKS p" '"$TASKFILE"' | while IFS= read -r cmd
do
echo "Worker ${TASK_ID} running: $cmd"
eval "$cmd"
done
'
tasks.txt (20 jobs total):
python run_parallel.py 1
python run_parallel.py 2
python run_parallel.py 3
python run_parallel.py 4
python run_parallel.py 5
python run_parallel.py 6
python run_parallel.py 7
python run_parallel.py 8
python run_parallel.py 9
python run_parallel.py 10
python run_parallel.py 11
python run_parallel.py 12
python run_parallel.py 13
python run_parallel.py 14
python run_parallel.py 15
python run_parallel.py 16
python run_parallel.py 17
python run_parallel.py 18
python run_parallel.py 19
python run_parallel.py 20
You don't need to pass the number of CPUs explicitly: run_parallel.py reads $SLURM_CPUS_PER_TASK automatically, so it will use all 10 CPUs allocated per task. If you want to override that, pass the CPU count as a second argument: python run_parallel.py 1 10.
How the 20 jobs are divided across 4 workers:
| Worker | SLURM_PROCID | Jobs assigned |
|---|---|---|
| Worker 0 | 0 | job 1, 5, 9, 13, 17 |
| Worker 1 | 1 | job 2, 6, 10, 14, 18 |
| Worker 2 | 2 | job 3, 7, 11, 15, 19 |
| Worker 3 | 3 | job 4, 8, 12, 16, 20 |
Each worker runs its 5 jobs sequentially, but all 4 workers run in parallel — so all 20 jobs complete in the time it takes to run 5 jobs. To scale this to more jobs, simply add more lines to tasks.txt. To add more parallel workers, increase --ntasks.
The simulation script used in tasks.txt above is a parallel version that uses all 10 CPUs assigned to each worker via Python's multiprocessing module. Download it here:
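For orientation only (the downloadable file above is the authoritative version), a parallel worker script along these lines would read the CPU count from Slurm and fan a placeholder workload out with multiprocessing:

```python
# run_parallel.py (illustrative sketch only - the real file is the download above)
# Uses all CPUs assigned to this Slurm task via Python's multiprocessing module.
import os
import sys
from multiprocessing import Pool

def chunk_work(seed):
    # Placeholder workload standing in for one slice of the real simulation
    return sum(i * seed for i in range(5_000_000))

if __name__ == "__main__":
    job_id = int(sys.argv[1])
    # Use the CPU count Slurm assigned to this task, unless overridden by a second argument
    ncpus = int(sys.argv[2]) if len(sys.argv) > 2 else int(os.environ.get("SLURM_CPUS_PER_TASK", 1))

    with Pool(processes=ncpus) as pool:
        results = pool.map(chunk_work, range(job_id, job_id + ncpus))

    print(f"Job {job_id} used {ncpus} CPUs, partial results: {results[:2]}...")
```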
Checking Available Resources on a Mixed-State Node
A node in mix state means it is partially allocated — some CPUs and memory are in use but others are still free. Before submitting a job, you can inspect exactly how many CPUs and how much memory remain available on a specific node using scontrol:
scontrol show node <nodename>
For example:
scontrol show node gnode009
Look for these fields in the output:
| Field | Meaning |
|---|---|
CPUTot | Total CPUs on the node |
CPUAlloc | CPUs currently allocated to running jobs |
CPULoad | Actual CPU load (can differ from allocated) |
RealMemory | Total memory (MB) on the node |
AllocMem | Memory currently allocated |
FreeMem | Memory currently free |
CfgTRES | Total resources configured on the node |
AllocTRES | Resources currently allocated |
The number of free CPUs available for your job is CPUTot - CPUAlloc, and the free memory is RealMemory - AllocMem (in MB). Make sure your --ntasks, --cpus-per-task, and --mem combination does not exceed what is free on the node.
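If you find yourself doing this arithmetic often, a small helper along the lines of the sketch below (a hypothetical script, not something provided on the cluster) can parse the scontrol output and report the free CPUs and memory for you:

```python
# free_on_node.py (hypothetical helper) - report unallocated CPUs and memory on a node.
# Usage: python free_on_node.py gnode009
import re
import subprocess
import sys

node = sys.argv[1]
out = subprocess.run(["scontrol", "show", "node", node],
                     capture_output=True, text=True, check=True).stdout

def field(name):
    # scontrol prints space-separated Key=Value pairs; grab the integer value
    match = re.search(rf"{name}=(\d+)", out)
    return int(match.group(1)) if match else 0

free_cpus = field("CPUTot") - field("CPUAlloc")
free_mem_mb = field("RealMemory") - field("AllocMem")
print(f"{node}: {free_cpus} CPUs free, {free_mem_mb} MB memory unallocated")
```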
Finding Your "Sweet Spot": A Guide to HPC Scaling
One of the most common misconceptions in HPC is that doubling the number of cores will halve the time it takes to finish a job. In reality, efficiency often drops as you scale up. This guide will help you understand how to find the "Sweet Spot": the point where you get the most science done without wasting cluster resources.
Why "More" is not Always "Faster"
Every parallel program has two parts:
- The Parallel Part: Tasks that can be split up (e.g., math on different parts of a matrix).
- The Serial Part: Tasks that must happen one after another (e.g., reading an input file, starting the MPI environment, or gathering results).
As you add more cores, the Parallel Part gets faster, but the Serial Part stays the same. Eventually, the time spent on communication between cores (the "overhead") becomes larger than the time spent on actual calculation.
Amdahl's Law
This law defines the maximum speedup possible. If 10% of your code is serial, your job can never be more than 10x faster, no matter whether you use 100 or 1000 cores.
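In symbols, if s is the serial fraction of the runtime and N is the number of cores, Amdahl's Law gives the speedup:

Speedup(N) = 1 / (s + (1 - s) / N)

As N grows, the (1 - s) / N term vanishes and the speedup approaches 1 / s. With s = 0.1 (10% serial), that limit is 1 / 0.1 = 10, which is where the 10x ceiling above comes from.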
How to Find the "Sweet Spot" (Step-by-Step)
To find the perfect scaling, you should run a Scaling Study before launching a massive production run.
- The test run: Run the same small test case on different numbers of cores, for example 1, 2, 4, 8, 16, and 32. Record the "wallclock time" for each.
- Calculate efficiency: Use the formula below to see how much efficiency you actually get:

  Efficiency (%) = (T_serial / (N × T_N)) × 100

  - T_serial: Time taken on 1 core
  - T_N: Time taken on N cores
The Rule of Thumb: If your efficiency drops below 70%, you have passed the sweet spot. You are now wasting cluster "Service Units" for very little gain.
Script Example: Automating a scaling study
#!/bin/bash
# scaling_study.sh
# Runs the same job at increasing core counts and prints efficiency for each.
# Efficiency = (T_serial / (N * T_N)) * 100%
RESULTS_FILE="scaling_results.txt"
echo "Cores | Time (s) | Efficiency (%)" > $RESULTS_FILE
echo "------|----------|----------------" >> $RESULTS_FILE
T_SERIAL=""
for CORES in 1 2 4 8 16 32
do
echo "Running with $CORES cores..."
START=$(date +%s%N)
srun --ntasks=1 --nodes=1 --cpus-per-task=$CORES python run_parallel.py 1 $CORES
END=$(date +%s%N)
# Elapsed time in seconds (floating point)
T_N=$(echo "scale=3; ($END - $START) / 1000000000" | bc)
# Store serial time on first iteration
if [ -z "$T_SERIAL" ]; then
T_SERIAL=$T_N
fi
# Efficiency = (T_serial / (N * T_N)) * 100
EFFICIENCY=$(echo "scale=1; ($T_SERIAL / ($CORES * $T_N)) * 100" | bc)
echo "$CORES | $T_N | $EFFICIENCY%" | tee -a $RESULTS_FILE
done
echo ""
echo "=== Scaling Study Complete ==="
cat $RESULTS_FILE