Skip to main content

Best Practices - GPUs on Pinnacles

info

This guide provides an overview of the GPU resources available on the Pinnacles cluster at UC Merced. GPUs offer accelerated performance for machine learning, scientific computing, and data-intensive workloads.

Accessing and Running GPU Jobs

warning

GPU usage on Pinnacles has been extremely high recently. To ensure fair access and maximize resource efficiency, please only request the amount of GPU resources you actually need.

Please be aware that CIRT may terminate GPU jobs that are significantly underutilizing or not actively using the GPUs, in order to free resources for other users.

GPU Allocation

Users must request GPU resources explicitly in their job submissions using the appropriate Slurm directives.

Submission Guidelines

Below are some example directives that may be used when preparing Slurm Job Scripts or Interactive Sessions.

DirectivePurpose
--gres=gpu:NSpecify to Slurm you will require use of GPU and how many (1 or 2).
Optional Directives BelowMay be used as another layer of control/optimization
--constraint=<feature>Specify which GPU type (A100, L40s, etc), useful when running on test partition
--cpus-per-gpu=<N>Reserve N CPU Threads per GPU
--gpus-per-node=<N>When performing multinode jobs, specify how many GPUs per node
--mem-per-gpu=NN (GB\MB) RAM to be proportioned per GPU

For overall Slurm directives and submission scripts, please see here

GPU Types

Available GPU Architecture

Pinnacles provides access to various GPU models optimized for different computational needs. The cluster includes GPUs suitable for:

  • Deep learning and AI workloads
  • Scientific simulations
  • Data analytics and visualization
  • General-purpose GPU computing (GPGPU)

GPU Comparison Table

GPUTechnical NotesBest Use Cases
A100High Precision (FP64)Data/Computational Intensive workloads that require higher numerical precision
L40sLower Precision (FP32),Machine learning/deep learning training, AI Workloads

Harnessing GPUs for Machine Learning

Key Considerations

  • Framework Selection: Choose GPU-optimized frameworks that leverage CUDA
  • Batch Sizing: Balance between GPU memory limits and training efficiency

Resource Planning

  • Estimate GPU memory requirements based on model architecture.
  • Consider multi-GPU strategies for large-scale training.
  • Plan for checkpointing for fault tolerance.
  • Monitor GPU utilization to ensure efficient resource usage. After logging into a GPU node, run the following command:
nvidia-smi

Harnessing GPUs for Scientific Computing

Application Domains

GPUs excel in scientific computing applications including:

  • Computational fluid dynamics
  • Molecular dynamics simulations
  • Climate and weather modeling
  • Bioinformatics and genomics
  • Physics simulations
  • Image and signal processing

Optimization Strategies

  • Identify components to parallelize of algorithms or program.
  • Minimize CPU-GPU data transfers
  • Utilize GPU-accelerated libraries when available
  • Consider domain-specific GPU implementations

Common and Supported Frameworks

Machine Learning Frameworks

The following frameworks are commonly used on Pinnacles GPUs:

  • PyTorch: Dynamic computational graphs, research-friendly
  • TensorFlow: Production-ready, extensive ecosystem

Scientific Computing Libraries

  • CUDA Toolkit: NVIDIA GPU programming platform
  • cuDNN: Deep learning primitives library
  • cuBLAS: GPU-accelerated BLAS operations

Best Practices and Performance Tips

Best Practices for Pinnacles

  • Share GPUs when possible through efficient job scheduling. Do not reserve/allocate more than the job will be anticipated to use.
  • Right-sizing: Request only the GPU resources you need

Job Optimization

  • Profiling: Use profiling tools to identify bottlenecks
  • Queue Selection: Select gpu queue for use of A100 GPUs, select cenvalarc.gpu for L40s GPUs.

Additional Resources

Zero GPU Utilization

Below are three common reasons why a user may encounter 0% GPU utilization:

Is your code GPU-enabled?

Only codes that have explicit GPU support will run successfully on a GPU. Please consult the documentation for your software to find out whether it is suitable for GPU utilization. If your code is not GPU-enabled then please remove the --gres and other GPU-related directives before resubmitting the job.

Ensure software environment is properly configured.

In some cases, specific libraries are required for your code to run on GPUs. Make sure the necessary software toolkits or libraries are loaded from the cluster’s available modules. If not available, ensure you have installed and properly sourced them in your own environment.

Low GPU Utilization: Potential Solutions

If you encounter low GPU utilization (e.g., less than 50%) then please investigate the reasons for the low utilization. Some common reasons are outlined below:

Misconfigured application scripts.

Be sure to read the documentation of the software to make sure that you are using it properly. This includes creating the appropriate software environment.

Training deep learning models while only using a single CPU-cor for dataloading.

Codes such as PyTorch and TensorFlow show performance benefits when multiple CPU-cores are used for the data loading.

Using too many GPUs for a job. Performing a scaling analysis is useful to identify an ideal middle ground.

Common Mistakes

The most common mistake is running a CPU-only code on a GPU node. Only codes that have been explicitly written to run on a GPU can take advantage of a GPU. Read the documentation for the code that you are using to see if it can use a GPU.

Another common mistake is to run a code that is written to work for a single GPU on multiple GPUs.

Documentation and Tutorials

Support Channels