Attention

This documentation is under active development and may change as we refine it. Please email help@massive.org.au if you require assistance or have suggestions for improving this documentation.

PyTorch and CUDA#

When using PyTorch on M3/MonARCH, you do not need to load any CUDA modules to take advantage of the GPU. This is because PyTorch bundles its own CUDA, so long as you follow their instructions for installing via pip or conda with CUDA support. As an example from those instructions, you could use pip to install PyTorch with CUDA 11.8 like so:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
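
To confirm that the bundled CUDA is working, connect to a GPU node, activate the environment you installed PyTorch into, and ask PyTorch whether it can see a GPU (a minimal check; True means a CUDA-capable device was found):

# Run on a GPU node with your PyTorch environment active
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"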

TensorFlow and CUDA#

WARNING: As of 18/Jan/2024, TF >= 2.12 will likely not work with CUDA on our GPU nodes. This is because our NVIDIA drivers are out of date. To find the maximum CUDA version supported on a particular GPU node, connect to that node, run nvidia-smi, and look for CUDA Version: X.Y in the header.
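
For example, you could extract just the driver-supported CUDA version like so (assuming you are already on a GPU node):

# Print only the maximum CUDA version this node's driver supports
nvidia-smi | grep -o "CUDA Version: [0-9.]*"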

Prior to TensorFlow (TF) 2.15, TF did not bundle CUDA in its pip installs, which means you need to module load CUDA and cuDNN in order to take advantage of CUDA acceleration on our GPU nodes.

It’s important that you load *compatible* versions of CUDA and cuDNN. To identify compatible versions of CUDA and cuDNN (and Python), see this table in TF’s docs, which lists the tested configurations of TF and CUDA versions. For example, the table confirms that TF 2.11.0 is compatible with Python 3.7-3.10, CUDA 11.2, and cuDNN 8.1.

(Screenshot: TF’s tested-build configurations table, listing TF version, Python version, cuDNN, and CUDA.)

You may need to use slightly different versions of CUDA and cuDNN based on what’s available on our cluster (module avail cuda or module avail cudnn to list our versions). A rule of thumb is:

  • Use a CUDA with the same major version and a greater or equal minor version. For example, if the table says CUDA 11.2, you can try using any CUDA >=11.2 but <12.

  • The same goes for cuDNN. Watch out, though: some of our cuDNN modules load a specific CUDA version themselves, which may conflict with your desired version! The sketch below shows how to check.
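
For example, to see which CUDA and cuDNN modules we provide, and whether a given cuDNN module drags in its own CUDA (the module name below is illustrative; pick one from the avail listing):

# List the CUDA and cuDNN modules available on the cluster
module avail cuda
module avail cudnn
# Inspect a cuDNN module to see whether it loads a specific CUDA itself
module show cudnn/8.2.4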

Using an existing TF module#

We already have some TF modules available. If you only want TF and nothing else, then these may be suitable for your needs, and most of them should automatically handle the CUDA loading for you:

[lexg@m3-login1 tensorflow-cuda]$ module avail tensorflow

-------------------------------------------------------------------------------------- /usr/local/Modules/modulefiles --------------------------------------------------------------------------------------
tensorflow/1.0.0-python2.7.12-gcc5  tensorflow/1.12.0-python2.7.12-gcc5 tensorflow/1.14.0-keras-pydicom     tensorflow/1.4.0-python2.7.12-gcc5  tensorflow/2.2.0
tensorflow/1.10.0-pytorch           tensorflow/1.12.0-python3.6-gcc5    tensorflow/1.15.2-gpu               tensorflow/1.4.0-python3.6-gcc5     tensorflow/2.3.0
tensorflow/1.10.0-pytorch-all       tensorflow/1.13.1-gdal              tensorflow/1.15.2-python3.7.3-gcc8  tensorflow/2.0.0-beta1              tensorflow/2.4.1
tensorflow/1.10.0-pytorch-keras     tensorflow/1.14.0-keras             tensorflow/1.3.0-python2.7.12-gcc5  tensorflow/2.0.0-gpu(default)

# This TF module is already configured with CUDA, so it should just work on any of our GPU nodes
[lexg@m3-login1 tensorflow-cuda]$ module load tensorflow/2.4.1
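
If you want to check what a given TF module pulls in before loading it (for example, which CUDA it configures), you can inspect it with module show, using tensorflow/2.4.1 from above as the example:

# Show what this module sets up, including any CUDA/cuDNN it loads
module show tensorflow/2.4.1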

Creating an environment with venv + pip#

To create a Python virtual environment with TF 2.11 installed, you might do:

# Load some Python version compatible with your desired TF version
module load python/3.9.10-linux-centos7-haswell-gcc10.2.0
# Navigate to where you'd like the virtual environment (i.e. not your HOME directory! It will fill up quickly)
cd /path/to/my-project/
# Create an empty Python virtual environment (venv)
python -m venv tensorflow-env
# Activate your environment
source tensorflow-env/bin/activate
# Install desired TF version. Note: we show the prompt below because it changes once the venv is active
(tensorflow-env)$ pip install tensorflow==2.11

Then to actually use this virtual environment with CUDA enabled, you would connect to a GPU node and do something like:

# Load same version of Python as you used to build the venv, and activate your venv
module load python/3.9.10-linux-centos7-haswell-gcc10.2.0
cd /path/to/my-project/
source tensorflow-env/bin/activate
# Note: we show the prompt below because it changes once the venv is active
(tensorflow-env)$ module load cuda/11.3 cudnn/8.2.4
# Test to see if TF finds the current GPU(s). The final output should be a list of GPU devices, which we see below
(tensorflow-env)$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# TF's output is very verbose and omitted here; look out for errors though (in TF logging, I="info", W="warning", E="error"). Warnings may also be of interest, but are usually not catastrophic
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

If you only see an empty list [] as the final output, that means the GPU was not recognised. This may be because:

  • You are not on a GPU node. If nvidia-smi returns “Command not found”, then you are not on a GPU node; see the sketch after this list for one way to request one.

  • You have not activated your environment. Try which python and verify that it’s pointing to your virtual environment.

  • Something else has gone wrong. Look carefully at the output to see if any errors were logged.
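
One common way to get onto a GPU node is an interactive Slurm job. The partition and GPU specification below are placeholders; check our Slurm documentation for the correct values for your GPU type:

# Hypothetical interactive GPU request; partition and gres names vary, consult our Slurm docs
srun --partition=m3g --gres=gpu:1 --mem=16G --time=1:00:00 --pty bash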

As of TensorFlow 2.15, you may run pip install tensorflow[and-cuda] and TensorFlow will bundle its own CUDA for you, meaning you will not have to module load cuda to get TensorFlow to use the GPU. HOWEVER, this is pointless unless the GPU node you are using has up-to-date drivers. Please inform us if this documentation is out of date!
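
If you do try this, note that square brackets are special in some shells, so it is safest to quote the package specification:

# Quoting stops the shell from treating [and-cuda] as a glob pattern
pip install "tensorflow[and-cuda]"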

Creating an environment with conda#

Conda is an alternative to using python -m venv. In this case it has the advantage that you can install your own CUDA version within your environment, so you don’t need to module load cuda when using this environment. For example, to install TF 2.11:

# Load miniforge3, which provides mamba, a *faster* drop-in replacement for conda
module load miniforge3
# Create the environment in the current working directory (don't do this in HOME! It will fill up quickly) with CUDA and cuDNN included
mamba create -y --prefix=$PWD/tensorflow-2.11-conda -c conda-forge python=3.9 tensorflow-gpu=2.11 cudatoolkit=11.2 cudnn=8.1
# Activating environments via this miniforge3 module is a bit awkward; you can swap to an anaconda module at this point if you prefer
eval "$(conda shell.bash hook)"
conda activate ./tensorflow-2.11-conda
# Assuming you're on a GPU node, you can test if TF can load CUDA and detect the GPU.
# Note: we show the prompt below because it changes once the conda environment is active
(tensorflow-2.11-conda)$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Most of the output is omitted; only this last line matters
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
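
To reuse this environment in a later session, reactivate it by path (using the project directory from the examples above):

# Re-activate the conda environment in a new shell session
module load miniforge3
eval "$(conda shell.bash hook)"
conda activate /path/to/my-project/tensorflow-2.11-conda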

We recommend using the conda-forge channel over the default anaconda channel, since conda-forge tends to be more up-to-date.