Attention

This documentation is under active development, meaning that it can change over time as we refine it. Please email help@massive.org.au if you require assistance, or have suggestions to improve this documentation.

Machine Learning on M3#

This page gathers documentation on common tools and resources that the machine learning community on M3 might find useful. It has recently been updated; please reach out at help@massive.org.au with feedback or suggestions for improvement.

Software#

In the HPC environment, we traditionally encourage software access via centrally available modules, which ensure that there are no software conflicts between individuals on the shared system. For example, if one user needs Python 3.6 and another requires Python 3.8, modules prevent a conflict. However, modules are immutable: while we can install new modules, we never change existing ones, in order to preserve the scientific reproducibility of workflows such as job submission scripts. This can create difficulty for the machine learning community, where Python environments are frequently updated with new libraries.

While we provide software modules on M3 for common machine learning software such as TensorFlow, PyTorch, Caffe and Keras, it’s likely you will want to create your own Python environment so you can install packages as needed.

You can learn more in our documentation about software modules on M3 and Python and conda environments on M3.

Reference datasets#

MASSIVE hosts machine learning-related data collections in the interest of reducing pressure on user storage, minimising download wait times, and providing easy access for researchers. Currently hosted data collections are listed below, and we are open to hosting others that are valuable to the community. If you would like to request that a data collection be added, or to see more information about the data collections hosted on M3, see Data Collections on M3.

Machine learning data collections on M3

Data collection                                             Version              Date of download
ImageNet 2012 (ILSVRC2012)                                  May 2012             2021-03-03
ImageNet 2015 Object Detection Data (ILSVRC2015 DET)        September 2014       2021-10-01
International Skin Imaging Collaboration 2019 (ISIC 2019)   August 2019          2020-10-02
NIH Chest X-ray Dataset (NIH CXR-14)                        2017-09-26           2020-10-29
Stanford Natural Language Inference (SNLI) Corpus           August 2015, v1.0    2021-04-21
COCO (Common Objects in Context)                            September 2017       2021-07-13
AlphaFold - Genetic Sequencing Databases                    July 2021            2021-07-26

GPUs on M3#

Advances in deep learning and neural networks mean that many models perform best on Graphics Processing Units (GPUs). M3 provides access to a range of GPUs which may be relevant to the deep learning community. You can read more about the GPUs on M3 in our existing documentation, or look over our GPU benchmarks in the FAQ to compare how our GPUs perform on machine learning tasks.

We have also put together some examples of running machine learning GPU jobs on M3 in the ML4AU GitHub repository, including advice on how to transition from desktop GPUs to job submission scripts.

For some additional advice on using GPUs on M3:

  • Ensure that when running GPU-enabled code, you are using a node with a GPU. You can do this via a desktop, an interactive smux job, or a job submission (see the sanity-check sketch at the end of this list).

  • If your code isn’t performing as expected on the GPU, ensure you have enabled GPU support such as TensorFlow-GPU, CUDA and cuDNN where required. One easy way to do this is by using the modules installed on M3.

    module avail cuda
    
    ----------------- /usr/local/Modules/modulefiles -----------------
    cuda/10.0          cuda/6.0           cuda/8.0-DL
    cuda/10.1(default) cuda/7.0           cuda/9.0
    cuda/11.0          cuda/7.5           cuda/9.1
    cuda/4.1           cuda/8.0           cudadeconv/1.0
    cuda/4.1.bajk      cuda/8.0.61        cudalibs8to9/0.1
    
    module avail cudnn
    
    ----------------- /usr/local/Modules/modulefiles ------------------
    cudnn/5.1             cudnn/7.1.3-cuda9     cudnn/7.6.5-cuda10.1
    cudnn/5.1-DL          cudnn/7.3.0-cuda9     cudnn/8.0.5-cuda10.1
    cudnn/7.1.2-cuda8     cudnn/7.6.5.32-cuda10 cudnn/8.0.5-cuda11
    
  • Take advantage of tools like nvidia-smi to monitor GPU usage on the command line.

  • The type of GPU you request will depend on your needs. A common roadmap for machine learning users as they progress is as follows:

    If you’re just starting and need any GPU to prototype your code, and don’t want to wait in the queue: Start a desktop, terminal, or JupyterLab session with Strudel 2, and request a P4 or single T4. P4 availability is high which reduces queue times, and their compute power is sufficient for many machine learning and deep learning tasks.

    If you need a more powerful GPU and require interactivity: If you require interactivity but a P4 or T4 desktop isn’t sufficient (not enough RAM, GPU RAM, etc.), consider requesting an A40 desktop. If you require an interactive desktop with multiple GPUs for testing, the dual T4 desktop is also available, though it is less suitable for machine learning.

    If you need a more powerful GPU and you can afford to wait in the queue for it: Once you have a working model or require more advanced GPU hardware, accessing the V100 or A40 GPUs via an interactive session or by submitting jobs to the queue is appropriate. These GPUs are in high demand, so there is a wait time associated with accessing them.

    If you need more GPU RAM (vRAM): Some of our V100 nodes have GPUs with 32GB of GPU RAM, which you can learn how to request in our GPU look-up tables. Additionally, the A40 GPUs offer 48GB of GPU RAM per GPU.

    If you want multiple GPUs: There are 3 V100 GPUs per node on the m3g partition, allowing you to distribute your deep learning or machine learning code across 3 GPUs. Additionally, you can access 4 A40 GPUs per node. While there are 8 T4 GPUs per node, these are less suitable for machine learning workloads - it is likely fewer V100s or A40s will yield higher performance. If you require more than 4 GPUs, see below.

    If 4 GPUs isn’t enough for you: Consider applying for access to the DGX. Each DGX node offers 8 GPUs which are specialised for machine learning workloads. When applying, provide evidence that your code has scaled to run on multiple V100 GPUs, as the DGX requires you to use a minimum of 4 GPUs at once.
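
Whichever GPU you end up with, it is worth confirming that your session can actually see it before launching a long job. Below is a minimal sanity-check sketch, assuming PyTorch and/or TensorFlow are installed in your environment:

import torch
import tensorflow as tf  # comment out whichever framework you don't use

# Report whether PyTorch can see a CUDA device
print("PyTorch sees CUDA:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))

# Report which GPUs TensorFlow can see
print("TensorFlow sees GPUs:", tf.config.list_physical_devices('GPU'))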

Training#

Monash Data Fluency offers training courses relevant to the machine learning community, including:

  • Introduction to Python

  • Image analysis in Python with SciPy and scikit-image

  • Introduction to TensorFlow and Deep Learning

  • Introduction to Web Scraping

  • Deep Learning for Natural Language Processing

You can learn more about the workshops and sign up for the Data Fluency newsletter on their website.

Some of the training content that has been developed is available publicly in this GitHub repository.

Community Engagement#

The MASSIVE M3 Biannual Machine Learning Community Meetings

MASSIVE holds biannual community meetings in May and November each year, where we provide updates on work being done to better support the machine learning community using M3, share upcoming hardware procurement news, and gather feedback on the tools and resources the community would like to see. Invitations to these events are sent to all users of MASSIVE prior to the events.

ML4AU: The national machine learning community of practice for researchers

The Monash eResearch Centre and Data Science & AI Platform have been working with national partners to forge a community of practice for researchers using machine learning, ML4AU. You can learn more about the community of practice at the community website, https://www.ml4au.community/.

Quick guide for checkpointing#

Why checkpointing?#

Checkpointing refers to the practice of periodically saving the state of your running code. In an HPC environment, checkpointing ensures you will be able to resume training your models if you run into trouble with the maximum walltime being reached, blackouts, OS faults or other errors. Checkpointing also enables you to return to earlier states in training, which can be helpful when running experiments or in the case of model overfitting. Different libraries provide inbuilt methods for checkpointing your code; we’ve provided some examples here. These may not be the best way to checkpoint your code, and we encourage you to do some reading and research to find what works best for you.

PyTorch#

First, define a save_checkpoint function which handles the checkpointing strategy, such as which checkpoints to keep and how to serialise the state to file:

import torch

def save_checkpoint(state, condition, filename='/output/checkpoint.pth.tar'):
    """Save a checkpoint if the condition is achieved"""
    if condition:
        torch.save(state, filename)  # save checkpoint
    else:
        print("=> Validation condition not met")

Then, inside the training loop (usually a for loop over the number of epochs), we define the checkpoint frequency (here, the end of every epoch) and the information we want to save (the epoch, the model weights, and the best accuracy achieved):

# Training the Model
for epoch in range(num_epochs):
    train(...)  # Train
    acc = eval(...)  # Evaluate after every epoch

    # Some stuff with acc (accuracy)
    ...
    # Get a plain bool, not a ByteTensor
    is_best = bool(acc.numpy() > best_accuracy.numpy())
    # Keep track of the best accuracy seen so far
    best_accuracy = torch.FloatTensor([max(acc.numpy(), best_accuracy.numpy())])
    # Save the checkpoint if this is a new best
    save_checkpoint({
        'epoch': start_epoch + epoch + 1,
        'state_dict': model.state_dict(),
        'best_accuracy': best_accuracy
    }, is_best)
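
Note that this dictionary only captures the model weights. To resume optimisation exactly where it stopped, you will usually also want the optimizer state. A minimal sketch of the same call, assuming optimizer is the torch.optim optimizer used in your training loop:

# Sketch: also checkpoint the optimizer state (assumes `optimizer` is
# the torch.optim optimizer from your training setup)
save_checkpoint({
    'epoch': start_epoch + epoch + 1,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'best_accuracy': best_accuracy
}, is_best)

When resuming, the optimizer state can then be restored with optimizer.load_state_dict(checkpoint['optimizer']).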

To resume from a checkpoint, before the training we load the weights and the meta information we need:

checkpoint = torch.load(resume_weights)
start_epoch = checkpoint['epoch']
best_accuracy = checkpoint['best_accuracy']
model.load_state_dict(checkpoint['state_dict'])
print("=> loaded checkpoint '{}' (trained for {} epochs)".format(resume_weights, checkpoint['epoch']))

Keras#

Keras provides a set of functions called callbacks: you can think of them as events that trigger at a certain training state. The callback we need for checkpointing is ModelCheckpoint, which provides all the features we need according to the checkpointing strategy adopted.

from keras.callbacks import ModelCheckpoint

# Checkpoint in the /output folder
filepath = "/output/mnist-cnn-best.hdf5"

# Keep only a single checkpoint, the best over validation accuracy
checkpoint = ModelCheckpoint(filepath,
                             monitor='val_acc',
                             verbose=1,
                             save_best_only=True,
                             mode='max')
# Train
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          callbacks=[checkpoint])  # <- Apply our checkpoint strategy

Keras models have a load_weights() method which loads the weights from an HDF5 file. To load the model’s weights, add this line just after the model definition:

... # Model Definition
model.load_weights(resume_weights)
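
If you also want training to restart from the correct epoch rather than from zero, fit() accepts an initial_epoch argument. A minimal sketch, where last_epoch is a hypothetical variable holding the last completed epoch (a weights file alone does not record this for you):

# Sketch: resume training from a known epoch
# `last_epoch` is a hypothetical counter you track yourself
model.load_weights(resume_weights)
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          initial_epoch=last_epoch,
          validation_data=(x_test, y_test),
          callbacks=[checkpoint])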

TensorFlow#

TensorFlow 2 encourages users to take advantage of the Keras API, and likewise uses Keras to save checkpoints while training a model. As described above, Keras provides a set of functions called callbacks: you can think of callbacks as events that trigger at a certain training state. The callback we need for checkpointing is ModelCheckpoint.

This code shows how to save the weights of a model at regular epoch intervals using the Keras API in TensorFlow 2.

import os
import tensorflow as tf

# Define where the checkpoints should be stored
# This saves the checkpoints from each epoch with a corresponding name
checkpoint_filename = "./checkpoints/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_filename)

# Note that save_freq counts batches, not epochs, so to save every
# 5 epochs we save every (5 * batches per epoch) batches
batches_per_epoch = len(x_train) // batch_size

# Create a callback to save the weights every 5 epochs
cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_filename,
                                                 verbose=1,
                                                 save_weights_only=True,
                                                 save_freq=5 * batches_per_epoch)

# Train the model, using our callback to save the weights every 5 epochs
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test),
          callbacks=[cp_callback])

Once we’ve done this, we can load the weights into a model to resume training or start a new model with them.

# To load the weights into a model, first get the latest checkpoint
latest_cp = tf.train.latest_checkpoint(checkpoint_dir)

# Load the latest weights into the model from the checkpoint
new_model.load_weights(latest_cp)
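
Keep in mind that load_weights() restores the weights only. If you would rather capture the architecture, weights and optimizer state in a single artefact, one option is to save the whole model; a minimal sketch:

# Sketch: save and restore an entire model (architecture, weights
# and optimizer state) rather than weights alone
model.save('./checkpoints/full_model')

# Later, or in a separate script:
restored_model = tf.keras.models.load_model('./checkpoints/full_model')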

PyTorch, TensorFlow and CUDA#

A user’s guide to installing and using these packages is given HERE.