Attention

This documentation is under active development, meaning that it can change over time as we refine it. Please email help@massive.org.au if you require assistance, or have suggestions to improve this documentation.

Machine Learning on M3

Software

There are a number of machine learning packages available on M3.

Caffe

To use Caffe on M3:

nvidia-modprobe -u
module load caffe
your-caffe-script-here

TensorFlow

To use TensorFlow on M3:

# Loading module
module load tensorflow/2.0.0-beta1

# Unloading module
module unload tensorflow/2.0.0-beta1

PyTorch

To use PyTorch on M3:

# Loading module
module load pytorch/1.3-cuda10

# Unloading module
module unload pytorch/1.3-cuda10

Keras

Keras uses TensorFlow as a backend, and the Keras project advises users to use tf.keras going forward, as it is better maintained and integrates well with TensorFlow features. To read more about these recommendations, see https://keras.io/.

This means that to use Keras on M3, you need to load TensorFlow:

# Loading module
module load tensorflow/2.0.0-beta1

Then, in your Python code, import Keras from TensorFlow and code as usual.

# Import Keras
from tensorflow import keras

# Coding in Keras here

Scikit-learn

Scikit-learn comes installed on M3 and can be imported as usual in Python without any extra steps.

Reference datasets

MASSIVE hosts machine learning data collections to reduce pressure on user storage, minimise download wait times, and give researchers easy access. The currently hosted data collections are listed below, and we are open to hosting others that are valuable to the community. To request that a data collection be added, or to see more information about the data collections hosted on M3, see Data Collections on M3.

Machine learning data collections on M3                   | Version    | Date of download
--------------------------------------------------------- | ---------- | ----------------
ImageNet                                                  | Fall 2011  |
International Skin Imaging Collaboration 2019 (ISIC 2019) | August 2019 | 2020-10-02
NIH Chest X-ray Dataset (NIH CXR-14)                      | 2017-09-26 | 2020-10-29

Quick guide for checkpointing

Why checkpointing?

Checkpoints in machine learning and deep learning experiments protect you from losing work when a job reaches its maximum walltime or is interrupted by a power outage, an OS fault, or another unexpected error. They are also useful when you simply want to resume training from a particular state, for example to run new experiments or try different settings.

PyTorch

First, define a save_checkpoint function that decides whether a checkpoint should be kept and handles its serialisation to a file:

import torch


def save_checkpoint(state, condition, filename='/output/checkpoint.pth.tar'):
    """Save a checkpoint if the condition is met."""
    if condition:
        torch.save(state, filename)  # serialise the state dictionary to disk
    else:
        print("=> Validation condition not met")
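The function above overwrites a single file. If you instead write one file per epoch, you may want to cap how many are kept so that checkpoints do not fill your storage quota. The following is a minimal sketch using only the Python standard library. The helper name prune_checkpoints and the file naming scheme checkpoint-<epoch>.pth.tar are assumptions made for illustration; they are not part of PyTorch or of M3.

```python
import os
import re


def prune_checkpoints(directory, keep=3, pattern=r"checkpoint-(\d+)\.pth\.tar$"):
    """Delete all but the `keep` highest-numbered checkpoints in `directory`.

    The naming scheme matched by `pattern` is an assumption of this sketch.
    Returns the names of the files that were removed, oldest first.
    """
    regex = re.compile(pattern)
    found = []
    for name in os.listdir(directory):
        match = regex.match(name)
        if match:
            # Sort on the integer epoch so that epoch 10 comes after epoch 9
            found.append((int(match.group(1)), name))
    found.sort()  # lowest epoch first
    to_remove = found[:-keep] if keep > 0 else found
    removed = []
    for _, name in to_remove:
        os.remove(os.path.join(directory, name))
        removed.append(name)
    return removed
```

Calling prune_checkpoints(checkpoint_dir, keep=3) after each save would leave at most three checkpoint files in that directory and leave any unrelated files untouched.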

Then, inside the training loop (usually a for loop over the number of epochs), we define the checkpoint frequency (here, the end of every epoch) and the information we want to save (the epoch number, the model weights and the best accuracy achieved):

# Training the model
for epoch in range(num_epochs):
    train(...)  # Train for one epoch (placeholder for your own function)
    acc = evaluate(...)  # Evaluate after every epoch (placeholder for your own function)

    # Some stuff with acc (accuracy)
    ...
    # Get a plain bool rather than a tensor
    is_best = bool(acc.item() > best_accuracy.item())
    # Keep the greater tensor to track the best accuracy
    best_accuracy = torch.max(acc, best_accuracy)
    # Save a checkpoint if this is a new best
    save_checkpoint({
        'epoch': start_epoch + epoch + 1,
        'state_dict': model.state_dict(),
        'best_accuracy': best_accuracy
    }, is_best)
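Because the loop above writes to the same checkpoint file each time, a job that is killed part-way through a write (for example when the walltime limit is reached) could leave a truncated file behind. One common way to guard against this is to write to a temporary file and then rename it over the old checkpoint. The sketch below uses only the standard library and pickle as the default serialiser; the helper name atomic_save is hypothetical.

```python
import os
import pickle
import tempfile


def atomic_save(state, filename, dump=pickle.dump):
    """Write `state` to `filename` via a temporary file and a rename.

    `dump` is any callable taking (object, binary file handle).
    """
    directory = os.path.dirname(os.path.abspath(filename))
    # Create the temporary file next to the target so the rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, filename)  # replaces the old checkpoint in one step
    except BaseException:
        # Clean up the partial temporary file, then re-raise
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Since torch.save accepts an open binary file as its second argument, passing dump=torch.save should let the same helper write PyTorch checkpoints, although this has not been verified on M3.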

To resume from a checkpoint, load the weights and the metadata we need before training starts:

checkpoint = torch.load(resume_weights)
start_epoch = checkpoint['epoch']
best_accuracy = checkpoint['best_accuracy']
model.load_state_dict(checkpoint['state_dict'])
print("=> loaded checkpoint '{}' (trained for {} epochs)".format(resume_weights, checkpoint['epoch']))
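The snippet above assumes that a checkpoint file already exists, so it would fail on the very first run of a job. A small wrapper can fall back to default values in that case. This is a sketch under stated assumptions: the helper names load_checkpoint_or_default and pickle_loader, the example path, and the default values are all hypothetical, and pickle stands in for the real loader so that the example runs without PyTorch.

```python
import os
import pickle


def load_checkpoint_or_default(path, loader, default):
    """Return loader(path) if the file exists, otherwise a copy of `default`."""
    if path and os.path.isfile(path):
        return loader(path)
    return dict(default)


def pickle_loader(path):
    """Stand-in loader for this sketch; with PyTorch you would pass torch.load."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical path and defaults: on a first run there is nothing to load,
# so training starts from epoch 0.
checkpoint = load_checkpoint_or_default(
    "./checkpoints/checkpoint.pkl",
    pickle_loader,
    {"epoch": 0, "best_accuracy": 0.0},
)
start_epoch = checkpoint["epoch"]
```

With PyTorch, the default best_accuracy would need to be a tensor to match the training loop, and model.load_state_dict would only be called when a checkpoint was actually found.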

Keras

Keras provides a set of functions called callbacks: you can think of them as events that are triggered at certain stages of training. The callback we need for checkpointing is ModelCheckpoint, which provides all the features we need for the checkpoint strategy adopted here.

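from tensorflow.keras.callbacks import ModelCheckpoint

# Store the checkpoint in the /output folder
filepath = "/output/mnist-cnn-best.hdf5"

# Keep only a single checkpoint: the one with the best validation accuracy.
# The monitored name must match a metric reported during training:
# tf.keras in TensorFlow 2 reports 'val_accuracy' when the model is compiled
# with metrics=['accuracy'], while older standalone Keras reports 'val_acc'.
checkpoint = ModelCheckpoint(filepath,
                             monitor='val_accuracy',
                             verbose=1,
                             save_best_only=True,
                             mode='max')
# Train
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          callbacks=[checkpoint])  # <- Apply our checkpoint strategy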
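To make the effect of save_best_only=True with mode='max' concrete, the following pure-Python sketch mimics the decision rule: a checkpoint is written only when the monitored value strictly improves on the best seen so far. It is a simplified illustration, not Keras's actual implementation; the function name best_only_epochs and the accuracy values are made up.

```python
def best_only_epochs(history, mode="max"):
    """Return the 1-based epochs at which a 'save best only' policy would save.

    `history` holds the monitored value after each epoch,
    for example the validation accuracy.
    """
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    best = None
    saved = []
    for epoch, value in enumerate(history, start=1):
        improved = (
            best is None  # the first epoch always counts as an improvement
            or (mode == "max" and value > best)
            or (mode == "min" and value < best)
        )
        if improved:
            best = value
            saved.append(epoch)
    return saved


# Made-up validation accuracies for five epochs
print(best_only_epochs([0.71, 0.78, 0.75, 0.78, 0.81]))  # [1, 2, 5]
```

Epoch 4 does not trigger a save because matching the previous best is not an improvement, so the file on disk always holds the earliest epoch that reached the best value.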

Keras models have a load_weights() method, which loads the weights from an HDF5 file. To load the model's weights, add this line just after the model definition:

... # Model Definition
model.load_weights(resume_weights)

TensorFlow

TensorFlow 2 encourages users to take advantage of the Keras API, and likewise uses Keras to save checkpoints while training a model. As described above, Keras provides a set of functions called callbacks: you can think of them as events that are triggered at certain stages of training. The callback we need for checkpointing is ModelCheckpoint.

This code shows how to save the weights of a model at regular epoch intervals using the Keras API in TensorFlow 2.

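import os

import tensorflow as tf

# Define where the checkpoints should be stored
# This saves the checkpoint from each epoch with a corresponding name
checkpoint_filename = "./checkpoints/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_filename)

# Create a callback to save the weights every 5 epochs.
# Note: an integer save_freq does not count epochs (it counts samples or
# batches, depending on the TensorFlow version). In TensorFlow 2.0 the
# period argument sets the interval in epochs; it is deprecated in later releases.
cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_filename,
                                                 verbose=1,
                                                 save_weights_only=True,
                                                 # How often to save, in epochs
                                                 period=5)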

# Train the model, using our callback to save the weights every 5 epochs
model.fit(x_train, y_train,
          batch_size = batch_size,
          epochs = epochs,
          validation_data = (x_test, y_test),
          callbacks = [cp_callback])
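The {epoch:04d} placeholder in the checkpoint filename is filled in with Python's standard string formatting, so you can preview the file names without running any training. The sketch below uses a template of the same form as the one above; the helper name checkpoint_name and the epoch numbers are chosen purely for illustration.

```python
import os

template = "./checkpoints/cp-{epoch:04d}.ckpt"


def checkpoint_name(template, epoch):
    """Fill the epoch placeholder in the same way str.format would."""
    return template.format(epoch=epoch)


names = [checkpoint_name(template, epoch) for epoch in (5, 10, 15)]
print(names)
# ['./checkpoints/cp-0005.ckpt', './checkpoints/cp-0010.ckpt', './checkpoints/cp-0015.ckpt']

# The directory part is what gets passed to tf.train.latest_checkpoint
print(os.path.dirname(template))  # ./checkpoints
```

Zero-padding the epoch to four digits keeps the names in epoch order when they are sorted alphabetically, which makes a directory listing easier to read.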

Once we’ve done this, we can load the weights into a model to resume training or start a new model with them.

# To load the weights into a model, first get the latest checkpoint
latest_cp = tf.train.latest_checkpoint(checkpoint_dir)

# Load the latest weights into the model from the checkpoint
new_model.load_weights(latest_cp)