Attention

This documentation is under active development and may change as we refine it. Please email help@massive.org.au if you require assistance or have suggestions to improve this documentation.

DGX on M3#

Hardware#

DGX-1V is NVIDIA's purpose-built deep learning server.

Each box consists of:

  • 8 x Tesla V100-SXM2 GPUs (32 GB RAM each)

  • 40,960 NVIDIA CUDA cores

  • 960 TFLOPS (GPU FP16)

  • NVLink within each box for rapid inter-GPU communication

  • 2 x 20-core Intel Xeon E5-2698 v4, 2.2 GHz

  • 512 GB 2133 MHz DDR4 system memory

  • 4 x 100 Gb/s InfiniBand (EDR), dual 10 GbE

How to access the DGX hardware#

The DGX hardware is specialised, so to gain access users must make a request via the form.

This will allow us to:

  • keep track of and review the allocations;

  • monitor the usage patterns, based on your research; and

  • start engaging the community and provide support.

Access is via our SLURM scheduler; approved users must specify the following in their submission scripts:

#SBATCH --partition=dgx
#SBATCH --qos=dgx

The maximum walltime is 24 hours (equivalent to more than 7 days of run time on a typical workstation).
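Putting these directives together, a minimal submission script might look like the sketch below. Only the partition, QoS and walltime come from this page; the job name, GPU request, module and training script are illustrative assumptions, so check the M3 documentation for the exact options your job needs.

```shell
#!/bin/bash
#SBATCH --job-name=dgx-job      # illustrative job name
#SBATCH --partition=dgx         # required for DGX access (from above)
#SBATCH --qos=dgx               # required for DGX access (from above)
#SBATCH --gres=gpu:4            # assumed GPU request syntax; verify on M3
#SBATCH --time=24:00:00         # maximum walltime on the DGX partition

# Load your software environment (module name is an assumption)
module load cuda

# Launch your analysis (train.py is a hypothetical placeholder)
python train.py
```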

What jobs are suitable?#

For a job to run on the DGX, it must make use of this machine's specialised hardware, i.e. we would like to see it have the following attributes:

  • uses NVLink for inter-GPU communication

  • runs efficiently on more than 3 GPUs

  • checkpointable analysis

  • runs from the command line (non-GUI)

  • scales efficiently

If your job does not meet these requirements, we recommend using the other GPU hardware provided by M3.

Some jobs that would be suitable include:

  • Deep learning
    • TensorFlow (note that its built-in function for scaling out training is slower than other tools)

    • Horovod

    • DALI (NVIDIA's data loading library)

    • PyTorch

  • Cryo-EM
    • RELION

    • MotionCor

    • crYOLO

And potentially others.

For more information on checkpointing with some machine learning tools, please see our machine learning community documentation.

How do you demonstrate a suitable project/job?#

The overall principle is to leverage the specialised nature of the hardware.

Specifics:

  • Run jobs on V100 nodes and monitor GPU, CPU and I/O performance

  • Show restart/resume scripts and techniques

  • Show how your code scales
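As an illustration of a restart/resume technique, a job script can detect an existing checkpoint and resume from it. This is only a sketch: the checkpoint file name and the `--resume` flag are hypothetical and depend on what your analysis code supports.

```shell
#!/bin/bash
# Resume from a checkpoint if one exists; otherwise start fresh.
# checkpoint.pt and the --resume flag are hypothetical examples.
CKPT=checkpoint.pt

if [ -f "$CKPT" ]; then
    echo "Resuming from $CKPT"
    ARGS="--resume $CKPT"
else
    echo "No checkpoint found, starting fresh"
    ARGS=""
fi

# Replace 'echo' with your real launch command, e.g. python train.py $ARGS
echo python train.py $ARGS
```

Because SLURM jobs on the DGX are limited to 24 hours, a pattern like this lets a long-running analysis be resubmitted repeatedly and continue from its last checkpoint.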

We can assist you with your job script to log GPU and CPU usage in order to improve the performance of your analysis.
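One common way to log GPU usage from within a job script is to sample `nvidia-smi` in the background while the analysis runs. This is a sketch assuming `nvidia-smi` is on the node's path; the log file name and the `train.py` launch line are illustrative placeholders.

```shell
#!/bin/bash
# Sample GPU utilisation and memory every 60 seconds in the background;
# gpu_usage.log is an illustrative file name.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
           --format=csv -l 60 > gpu_usage.log &
SMI_PID=$!

# Run your analysis (train.py is a hypothetical placeholder)
python train.py

# Stop the background sampler once the analysis finishes
kill "$SMI_PID"
```

The resulting CSV log can be inspected after the job to see whether the GPUs were actually kept busy, which is the kind of evidence that demonstrates a suitable DGX workload.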