Attention

This documentation is under active development and may change as we refine it. Please email help@massive.org.au if you require assistance or have suggestions to improve this documentation.

DGX on M3#

Hardware#

DGX-1V is NVIDIA's purpose-built deep learning server.

Each box consists of:

  • 8 x Tesla V100-SXM2 GPUs (32 GB RAM each)

  • 40,960 NVIDIA CUDA cores

  • 960 TFLOPS (GPU FP16)

  • NVLink within each box for rapid inter-GPU communication

  • 2 x 20-core Intel Xeon E5-2698 v4, 2.2 GHz

  • 512 GB 2133 MHz DDR4 system memory

  • 4 x 100 Gb/s InfiniBand (EDR), dual 10 GbE

How to access the DGX hardware#

The DGX hardware is specialised, so to gain access users must make a request via the form.

This will allow us to:

  • keep track of and review the allocations;

  • monitor the usage patterns, based on your research; and

  • start engaging the community and provide support.

Access is via our SLURM scheduler; approved users must specify the following in their submission scripts:

#SBATCH --partition=dgx
#SBATCH --qos=dgx

The maximum walltime is 24 hours (equivalent to more than 7 days of run time on a typical workstation).
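Putting these directives together, a minimal submission script might look like the sketch below. Only the partition, QoS and walltime come from this page; the job name, GPU request, module and training script are illustrative assumptions, so check the M3 documentation for the exact options your job needs.

```shell
#!/bin/bash
#SBATCH --job-name=dgx-job      # illustrative job name
#SBATCH --partition=dgx         # required for DGX access (from above)
#SBATCH --qos=dgx               # required for DGX access (from above)
#SBATCH --gres=gpu:4            # assumed GPU request syntax; verify on M3
#SBATCH --time=24:00:00         # maximum walltime on the DGX partition

# Load your software environment (module name is an assumption)
module load cuda

# Launch your analysis (train.py is a hypothetical placeholder)
python train.py
```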

What jobs are suitable?#

For a job to run on the DGX, it must make use of this machine's specialised hardware, i.e. we would like to see it have the following attributes:

  • uses NVLink for inter-GPU communication

  • runs efficiently on more than 3 GPUs

  • checkpointable analysis

  • runs from the command line (non-GUI)

  • scales efficiently

If your job does not meet these requirements, we recommend using the other GPU hardware provided by M3.

Some jobs that would be suitable include:

  • Deep learning
    • TensorFlow (note that its built-in function for scaling out training is slower than other tools)

    • Horovod

    • DALI (NVIDIA's data loading library)

    • PyTorch

  • Cryo-EM
    • RELION

    • MotionCor

    • crYOLO

And potentially others.

For more information on checkpointing with some machine learning tools, please see our machine learning community documentation.

How do you demonstrate a suitable project/job?#

The overall principle is to leverage the specialised nature of the hardware.

Specifics:

  • Run jobs on V100 nodes and monitor GPU, CPU and I/O performance

  • Show restart/resume scripts and techniques

  • Show how your code scales
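As an illustration of a restart/resume technique, a job script can detect an existing checkpoint and resume from it. This is only a sketch: the checkpoint file name and the `--resume` flag are hypothetical and depend on what your analysis code supports.

```shell
#!/bin/bash
# Resume from a checkpoint if one exists; otherwise start fresh.
# checkpoint.pt and the --resume flag are hypothetical examples.
CKPT=checkpoint.pt

if [ -f "$CKPT" ]; then
    echo "Resuming from $CKPT"
    ARGS="--resume $CKPT"
else
    echo "No checkpoint found, starting fresh"
    ARGS=""
fi

# Replace 'echo' with your real launch command, e.g. python train.py $ARGS
echo python train.py $ARGS
```

Because SLURM jobs on the DGX are limited to 24 hours, a pattern like this lets a long-running analysis be resubmitted repeatedly and continue from its last checkpoint.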

We can assist you with your job script to log GPU and CPU usage in order to improve the performance of your analysis.
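One common way to log GPU usage from within a job script is to sample `nvidia-smi` in the background while the analysis runs. This is a sketch assuming `nvidia-smi` is on the node's path; the log file name and the `train.py` launch line are illustrative placeholders.

```shell
#!/bin/bash
# Sample GPU utilisation and memory every 60 seconds in the background;
# gpu_usage.log is an illustrative file name.
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
           --format=csv -l 60 > gpu_usage.log &
SMI_PID=$!

# Run your analysis (train.py is a hypothetical placeholder)
python train.py

# Stop the background sampler once the analysis finishes
kill "$SMI_PID"
```

The resulting CSV log can be inspected after the job to see whether the GPUs were actually kept busy, which is the kind of evidence that demonstrates a suitable DGX workload.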