This documentation is under active development and may change over time as we refine it. Please email us if you require assistance or have suggestions to improve this documentation.

DGX on M3#


DGX-1V is NVIDIA’s purpose-built Deep Learning server.

Each box consists of:

  • 8 x Tesla V100-SXM2 (32GB RAM each)

  • NVIDIA CUDA Cores: 40,960

  • 960 TFLOPS (GPU FP16)

  • NVLink within each box, providing rapid inter-GPU communication

  • 2 x 20-core Intel Xeon E5-2698 v4 2.2 GHz

  • 512 GB system memory, 2133 MHz DDR4

  • 4x IB 100Gb/s (EDR), Dual 10 GbE

How to access the DGX hardware#

The DGX hardware is specialised, so access is currently restricted: users must make a request via the form.

This will allow us to:

  • keep track of and review the allocations;

  • monitor the usage patterns, based on your research; and

  • start engaging the community and provide support.

Access is via our SLURM scheduler, and approved users must specify the following in their submission scripts:

#SBATCH --partition=dgx
#SBATCH --qos=dgx

The maximum walltime is 24 hours (equivalent to more than 7 days of workstation run time).
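Putting these directives together, a minimal submission script might look like the sketch below. The `--partition` and `--qos` values are as documented above; the GPU count, CPU count, module name, and training script are illustrative assumptions, not M3-specific values.

```shell
#!/bin/bash
#SBATCH --job-name=dgx-train        # illustrative job name
#SBATCH --partition=dgx             # required for DGX access
#SBATCH --qos=dgx                   # required for DGX access
#SBATCH --gres=gpu:8                # assumption: request all 8 V100s
#SBATCH --time=24:00:00             # the maximum walltime on the DGX
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40

# The module and training script below are placeholders; substitute
# whatever your own workflow uses.
module load cuda
srun python train.py
```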

What jobs are suitable?#

For a job to run on the DGX, it must make use of this machine's specialised hardware; ideally, it should have the following attributes:

  • use NVLink

  • high-efficiency use of multiple GPUs (more than 3)

  • checkpointable analysis

  • command line (non-GUI)

  • scales efficiently

If your job does not meet these requirements, we recommend using the other GPU hardware provided by M3.
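As a rough way to check the "scales efficiently" criterion, you can compute parallel efficiency from measured run times. The following is a sketch using standard shell tools, with made-up example timings; values near 1.0 indicate good scaling.

```shell
#!/bin/sh
# Parallel efficiency = t1 / (n * tn), where t1 is the single-GPU
# run time, tn is the run time on n GPUs.
efficiency() {
    t1=$1; tn=$2; n=$3
    awk -v t1="$t1" -v tn="$tn" -v n="$n" \
        'BEGIN { printf "%.2f", t1 / (n * tn) }'
}

# Example (made-up timings): 400 s on 1 GPU, 110 s on 4 GPUs.
efficiency 400 110 4   # prints 0.91
```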

Some jobs that would be suitable include:

  • Deep learning
    • TensorFlow (note that its built-in mechanism for scaling out training is slower than other tools)

    • Horovod

    • DALI (NVIDIA Data Loading Library)

    • PyTorch

  • Cryo-EM
    • Relion

    • MotionCor2

    • crYOLO

And potentially others.

For more information on checkpointing with some machine learning tools, please see our machine learning community documentation.
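A common checkpointable pattern is for the job script to detect an existing checkpoint and resume from it; combined with the 24-hour walltime, the job can then be resubmitted until the analysis completes. The checkpoint path and the `--resume` flag below are hypothetical, for illustration only; substitute your tool's own checkpoint mechanism.

```shell
#!/bin/sh
# Sketch of a resume-aware launcher. CKPT and --resume are
# hypothetical placeholders, not a real tool's interface.
CKPT="checkpoints/latest.pt"

build_args() {
    if [ -f "$CKPT" ]; then
        # A checkpoint exists: resume from it.
        echo "--resume $CKPT"
    else
        # First run: start from scratch.
        echo ""
    fi
}

# The training command would then become, e.g.:
#   python train.py $(build_args)
```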

How do you demonstrate a suitable project/job?#

The overall principle is to leverage the specialised nature of the hardware.


  • Run jobs on V100 nodes with monitoring of GPU, CPU and I/O performance

  • Show restart/resume scripts and techniques

  • Show scaling of code

We can assist you with your job script to log GPU and CPU usage, in order to improve the performance of your analysis.
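As a starting point, the fragment below (intended for use inside a job script) logs GPU and CPU utilisation in the background while your analysis runs. The 30-second sampling interval and log file names are arbitrary choices.

```shell
# Fragment for a SLURM job script: background samplers that record
# utilisation alongside the main analysis.

# GPU utilisation and memory every 30 s (CSV, one row per GPU).
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
           --format=csv -l 30 > gpu_usage.log &
GPU_LOG_PID=$!

# CPU/memory snapshots every 30 s via top in batch mode.
top -b -d 30 > cpu_usage.log &
CPU_LOG_PID=$!

# ... run your analysis here ...

# Stop the samplers when the job finishes.
kill "$GPU_LOG_PID" "$CPU_LOG_PID"
```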