This documentation is under active development and may change as we refine it. Please email firstname.lastname@example.org if you require assistance or have suggestions for improving this documentation.
DGX on M3¶
DGX-1V is NVIDIA's purpose-built deep learning server.
Each box consists of:
8 x Tesla V100-SXM2 (32GB RAM each)
NVIDIA CUDA Cores: 40,960
960 TFLOPS (GPU FP16)
NVLink within each box provides rapid inter-GPU communication
2 x 20-core Intel Xeon E5-2698 v4, 2.2 GHz
512 GB system RAM, 2133 MHz DDR4
4 x IB 100 Gb/s (EDR), dual 10 GbE
How to access the DGX hardware¶
The DGX hardware is currently specialized, and to gain access users must make a request via the access form.
This will allow us to:
keep track of and review the allocations;
monitor usage patterns by research area; and
start engaging the community and provide support.
Access is via our SLURM scheduler, and approved users must specify the following in their submission scripts.
#SBATCH --partition=dgx
#SBATCH --qos=dgx
The maximum walltime is 24 hours (equivalent to more than 7 days of workstation run time).
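Putting these directives together, a minimal submission script might look like the following sketch. The GPU count, module name and training command are assumptions to adapt to your own job; only the partition, QOS and walltime limit come from this page:

```shell
#!/bin/bash
#SBATCH --job-name=dgx-train
#SBATCH --partition=dgx
#SBATCH --qos=dgx
#SBATCH --gres=gpu:4           # assumption: request 4 of the 8 V100s
#SBATCH --cpus-per-task=10
#SBATCH --time=24:00:00        # the 24-hour maximum walltime

module load cuda               # assumption: module name may differ on M3
srun python train.py           # assumption: your own training command
```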
What jobs are suitable?¶
For a job to run on the DGX, it must make use of this machine's specialized hardware; that is, we would like to see it have the following attributes:
high efficiency across multiple GPUs (more than 3)
command line only (non-GUI)
If a job does not meet these requirements, users are encouraged to use the other GPU hardware provided by M3.
Some jobs that would be suitable include:
- Deep learning
TensorFlow (note that its built-in function for scaling out training is slower than other tools)
And potentially others.
For more information on checkpointing with some deep learning tools, please see https://docs.massive.org.au/communities/deep-learning.html
How do you demonstrate a suitable project/job?¶
The overall principle is to leverage the specialised nature of the hardware.
Run jobs on V100 nodes with monitoring of GPU, CPU and I/O performance
Show restart/resume scripts and techniques
Show scaling of code
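One common restart/resume technique is to checkpoint regularly and have the job resubmit itself shortly before the walltime limit. A hedged sketch of this pattern follows; the signal timing, checkpoint path and training command are assumptions, not a prescribed M3 workflow:

```shell
#!/bin/bash
#SBATCH --partition=dgx
#SBATCH --qos=dgx
#SBATCH --time=24:00:00
#SBATCH --signal=B:USR1@600    # warn the batch shell 10 min before the limit

CKPT_DIR=$HOME/checkpoints     # assumption: wherever your tool writes checkpoints

# On the early-warning signal, queue a fresh copy of this script so the
# next allocation resumes from the latest checkpoint.
trap 'sbatch "$0"' USR1

# Run the workload in the background and wait, so the trap can fire
# while the batch shell is waiting rather than blocked on the command.
python train.py --checkpoint-dir "$CKPT_DIR" &   # assumption: example command
wait
```

This relies on the training tool itself detecting and loading the most recent checkpoint on startup, which most deep learning frameworks support.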
We can assist you with your job script to log GPU and CPU usage and improve the performance of your analysis.
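As a starting point, here is a sketch of lightweight GPU logging you could add inside a job script; the sampling interval, log file name and workload command are assumptions:

```shell
# Sample GPU utilisation and memory once a minute, in the background.
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
           --format=csv -l 60 > gpu_usage.csv &
GPU_LOG_PID=$!

python train.py                # assumption: your actual workload goes here

kill "$GPU_LOG_PID"            # stop logging when the workload finishes
```

The resulting CSV can be used to confirm the job is actually keeping the V100s busy, which is the key criterion for DGX access.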