This documentation is under active development and may change as we refine it. Please email email@example.com if you require assistance or have suggestions for improving this documentation.
DGX on M3#
The DGX-1V is NVIDIA’s purpose-built deep learning server.
Each box consists of:
8 x Tesla V100-SXM2 (32GB RAM each)
NVIDIA CUDA Cores: 40,960
960 TFLOPS (GPU FP16)
NVLink within each box provides rapid inter-GPU communication
2 x 20-core Intel Xeon E5-2698 v4 2.2 GHz
512 GB system memory (2,133 MHz DDR4)
4x IB 100Gb/s (EDR), Dual 10 GbE
How to access the DGX hardware#
Access to the DGX hardware is currently restricted; to gain access, users must make a request via the form.
This will allow us to:
keep track of and review the allocations;
monitor the usage patterns, based on your research; and
start engaging the community and provide support.
Access is via our SLURM scheduler, and approved users must specify the following in their submission scripts.
#SBATCH --partition=dgx
#SBATCH --qos=dgx
The maximum walltime is 24 hours (equivalent to more than 7 days of run time on a typical workstation).
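Putting these directives together, a minimal submission script might look like the sketch below. The job name, resource counts, module name, and script path are all placeholders, not M3-specific values; check the scheduler documentation and `module avail` on M3 for the exact options and software you need.

```shell
#!/bin/bash
#SBATCH --job-name=dgx-train        # placeholder job name
#SBATCH --partition=dgx             # required: DGX partition
#SBATCH --qos=dgx                   # required: DGX QoS
#SBATCH --gres=gpu:V100:4           # example: request 4 of the 8 V100s
#SBATCH --time=24:00:00             # maximum walltime on the DGX
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Load your software environment (module name is illustrative)
module load cuda

# Run your workload (script name is a placeholder)
python train.py
```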
What jobs are suitable?#
For a job to run on the DGX, it must make use of the specialized hardware of this machine, i.e. we would like to see it have the following attributes:
high-efficiency use of multiple GPUs (more than 3)
command line (non-GUI)
If a job does not meet these requirements, we recommend using the other GPU hardware provided by M3.
Some jobs that would be suitable include:
Deep learning
TensorFlow (note that TensorFlow’s built-in mechanism for scaling out training is slower than other tools)
And potentially others.
For more information on checkpointing with some machine learning tools, please see our machine learning community documentation.
How do you demonstrate a suitable project/job?#
The overall principle is to leverage the specialised nature of the hardware.
Run jobs on V100 nodes with monitoring of GPU, CPU and I/O performance
Show restart/resume scripts and techniques
Show scaling of code
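As a sketch of the restart/resume idea above, the wrapper below records the last completed step in a checkpoint file, so a resubmitted job continues where it left off instead of starting over. The file name, step count, and loop body are illustrative; a real job would checkpoint application state such as model weights.

```shell
#!/bin/bash
# Hypothetical restart/resume wrapper (file name and step count are illustrative).
CKPT=checkpoint.txt

# Resume from the recorded step if a checkpoint exists, otherwise start at 0.
if [ -f "$CKPT" ]; then
    start=$(cat "$CKPT")
else
    start=0
fi

total=5   # illustrative number of work units
for ((step = start; step < total; step++)); do
    # ... one resumable unit of work goes here ...
    echo $((step + 1)) > "$CKPT"   # record progress after each unit
done

echo "completed $total steps"
```

If the walltime expires partway through, the same script can simply be resubmitted and it will skip the already-completed units.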
We can assist you with adding GPU and CPU usage logging to your job script in order to improve the performance of your analysis.
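One simple logging pattern, shown as a job-script fragment below, is to run `nvidia-smi` in the background to sample GPU utilisation while the job runs. The sampling interval, log file name, and workload command are arbitrary choices for illustration.

```shell
# Fragment for a job script: sample GPU stats every 60 s in the background,
# run the workload, then stop the sampler.
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used \
           --format=csv -l 60 > "gpu_usage_${SLURM_JOB_ID}.log" &
MONITOR_PID=$!

python train.py   # placeholder for your actual workload

kill "$MONITOR_PID"
```

The resulting CSV log can then be used to check whether the job actually keeps the GPUs busy, which is the key evidence of suitability described above.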