Welcome to the M3 user guide

Important

Update on M3 network and file system issues

22 March 2021 - Normal Service

On the 18th of March, the Monash HPC team, the Cloud team, the eSolutions Network team and hardware vendors worked late into the night to perform a major change on the research network to reduce congestion on the data fabric.

The team tuned the switch port buffer sizes on all 18 leaf switches in the Monash-03 network to eliminate the high latencies and packet drops previously observed on the fabric. Stress and functional tests on the file systems were also conducted at different stages of the maintenance to validate and measure the effects of the changes.

As a result of the change, the M3 /scratch and M3 /projects file systems have improved performance. User jobs are running, and we invite you to use M3 /scratch as normal.

Preparations are underway on the next steps, which include:

  • upgrading the network switches, which has been partially applied; and

  • upgrading to RoCE v2 (RDMA over Converged Ethernet), which will be scheduled within one month.

Important

Planned maintenance outages

We have scheduled quarterly outages for the M3 cluster. This is to ensure we communicate scheduled outages to our HPC users in advance. Where possible, we perform rolling upgrades with the cluster online. However, sometimes we have to perform upgrades that require the cluster to be taken offline. These include:

  • system software upgrades;

  • network maintenance;

  • bug and security patches; and

  • hardware maintenance.

This site contains the documentation for the MASSIVE HPC systems. M3 is the newest addition to the facility.

Using M3

Communities