Scheduled Maintenance#

We have scheduled quarterly outages for the M3 cluster. This is to ensure we communicate scheduled outages to our HPC users in advance. Where possible, we perform rolling upgrades with the cluster online. However, sometimes we have to perform upgrades that require the cluster to be taken offline. These include:

  • system software upgrades;

  • network maintenance;

  • bug and security patches; and

  • hardware maintenance

Upcoming maintenance:

  • March 12-14 2024 - for essential file system and network switch updates (TBC)

Dates are:

  • 19/04/2023 FS03 update

  • 7-8/12/21 (COMPLETED)

  • Third-quarter maintenance is delayed until further notice.

  • 09/02/21 (COMPLETED)

  • 3-4/12/20 (COMPLETED)

  • 16/09/20 (POSTPONED)

  • 24/06/20 (COMPLETED)

  • 25/03/20 (COMPLETED)

  • 11/12/19 (COMPLETED)

  • 18/09/19 (COMPLETED)

  • 26/06/19 (COMPLETED)

  • 20/03/19 (COMPLETED)

  • 12/12/18 (COMPLETED)

  • 05/09/18 (COMPLETED)

  • 06/06/18 (COMPLETED)

  • 14/03/18 (COMPLETED)

  • 07/03/18 (POSTPONED)

  • 09/01/18 (COMPLETED)

  • 06/12/17 (COMPLETED)

  • 20/09/17 (COMPLETED)

  • 05/07/17 (COMPLETED)

  • 22/03/17 (COMPLETED)


19 April 2023: FS03 update

Please be advised that MASSIVE M3 will be undergoing a three-day scheduled maintenance commencing Wednesday, the 19th of April, 2023.

The following work will be conducted on /scratch:

  • a check of the file system quotas;

  • an update to the firmware on the Lustre servers; and

  • an upgrade of the system software underpinning the file system.

During the maintenance window, M3 will not be accessible and all nodes will be drained of running jobs prior to the start of the maintenance. Our apologies in advance for any inconvenience this may cause.

We will communicate further details seven days prior to the maintenance window.

For any queries, please contact us at help@massive.org.au.

** M3 Scheduled Maintenance: Network Uplift - 7-8 December 2021 - COMPLETED **

Please be advised that M3 will be offline for network upgrade during these two days:

Tuesday, the 7th of December (07:00 AEDT) Wednesday, the 8th of December (22:00 AEDT).

A complete system outage is necessary as the underpinning network fabric will be upgraded. Specifically, we will be configuring all network devices to implement the RDMA over Converged Ethernet (RoCE) version 2.

This upgrade will require a reboot of all network switches and network clients, including all login, data-transfer, compute nodes and the file system servers. We expect an improved network infrastructure under RoCEv2, with better network traffic flow control and reduced congestion. Extensive testing and validation will be conducted after the implementation of the change.

During the two-day maintenance period, you will not be able to login to the cluster (via Desktop, SSH, rsync, or sftp) and there will be no access to your data on M3. A reservation is now in place to ensure that running jobs complete before the start of the maintenance window. Pending jobs will remain on the queue and will only start to run after service restoration.

For any queries, please feel free to contact us at help@massive.org.au

5-6 JUL 2021 Maintenance

Our regular maintenance has been scheduled for Monday Jul 5th 8:00 AM AEDT through Tuesday Jul 6th 6:00 PM AEDT.

Due to the nature of the upcoming work, a full system outage is required. This upgrade is a necessary preparatory step for an upcoming change to the network infrastructure to reduce the packet congestion in the fabric. This two-day scheduled maintenance will allow us to:

  • Upgrade the software version of the Switch OS on network switches;

  • Verify and validate the configuration of all switches;

  • Perform extensive functional and benchmarking tests on the entire fabric to generate a baseline for the upcoming changes to the network fabric;

  • Update the SLURM scheduler to version 20.02.06; and

  • Conduct maintenance work on NFS servers that host M3 software stack and home directories.

If you have any queries or concerns regarding this outage, please contact help@massive.org.au

9 FEB 2021 Maintenance - COMPLETED

The M3 scheduled maintenance has now completed.

The following works were conducted during this maintenance:

  • Network interconnect changed from a multi-rail config to a single-rail config;

  • Operating system updates on the Lustre file servers;

  • Driver and firmware updates for network cards of the file servers;

  • Update the monitoring service for the file system;

  • Apply security patches on the login and compute nodes; and

  • Update driver on GPU nodes to enable the use of CUDA 11.2.

If you have any queries or concerns regarding this outage, please contact help@massive.org.au

9 FEB 2021 Maintenance

Our regular maintenance has been scheduled for 9 Feb 2021, starting 8:00 a.m. - 6:00 p.m.

A full system outage is required as this maintenance will involve changes in the network configuration of the file system servers in order to diagnose the poor I/O performance on /scratch which is being caused by a congestion on our network interconnect. M3 will be offline during the outage window and users will not be able to access the login and data transfer nodes via ssh, rsync, or Strudel Web/Desktop. This work is important as preparation for a major networking upgrade that we are planning for to address the /scratch congestion issue.

For your interest we will be undertaking the following work on the file system servers:

  • Changing the configuration of the network interconnect from a multi-rail config to a single-rail config;

  • Update the operating system;

  • Driver and firmware updates for network cards; and

  • Update monitoring service for the file system.

If you have any queries or concerns regarding this outage, please contact help@massive.org.au

3-4 DEC 2020 Maintenance - COMPLETED

The M3 scheduled maintenance has now completed.

We have lifted the reservation and pending jobs have started to run. You will be able to submit jobs or start Strudel Desktop sessions. The login and data transfer nodes are now available for access.

The following works were conducted during this maintenance:

  • Upgraded the file system to the patched version;

  • Updated the firmware of the high-speed network cards on the file servers; and

  • Conducted a series of functional tests and I/O benchmark tests.

We thank you again for your kind patience. This was a critical activity to ensure that /projects allocations can continue until a new file system to replace /projects is commissioned in 2021.

If you have any queries or concerns regarding this maintenance, please contact help@massive.org.au.

3-4 DEC 2020 Maintenance

Our regular maintenance has been scheduled for Thursday Dec 3rd 8:00 AM AEDT through Friday Dec 4th 6:00 PM AEDT.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:
  • Update the firmware of the filesystem servers and storage arrays;

  • Decommission the old /scratch filesystem;

  • Recable the network switches to expand cluster low latency fabric; and

  • After the upgrade, adding /projects filesystem to the centralised software monitoring system.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.

26 AUG 2020 Maintenance - COMPLETED

The M3 maintenance is now completed.

For your interest, the following works were conducted:

  • The SFA OS on the disk arrays were updated from version 11.8.0 to 11.8.1;

    This is an urgent update to fix a vulnerability that has a small potential to cause data corruption during a disk rebuild. Our vendor has strongly suggested that we undertake this as soon as possible and therefore we are scheduling this in advance of our regular maintenance schedule. This is not related to other file system issues we have been experiencing.

  • Hypervisor components were upgraded: BIOS, firmware, kernel, OS; and

  • The NVIDIA driver on GPU nodes were upgraded.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.

24 JUN 2020 Maintenance - COMPLETED

The M3 scheduled maintenance is now completed.

For your interest, the following works have been conducted:

  • Synchronisations of projects to new /scratch file system;;

  • Validation of file counts, file permissions, usage and quota;

  • Data integrity checks on a significant sample of migrated files;

  • Validation and testing of user applications on the new file system;

  • Release of the new /scratch file system; and

  • Upgrade of network switches.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


25 MAR 2020 Maintenance - COMPLETED

The M3 scheduled maintenance is now completed.

For your interest, the following works have been conducted:

  • Upgrade and apply security patches on the login nodes;

  • Upgrade and apply security patches for home directory server; and

  • Recable the network switches to expand cluster low latency fabric.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


25 MAR 2020 Maintenance

Our regular maintenance has been scheduled for 25 Mar 2020, starting 8:00 a.m. - 6:00 p.m.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:

  • Upgrade and apply security patches on the login nodes;

  • Upgrade and apply security patches for home directory server;

  • Recable the network switches to expand cluster low latency fabric; and

  • Reproduce and troubleshoot Lustre file servers issue.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


11 DEC 2019 Maintenance - COMPLETED

The M3 scheduled maintenance is now completed.

For your interest, the following works will be conducted:

  • Patching Lustre clients on the compute and login nodes;

  • Troubleshooting Lustre file servers under heavy workload;

  • Troubleshooting network switches that connected to the Lustre servers; and

  • Upgrading m3-dtn1.

For your information, this will be our last scheduled maintenance for 2019. If you have any queries or concerns regarding this outage, contact help@massive.org.au.


11 DEC 2019 Maintenance

Our regular maintenance has been scheduled for 11 Dec 2019, starting 8:00 a.m. - 6:00 p.m.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:

  • Patching Lustre clients on the compute and login nodes;

  • Troubleshooting Lustre file servers under heavy workload;

  • Troubleshooting network switches that connected to the Lustre servers; and

  • Upgrading m3-dtn1.

For your information, this will be our last scheduled maintenance for 2019. If you have any queries or concerns regarding this outage, contact help@massive.org.au.


18 SEP 2019 Maintenance - COMPLETED

The M3 scheduled maintenance is now completed.

For your interest, the following works have been conducted during this maintenance:

  • Upgrading the data link for CEPH fabric;

  • Patching and upgrading the NFS servers;

  • Upgrading hypervisor components; and

  • Patching and upgrading networking switches.

We didn’t manage to upgrade all the switches and work will be continued in the background in the coming weeks.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


18 SEP 2019 Maintenance

Our regular maintenance has been scheduled for 18 Sep 2019, starting 7:00 a.m. - 9:00 p.m.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:

  • Upgrading the data link on CEPH fabric;

  • Patching and upgrading the NFS servers;

  • Upgrading hypervisor components on management nodes; and

  • Patching and upgrading spine and leaf switches.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


26 JUN 2019 Maintenance - COMPLETED

The M3 scheduled maintenance is now completed.

All the jobs have been resumed.

For your interest, the following works have been conducted:

  • Upgrading hypervisor compute components;

  • Patching and upgrading the login node and management node;

  • Upgrading BIOS and firmware for Lustre servers; and

  • Upgrading disk firmware for Lustre servers.


26 JUN 2019 Maintenance - COMPLETED

Our regular system maintenance has been scheduled for Wednesday the 26th of June from 8:00 a.m. to 5:00 p.m.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Web/Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:

  • Patching and upgrading the NFS server;

  • Upgrading hypervisor compute components;

  • Patching and upgrading the login nodes and data transfer node; and

  • Patching and upgrading the compute nodes;

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


20 MAR 2019 Maintenance - COMPLETED

Our regular maintenance has been scheduled for 20 Mar 2019, starting 8:00 a.m. - 5:00 p.m.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:

  • Upgrading the job scheduler;

  • Patching and upgrading the NFS server;

  • Configuring the database server; and

  • Upgrading hypervisor compute components.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


12 DEC 2018 Maintenance - COMPLETED

The M3 scheduled maintenance is now completed.

All the jobs have been resumed.

For your interest, the following works have been conducted:

  • NFS server uplift;

  • Upgrade operating system on the login and management nodes; and

  • Upgrade operating system on the hypervisors.


12 DEC 2018 Maintenance - COMPLETED

Our regular maintenance has been re-scheduled for 12 Dec starting 8:00 a.m. - 5:00 p.m.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:

  • Moving home dir to a filesystem with quota enforced;

  • Adding more vCPU on the NFS server;

  • Upgrade operating system on the login and management nodes; and

  • Upgrade operating system on the hypervisors.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


Research network maintenance 28 Nov Wed 9:00 a.m. - 5:00 p.m. - COMPLETED

Monash network team is going to carry out a network change to enable new routing method and automation at the switches in Monash research network. Network switches often require updating to take advantage of improved functionality and stability. During the update, each network fabric will be interrupted for around 30 seconds.

As M3 is sitting on a number of network fabrics, jobs will be paused at 9 a.m. and we are anticipating a disruption to the user home directories, Strudel desktop jobs and login access at different point throughout the day.

If you have any queries or concerns regarding this outage, contact help@massive.org.au.


5 SEP 2018 Maintenance - COMPLETED

For this outage, we have completed the following works:

  • Upgraded storage firmware;

  • Upgraded hardware firmware on the Lustre server; and

  • Fixed MTU issue on the host networking;

We are running behind with our schedules due to some unforeseen issues and over the next few days, we are going to complete the following task:

  • Upgrade operating system on the login and management nodes


5 SEP 2018 Maintenance - COMPLETED

Our regular maintenance has been scheduled for 5 Sep starting 8:30 a.m. - 5:30 p.m.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:

  • Upgrade storage firmware;

  • Upgrade hardware firmware on the Lustre server;

  • Fix MTU issue on the host networking;

  • Upgrade operating system on the login and management nodes; and

  • Apply bug fixes on the job scheduler.


6 JUN 2018 Maintenance - COMPLETED

The M3 scheduled maintenance is now completed.

For this outage, we have completed the following works:

  • Upgrade operating system on the switches;

  • Upgrade hardware firmware on the GPU nodes; and

  • Maintenance on the job scheduler (SLURM).

As part of this maintenance, we have also introduced Quality of Service (QoS) to M3 queue. All existing job submission scripts should still be working. But you should start considering using QoS in the future.

Quality of Service (QoS) provides a mechanism to allow a fair model to users who ask for the right resources in the job submission script. The quality of service associated with a job will affect the job in three ways:

  • Job Scheduling Priority

  • Job Preemption

  • Job Limits

A summary of the available QoS:

Queue

Description

Max Walltime

Max GPU per user

Max CPU per user

Priority

Preemption

normal

Default QoS for every job

7 days

10

280

50

No

rtq

QoS for interactive job

48 hours

4

36

200

No

irq

QoS for interruptable job

7 days

No limit

No limit

100

Yes

shortq

QoS for job with short walltime

30 mins

10

280

100

No

Note

Priority: The higher the number the higher the priority is

For more information about how to use QoS, you can follow the link to our documentation page: http://docs.massive.org.au/M3/slurm/using-qos.html

We have also changed an algorithm in calculating fairshare in the queue by enabling priority flag in SLURM.

Lastly, we have now opened up the access to the new Skylake nodes. In order to introduce the new nodes, we have also created a new partition that consists of all the nodes, for more information about the new partition, follow the link to our user documentation page: http://docs.massive.org.au/M3/slurm/check-cluster-status.html


6 JUN 2018 Maintenance

Our regular maintenance has been scheduled for 6 June starting 8:30 a.m. - 5:30 p.m.

A full system outage is required as this will impact the job scheduler. M3 will be offline during the outage window and users will not be able to access the login nodes via ssh or Strudel Desktop (remote desktop).

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

For your interest, the following works will be conducted:

  1. Upgrade operating system on the switches;

  2. Upgrade hardware firmware on the compute nodes; and

  3. Maintenance on the job scheduler.


14 MAR 2018 maintenance - COMPLETED

The M3 scheduled maintenance is now completed.

As a result of running the new job scheduler, the software that was previously compiled with old libraries might not work. We have made a new openmpi module available as well, which has been compiled with the new job scheduler libraries:

openmpi/1.10.7-mlx(default)

Here is a list of software that might be affected by the change:

  • abyss

  • chimera

  • dynamo

  • emspring

  • fasttree

  • fsl

  • gdcm

  • geant4

  • gromacs

  • lammps

  • mpifileutils

  • netcdf

  • openfoam

  • python

  • R

  • relion

  • scipion

  • vtk


09 JAN 2018 Maintenance - COMPLETED

As part of the filesystem expansion, we need to completely shut down the storage; an outage is scheduled for 9 Jan. 2018 from 8:00 a.m. - 9:00 p.m.

A full system outage is required and this will impact the access to the project directories and running jobs. M3 will be offline during the maintenance window so users will not be able to access the login nodes via SSH or Strudel Desktop (remote desktop).


06 DEC 2017 Maintenance - COMPLETED

We are still working on m3-dtn1.massive.org.au server but job scheduler is back online. All the remaining nodes are back online.

The following works were completed during the scheduled outage:

  • Physical move of two Lustre servers to make space for the filesystem expansion;

  • ZFS updates on NFS services; and

  • A set of scripts has been pushed out to lower the resources for Strudel Large Desktop

The reason for this change is to make desktop utilization more efficient and to provide more large-scale desktops available to our user community.

From today, M3 Large desktops will be halved as follows:

Configuration   Old Large Desktop       New Large Desktop
-------------   -----------------       -----------------
Memory          240GB                   120GB
GPU             4 x K80                 2 x K80
CPU             2 x 12 Cores            1 x 12 Cores

Based on our analysis we do not expect the majority of users to notice this change or be negatively impacted.

We understand that some users genuinely require a large GPU desktop configuration change might impact you, and if you ever need to run a desktop job that would require more than two GPUs, send us a request and we are more than happy to assist and attend to the request.


06 DEC 2017 Maintenance - COMPLETED

The final quarterly maintenance for this year is scheduled for 6 Dec. from 8:30 a.m. - 4:30 p.m.

A full system outage is required and this will impact access to the project directories and running jobs. M3 will be offline during the maintenance window so users will not be able to access the login nodes via SSH or Strudel Desktop (remote desktop).

For your interest, the following works will be conducted:

  • Physical move of two Lustre servers to make space for the filesystem expansion

  • System updates for NFS services

  • A set of scripts will be pushed out to lower the resources for Strudel Large Desktop

A reservation has been created for this outage. Jobs that would run into this outage will not start so please take a look at your script if your job doesn’t start as expected in the queue.

Based on usage data and user demand, we are making changes to the Desktop offerings on M3. The main change is the introduction of more Large Desktop resources to users. This involves decreasing the number of the GPUs allocated to each of these jobs. This will allow more users to utilise the GPU nodes in a more efficient way. If you ever need to run a desktop job that would require more than two GPUs, send us a request and we will be more than happy to assist you.


20 SEP 2017 Maintenance - COMPLETED

The following works were completed during the scheduled outage:

  • NIC firmware upgraded on the management servers;

  • Kernel update on the m3-login1;

  • Kernel update on management servers’ hypervisor;

  • ZFS on the NFS services were updated.

Over the next few days, we are going to continue upgrading the cluster nodes. Nodes will be offline as required but there won’t be any interruption to the job scheduler.

Jobs should be running now and please report any issues you encounter with M3 to help@massive.org.au.