Attention

This documentation is under active development, meaning that it can change over time as we refine it. Please email help@massive.org.au if you require assistance, or have suggestions to improve this documentation.

Bioinformatics

MASSIVE supports the bioinformatics, genomics and translational medicine community with storage and compute services. On these pages we will provide examples of workflows and software settings for running common bioinformatics software on M3.

Requesting an account on M3

If you are requesting an account on M3 and are working in collaboration with the Monash Bioinformatics Platform, please ensure you indicate this in the application and request that the appropriate Platform members are added to your M3 project. This will enable them to assist in your analysis.

The Genomics partition

There is a dedicated partition for high impact genomics projects. You can apply for access here.

Your access will be reviewed, usually within 2 business days. You will receive a confirmation email when access is approved.

Once approved, you can follow the instructions at this link to find out how to start using the Genomics partition in your SLURM job script.

The Genomics partition comprises 960 cores built specifically for bioinformatics jobs with short wall times. Short jobs, which account for over 90% of bioinformatics jobs on MASSIVE, are prioritised.

The partition is also accessible to all MASSIVE users at a lower priority, for interruptible jobs. Use the irq QoS to run interruptible jobs when the nodes are not otherwise in use. If you intend to run interruptible jobs, make sure your script checkpoints its work, because your job will be re-queued whenever higher-priority jobs arrive.
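A minimal sketch of a re-queue-friendly job script follows, assuming a simple marker-file approach to checkpointing. The irq QoS name comes from the text above; the partition name genomics, the checkpoints directory and the run_step helper are hypothetical and should be replaced with values that suit your project. Each step writes a .done marker when it succeeds, so a re-queued job skips work it has already finished.

```shell
#!/bin/bash
#SBATCH --job-name=irq-example
#SBATCH --qos=irq              # interruptible QoS named in the text above
#SBATCH --partition=genomics   # ASSUMPTION: confirm the real partition name
#SBATCH --time=00:30:00
#SBATCH --requeue              # allow SLURM to put the job back in the queue

# Hypothetical checkpoint directory: one marker file per finished step.
CHECKPOINT_DIR="${CHECKPOINT_DIR:-$PWD/checkpoints}"
mkdir -p "$CHECKPOINT_DIR"

# run_step NAME COMMAND...: run COMMAND unless NAME has already completed.
run_step() {
    local name="$1"; shift
    if [ -e "$CHECKPOINT_DIR/$name.done" ]; then
        echo "skip $name"
        return 0
    fi
    # Only record the marker when the command succeeds.
    "$@" && touch "$CHECKPOINT_DIR/$name.done"
}

# Placeholder steps; replace the echo commands with real analysis commands.
run_step step1 echo "running step 1"
run_step step2 echo "running step 2"
```

Marker files only protect whole steps. A long-running tool that is interrupted part-way through a step will start that step again, so it helps to keep individual steps short.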

There is a regular technical review process to understand the usage of this partition, in the form of a meeting which happens a minimum of twice a year (more frequently if there are matters to address). If you wish to participate in the review process, please indicate your interest via the MASSIVE helpdesk.

Getting started with the Bioinformatics module

Importing the Bioinformatics module environment

M3 has a number of bioinformatics packages available in the default set of modules. These include versions of bwa, bamtools, bcftools, bedtools, GATK, bcl2fastq, BEAST, BEAST2, bowtie, bowtie2, cufflinks, cytoscape, fastx-toolkit, kallisto, macs2, muscle, phyml, picard, qiime2, raxml, rstudio, samtools, star, sra-tools, subread, tophat, varscan and vep (this list is not exhaustive).

A software stack of additional packages (known as bio-ansible) is maintained by the Monash Bioinformatics Platform (MBP).

Tools are periodically added as required by the user community.

Modules maintained by MBP are installed at: /usr/local2/bioinformatics/

To access these additional tools, type:

source /usr/local2/bioinformatics/bioansible_env.sh

This loads the bio-ansible modules into your environment alongside the default M3 modules. If you use these tools frequently, you may want to add this line to your .profile or .bash_profile.
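One way to do this is sketched below, assuming a bash login shell. The readability check is an addition for illustration, not part of the MBP instructions; it simply stops the line from raising an error if the same profile is used on a machine where the path does not exist.

```shell
# Hypothetical ~/.bash_profile snippet: load the bio-ansible modules at login,
# but only when the environment file is actually readable on this machine.
BIOANSIBLE_ENV=/usr/local2/bioinformatics/bioansible_env.sh
if [ -r "$BIOANSIBLE_ENV" ]; then
    source "$BIOANSIBLE_ENV"
fi
```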

To list all modules:

module avail

You will see the additional tools listed under the /usr/local2/bioinformatics/software/modules/bio section, followed by the default M3 modules.

Installing additional software with Bioconda

In addition to the pre-installed modules available on M3, Bioconda provides a streamlined way to install reproducible environments of bioinformatics tools in your home directory.

Conda is already installed on M3 under the anaconda module.

module load anaconda

conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

These channels will now be listed in ~/.condarc and will be searched when installing packages by name.
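To confirm the result, the configured channels can be printed as in the sketch below. The show_conda_channels helper is a hypothetical name used only for illustration. Because conda config --add places each new channel at the front of the list, the channel added last (bioconda here) should appear first and take the highest priority.

```shell
# show_conda_channels: print the channel list conda will search, in priority order.
# Falls back to a hint when conda is not on PATH (for example, before `module load anaconda`).
show_conda_channels() {
    if command -v conda >/dev/null 2>&1; then
        conda config --show channels
    else
        echo "conda not found: run 'module load anaconda' first"
    fi
}

show_conda_channels
```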

Create a Conda environment for your analysis / pipeline / project

Conda works by installing a set of pre-compiled tools and their dependencies in self-contained ‘environments’ which you can switch between. Unlike modules loaded via module load, you can only have a single Conda environment active at one time.

For example, to create an environment my_proj_env with specific versions of STAR, subread and asciigenome:

# Change this to your M3 project ID
export PROJECT=df22
export CONDA_ENVS=/projects/$PROJECT/$USER/conda_envs

mkdir -p $CONDA_ENVS

module load anaconda

conda create --yes -p $CONDA_ENVS/my_proj_env star=2.5.4a subread=1.5.2 asciigenome=1.12.0

# To use the environment, activate it
conda activate $CONDA_ENVS/my_proj_env

# To leave the environment, deactivate it
conda deactivate

You can search for packages on the command line with: conda search <package_name>, or on the web using the Bioconda recipes list.

For further details see the official Bioconda documentation.

“Pipelines” and workflow managers on M3

Running NextFlow on M3

NextFlow is a popular workflow manager that helps create portable, reproducible and resumable pipelines. It’s well documented, actively developed and there are an increasing number of great example pipelines available, making it relatively easy to adopt or write your own. NextFlow has good support for SLURM and runs on M3. Workflows can be made more portable and reproducible by assigning Singularity containers or Conda environments to each task.

Installing NextFlow

The official NextFlow installation instructions work on M3. This is a great way to get started quickly. NextFlow can also be installed into a Conda environment (conda install -c bioconda nextflow - see the Create a conda environment for your analysis / pipeline / project section above) - this can be a better way to reproducibly manage your NextFlow version(s) in the longer term.

NextFlow configuration for M3

Create a nextflow.config file in the directory with your .nf workflow (or create ~/.nextflow/config to make it global for all workflows). Here’s a good starting point for M3:

executor {
    name = 'slurm'
    queueSize = 200
    pollInterval = '10 sec'
    queueStatInterval = '10m'
}

singularity {
    enabled = true
    runOptions = '-B /scratch -B /projects'
    autoMounts = true
}

process {
    executor = 'slurm'
    stageInMode = 'symlink'
    errorStrategy = 'retry'
    maxRetries = 3
    cache = 'lenient'
    beforeScript = 'module load singularity/3.5.3'
    clusterOptions = {
        qos = task.time <= 30.minutes ? 'shortq' : 'normal'
        partition = task.time <= 30.minutes ? 'short,comp' : 'comp'
        return "--qos=${qos} --partition=${partition}"
    }
}

profiles {
    local {
        executor {
            name = 'local'
            queueSize = 32
            pollInterval = '30 sec'
        }
        process {
            executor = 'local'
            stageInMode = 'symlink'
            errorStrategy = 'retry'
            maxRetries = 5
            cache = 'lenient'
            beforeScript = 'module load singularity/3.5.3'
        }
    }
}

This config defaults to submitting tasks to the queue using your default SLURM account. For testing you may wish to use the local profile via the NextFlow command line option -profile local - this is helpful for interactive testing in an smux session, but be careful not to accidentally run heavy tasks on the login node! For more detail on tweaking the configuration to your needs, see the Nextflow Configuration docs and the many examples at nf-core/configs.

(Thanks to Jason Steen for providing the starting point for this M3-specific config)
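The clusterOptions closure in the config above routes each task by its requested wall time: tasks of 30 minutes or less go to the shortq QoS on the short or comp partitions, and everything else goes to the normal QoS on comp. The same rule is restated here as a small shell function, purely for illustration. The slurm_opts name is hypothetical and is not part of Nextflow or SLURM.

```shell
#!/bin/bash
# slurm_opts MINUTES: print the options that the config's clusterOptions
# closure would pass to sbatch for a task requesting MINUTES of wall time.
slurm_opts() {
    local minutes="$1"
    if [ "$minutes" -le 30 ]; then
        echo "--qos=shortq --partition=short,comp"
    else
        echo "--qos=normal --partition=comp"
    fi
}

slurm_opts 1     # a 1-minute task, like those in a quick test workflow
slurm_opts 120   # a 2-hour task
```

In practice this means that setting an accurate time directive on each process matters: a task that requests 30 minutes or less is eligible for the short queue.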

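If you’d like to run a quick test workflow, try the simple workflow below. Save it as example.nf in the same directory as your nextflow.config. Note that it uses Nextflow's original DSL1 channel syntax (into / from), which newer Nextflow releases no longer accept, so it needs a Nextflow version that still supports DSL1.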

#!/usr/bin/env nextflow

// This example workflow takes a string, splits it into chunks
// and capitalizes each chunk. Output is dumped to stdout, in the
// order that tasks finish (so chunks may not be returned in their
// original order).

params.str = 'abcdefghijklmnopqrstuvwxyz'

process splitLetters {
    cpus 1
    memory '16M'
    time '1m'

    output:
    file 'chunk_*' into letters
    """
    printf '${params.str}' | split -b 8 - chunk_
    """
}
process convertToUpper {
    cpus 1
    memory '16M'
    time '1m'

    input:
    file x from letters.flatten()

    output:
    stdout result

    """
    cat $x | tr '[a-z]' '[A-Z]'
    """
}

// Print each capitalized chunk as its task finishes
result.view { it.trim() }

Then run it on the queue like this:

nextflow run example.nf
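To see what the two processes in the example workflow do without involving Nextflow or SLURM, the same commands can be run directly in a shell. This is only a sketch for checking your understanding of the workflow; the temporary working directory is an addition for illustration.

```shell
#!/bin/bash
# Work in a throwaway directory so the chunk files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same command as the splitLetters process: write 8-byte chunks
# named chunk_aa, chunk_ab, and so on.
printf 'abcdefghijklmnopqrstuvwxyz' | split -b 8 - chunk_

# Same command as the convertToUpper process, applied to each chunk in turn.
for x in chunk_*; do
    upper=$(cat "$x" | tr '[a-z]' '[A-Z]')
    echo "$x: $upper"
done
```

The 26-letter string yields four chunks, the last holding only the two remaining letters. Under Nextflow each chunk becomes a separate SLURM task, which is why the output order can differ between runs.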

FAQ

Q: You have version XX and I need version YY. How do I get the software?

A: Consider installing the software in your home directory with Conda. The Bioconda project helps streamline this process with many pre-packaged tools for bioinformatics. If you are unable to install the version you need, please contact the helpdesk at help@massive.org.au.