Attention

This documentation is under active development, meaning that it can change over time as we refine it. Please email help@massive.org.au if you require assistance, or have suggestions to improve this documentation.

Bioinformatics#

MASSIVE supports the bioinformatics, genomics and translational medicine community with storage and compute services. On these pages we will provide examples of workflows and software settings for running common bioinformatics software on M3.

Requesting an account on M3#

If you are requesting an account on M3 and are working in collaboration with the Monash Bioinformatics Platform, please ensure you indicate this in the application and request that the appropriate Platform members are added to your M3 project. This will enable them to assist in your analysis.

The Genomics partition#

There is a dedicated partition for high impact genomics projects. You can apply for access here.

Your access will be reviewed, usually within 2 business days. You will receive a confirmation email when access is approved.

Once approved, you can follow the instructions at this link to find out how to start using the Genomics partition in your SLURM job script.

The Genomics partition comprises 960 cores provisioned specifically for bioinformatics jobs with short wall times. We prioritise these short jobs, which account for over 90% of bioinformatics jobs on MASSIVE.

The partition is also accessible to all MASSIVE users at a lower priority, as interruptible jobs. You can use the irq QOS to run interruptible jobs when the nodes are otherwise idle. Please note that if you intend to run interruptible jobs, your script needs to checkpoint its work, as your job will be requeued when higher priority jobs arrive.
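
As a sketch (assuming the partition and QOS names genomics and irq described above; the job name, resource requests and analysis command are placeholders, and the exact settings for your project are covered in the linked instructions), a job script header for the Genomics partition might look like this:

#!/bin/bash
# Submit to the Genomics partition (short jobs, up to the 4 hour wall time limit)
#SBATCH --job-name=example_alignment
#SBATCH --partition=genomics
#SBATCH --qos=genomics
#SBATCH --time=0-4:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# For a lower-priority interruptible job, swap the QOS line for:
# #SBATCH --qos=irq
# and make sure the commands below checkpoint their progress.

srun my_analysis_command   # placeholder: replace with your actual analysis command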

There is a regular technical review process to understand the usage of this partition, in the form of a meeting which happens a minimum of twice a year (more frequently if there are matters to address). If you wish to participate in the review process, please indicate your interest via the MASSIVE helpdesk.

Getting started with the Bioinformatics module#

Importing the Bioinformatics module environment#

M3 has a number of bioinformatics packages available in the default set of modules. These include versions of bwa, bamtools, bcftools, bedtools, GATK, bcl2fastq, BEAST, BEAST2, bowtie, bowtie2, cufflinks, cytoscape, fastx-toolkit, kallisto, macs2, muscle, phyml, picard, qiime2, raxml, rstudio, samtools, star, sra-tools, subread, tophat, varscan and vep (this list should not be regarded as exhaustive).
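
To check which versions of a tool are available and load one (samtools is used here only as an example; the exact module names and versions on M3 may differ, so check the output of module avail):

module avail samtools     # list the available samtools modules
module load samtools      # or module load samtools/<version> for a specific version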

A software stack of additional packages (known as bio-ansible) is maintained by the Monash Bioinformatics Platform (MBP).

Tools are periodically added as required by the user community.

Modules maintained by MBP are installed at: /usr/local2/bioinformatics/

To access these additional tools, type:

source /usr/local2/bioinformatics/bioansible_env.sh

This loads the bio-ansible modules into your environment alongside the default M3 modules. If you are using this frequently you might like to source this in your .profile / .bash_profile.
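
For example (a sketch only; adjust to however you manage your shell startup files), you could append the source line to your ~/.bash_profile:

echo 'source /usr/local2/bioinformatics/bioansible_env.sh' >> ~/.bash_profile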

To list all modules:

module avail

You will see the additional tools listed under the /usr/local2/bioinformatics/software/modules/bio section, followed by the default M3 modules.

Installing additional software with Bioconda#

In addition to the pre-installed modules available on M3, Bioconda provides a streamlined way to install reproducible environments of bioinformatics tools in your home directory.

Conda is already installed on M3 under the anaconda module.

module load anaconda

conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

These channels will now be listed in ~/.condarc and will be searched when installing packages by name.
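
You can confirm this by viewing ~/.condarc. Because conda config --add prepends, the most recently added channel (bioconda) has the highest priority; the file should look roughly like the following (your file may contain other settings as well):

cat ~/.condarc
# channels:
#   - bioconda
#   - conda-forge
#   - defaults
#   - r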

Create a Conda environment for your analysis / pipeline / project#

Conda works by installing a set of pre-compiled tools and their dependencies into self-contained ‘environments’ which you can switch between. Unlike modules loaded via module load, only one Conda environment can be active at a time.

For example, to create an environment my_proj_env with specific versions of STAR, subread and asciigenome:

# Change this to your M3 project ID
export PROJECT=df22
export CONDA_ENVS=/projects/$PROJECT/$USER/conda_envs

mkdir -p $CONDA_ENVS

module load anaconda

conda create --yes -p $CONDA_ENVS/my_proj_env star=2.5.4a subread=1.5.2 asciigenome=1.12.0

# To use the environment, activate it
conda activate $CONDA_ENVS/my_proj_env

# To leave the environment, deactivate it
conda deactivate

You can search for packages on the command line with: conda search <package_name>, or on the web using the Bioconda recipes list.

For further details see the official Bioconda documentation.

Pipelines and workflow managers on M3#

Running NextFlow on M3#

NextFlow is a popular workflow manager that helps create portable, reproducible and resumable pipelines. It is well documented and actively developed, and there is an increasing number of good example pipelines available, making it relatively easy to adopt a community-supported pipeline or write your own.

NextFlow has good support for SLURM and runs on M3. Workflows can be made more portable and reproducible by assigning Singularity containers or Conda environments to each task.

Installing NextFlow#

The official NextFlow installation instructions work on M3 and are a quick way to get started. NextFlow can also be installed into a Conda environment (conda install -c bioconda nextflow - see the Create a Conda environment for your analysis / pipeline / project section above); since NextFlow requires Java, this can be an easy way to ensure you are using a supported Java version.
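
Following the official instructions, installation comes down to downloading the launcher script and making sure it is on your PATH; for example (the ~/bin location is only a suggestion, and NextFlow also needs a recent Java available, e.g. from a Java module or a Conda environment):

# Download the NextFlow launcher into the current directory
curl -s https://get.nextflow.io | bash

# Optionally move it somewhere on your PATH
mkdir -p ~/bin && mv nextflow ~/bin/

# Check it runs
nextflow -version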

NextFlow configuration for M3#

Create a nextflow.config file in the directory with your .nf workflow (or create ~/.nextflow/config to make it global for all workflows). Here’s a good starting point for M3:

/*
// If you have access to the genomics partition, this function allows you to try jobs there first,
// and retries jobs that exceed the 4 hour walltime limit on 'comp'.

def partition_switch(attempt) {(attempt == 1) ? '--partition=genomics --qos=genomics --time=0-4:00:00' :
                                                '--partition=comp --time=7-00:00:00'}
*/

executor {
    name = 'slurm'
    queueSize = 200
    pollInterval = '10 sec'
    queueStatInterval = '10m'
}

singularity {
    enabled = true
    runOptions = '-B /scratch -B /projects -B /fs02 -B /fs03 -B /fs04'
    autoMounts = true
}

process {
    executor = 'slurm'
    stageInMode = 'symlink'
    errorStrategy = 'retry'
    maxRetries = 3
    cache = 'lenient'
    beforeScript = 'module load singularity'

    /*
      You may wish to customize `clusterOptions` for your situation.
      `clusterOptions` are like the options you might usually
       use with #SBATCH - but **don't** specify time, memory or cpus / tasks
       in `clusterOptions` - Nextflow does that per-task
    */

    // If you have a particular M3 project ('account') you want to use, specify it here
    // clusterOptions = '--partition comp --account example123'

    // If you are using the genomics partition, consider using
    // this version of clusterOptions instead
    // clusterOptions = { partition_switch(task.attempt) }
}

profiles {
    local {
        executor {
            name = 'local'
            queueSize = 32
            pollInterval = '30 sec'
        }
        process {
            executor = 'local'
            stageInMode = 'symlink'
            errorStrategy = 'retry'
            maxRetries = 5
            cache = 'lenient'
            beforeScript = 'module load singularity/3.5.3'
        }
    }
}

This config defaults to submitting tasks to the queue using your default SLURM account. For testing you may wish to use the local profile via the NextFlow command line option -profile local - this is helpful for interactive testing in an smux session, but be careful not to accidentally run heavy tasks on the login node! For more detail on tweaking the configuration to your needs, see the Nextflow Configuration docs and the many examples at nf-core/configs.
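
For example (assuming the example.nf workflow shown below and the config above), the same workflow can be run entirely on the local node rather than via SLURM:

nextflow run example.nf -profile local --inputPath=fastqs --outputPath=output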

(Thanks to Jason Steen for providing the starting point for this M3-specific config)

If you’d like to run a quick test workflow, try the simple workflow below. Save it as example.nf in the same directory as your nextflow.config.

#!/usr/bin/env nextflow

/*
* A simple Nextflow FastQC example
*/

// Declare commandline options
params.inputPath = file('fastqs')
params.outputPath = file('output')

process fastqc {
    label 'fastqc'
    publishDir "${params.outputPath}", mode: 'copy'
    container "https://depot.galaxyproject.org/singularity/fastqc:0.12.1--hdfd78af_0"
    cpus 2
    memory 8.GB

    input:
    tuple val(sample_name), path(fastqs)

    output:
    tuple val(sample_name), path("*.zip"), emit: zips
    tuple val(sample_name), path("*.html"), emit: html_reports

    script:
    """
    fastqc -t ${task.cpus} ${fastqs[0]} ${fastqs[1]}
    """
}

workflow {

    // Create a channel of input FASTQ files (catching most common filename variations)
    reads_ch = Channel.fromFilePairs(["${params.inputPath}/*_{1,2}.f*q.gz",
                                      "${params.inputPath}/*_{R1,R2}.f*q.gz",
                                      "${params.inputPath}/*_{R1_001,R2_001}.f*q.gz"])

    fastqc(reads_ch)
}

Then run it like:

# Get some example reads
mkdir -p fastqs output
curl https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR550/003/SRR5507343/SRR5507343_1.fastq.gz >fastqs/SRR5507343_1.fastq.gz
curl https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR550/003/SRR5507343/SRR5507343_2.fastq.gz >fastqs/SRR5507343_2.fastq.gz

# Load the Singularity module so container(s) can be pulled
module load singularity

# Run Nextflow
nextflow run example.nf --inputPath=fastqs --outputPath=output

If you are using the nextflow.config as above, jobs will automatically be submitted to the queue (check squeue -u $USER while it’s running).

The ‘work’ directory#

Nextflow puts intermediate files in a directory named work by default. While this directory is required to resume failed pipeline runs, be aware that work can become quite large. Once your pipeline is finished, you should generally delete the work directory to free up space.
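
For example, once you are sure you no longer need to resume the run (paths are relative to the directory you launched NextFlow from):

# Remove intermediate files from previous runs
nextflow clean -f

# Or simply delete the work directory outright
rm -rf work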

nf-core pipelines#

If you are using one of the popular nf-core (https://nf-co.re/) pipelines, you may need to change the resources allocated to specific tasks to suit your data or the available compute resources. This is done using withName and withLabel inside the process {} section of the config. For example:

process {
    withName: 'NFCORE_MAG:MAG:SPADES' {
        cpus = 16
        time = 7.d
        memory = 256.GB
        clusterOptions = '--partition=comp --time=7-00:00:00'

        // Ignore failed tasks so they don't stop the whole pipeline
        //errorStrategy = 'ignore'
    }
}

It’s a good idea to examine the execution report (https://www.nextflow.io/docs/latest/tracing.html#resource-usage) to determine whether your workflow steps are using CPU cores and memory efficiently. If you see a process that typically under-utilises memory or CPU, you can reduce its resource allocation as above; your jobs will start faster and the shared resource will be used more efficiently.
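
To generate these reports, pass NextFlow's tracing options on the command line (the report file names here are just examples):

nextflow run example.nf --inputPath=fastqs --outputPath=output \
    -with-report report.html -with-timeline timeline.html -with-trace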

See “Tuning workflow resources” (https://nf-co.re/docs/usage/configuration#tuning-workflow-resources) for examples of using withName and withLabel in a nextflow.config.

FAQ#

Q: You have version xx and I need version YY, how do I get the software?

A: You should consider installing the software in your home directory with Conda. The Bioconda project helps streamline this process with many pre-packaged tools for bioinformatics. If you are unable to install the version you need, please contact the helpdesk at help@massive.org.au.
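
As a sketch (the package name and version below are placeholders; substitute the tool and version you actually need, and see the Bioconda section above for setting up channels and $CONDA_ENVS):

module load anaconda

# Check which versions Bioconda provides
conda search <package_name>

# Create an environment pinned to the version you need, then activate it
conda create --yes -p $CONDA_ENVS/my_tool_env <package_name>=<version>
conda activate $CONDA_ENVS/my_tool_env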