.. attention:: This documentation is under active development and may change over time as we refine it. Please email ``help@massive.org.au`` if you require assistance, or if you have suggestions to improve this documentation.

.. _bioinformaticsindex:

Bioinformatics
==============

MASSIVE supports the bioinformatics, genomics and translational medicine community with storage and compute services. On these pages we provide examples of workflows and software settings for running common bioinformatics software on M3.

Requesting an account on M3
---------------------------

If you are requesting an account on M3 and are working in collaboration with the Monash Bioinformatics Platform, please indicate this in your application and request that the appropriate Platform members are added to your M3 project. This will enable them to assist with your analysis.

The `Genomics` partition
========================

There is a dedicated partition for high-impact genomics projects. You can apply for access; your application will be reviewed, usually within 2 business days, and you will receive a confirmation email when access is approved. Once approved, follow the instructions provided to start using the `Genomics` partition in your SLURM job scripts.

The `Genomics` partition comprises 960 cores built specifically for bioinformatics jobs with short walltimes. We prioritise short jobs, which account for over 90% of bioinformatics jobs on MASSIVE. The partition is also accessible to all MASSIVE users at a lower priority, as interruptible jobs: use the ``irq`` QOS to run interruptible jobs when the machines are otherwise idle. Please note that if you intend to run interruptible jobs, your script must checkpoint its work, as your job will be re-queued whenever higher-priority jobs arrive.
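As a sketch of what a `Genomics` partition job script might look like, the example below targets the partition with a short walltime. The module names, resource requests and filenames are illustrative assumptions, not M3-specific recommendations - adjust them for your own project.

.. code-block:: bash

    #!/bin/bash
    # Illustrative SLURM job script for the Genomics partition.
    # Module names, resources and filenames are assumptions - adjust for your project.
    #SBATCH --job-name=bwa-align
    #SBATCH --partition=genomics
    #SBATCH --qos=genomics          # swap to --qos=irq to run as a lower-priority, interruptible job
    #SBATCH --time=0-04:00:00       # the Genomics partition targets short (< 4 hour) jobs
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=32G

    module load bwa samtools        # hypothetical module names; check `module avail`

    bwa mem -t "$SLURM_CPUS_PER_TASK" ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -o sample.sorted.bam -

Remember that with ``--qos=irq`` the job may be pre-empted and re-queued, so a real script should write checkpoints it can resume from.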
There is a regular technical review process to understand the usage of this partition, in the form of a meeting held a minimum of twice a year (more frequently if there are matters to address). If you wish to participate in the review process, please indicate your interest via the MASSIVE helpdesk.

Getting started with the Bioinformatics module
----------------------------------------------

Importing the Bioinformatics module environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

M3 has a number of bioinformatics packages available in the default set of modules. These include versions of bwa, bamtools, bcftools, bedtools, GATK, bcl2fastq, BEAST, BEAST2, bowtie, bowtie2, cufflinks, cytoscape, fastx-toolkit, kallisto, macs2, muscle, phyml, picard, qiime2, raxml, rstudio, samtools, star, sra-tools, subread, tophat, varscan and vep (this list should not be regarded as exhaustive!).

A software stack of additional packages (known as `bio-ansible`) is maintained by the Monash Bioinformatics Platform (MBP). Tools are periodically added as required by the user community. Modules maintained by MBP are installed at ``/usr/local2/bioinformatics/``. To access these additional tools, type:

.. code-block:: bash

    source /usr/local2/bioinformatics/bioansible_env.sh

This loads the bio-ansible modules into your environment alongside the default M3 modules. If you use this frequently, you may wish to source it in your ``.profile`` / ``.bash_profile``.

To list all modules:

.. code-block:: bash

    module avail

You will see the additional tools listed under the ``/usr/local2/bioinformatics/software/modules/bio`` section, followed by the default M3 modules.

Installing additional software with Bioconda
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to the pre-installed modules available on M3, `Bioconda <https://bioconda.github.io>`_ provides a streamlined way to install reproducible environments of bioinformatics tools in your home directory.
Conda is already installed on M3 under the `anaconda` module. To set up the channels Bioconda requires:

.. code-block:: bash

    module load anaconda
    conda config --add channels r
    conda config --add channels defaults
    conda config --add channels conda-forge
    conda config --add channels bioconda

These channels will now be listed in ``~/.condarc`` and will be searched when installing packages by name.

.. _conda_create:

Create a Conda environment for your analysis / pipeline / project
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Conda works by installing a set of pre-compiled tools and their dependencies into self-contained 'environments' which you can switch between. Unlike modules loaded via ``module load``, only a single Conda environment can be active at a time. For example, to create an environment ``my_proj_env`` with specific versions of STAR, subread and asciigenome:

.. code-block:: bash

    # Change this to your M3 project ID
    export PROJECT=df22
    export CONDA_ENVS=/projects/$PROJECT/$USER/conda_envs
    mkdir -p $CONDA_ENVS

    module load anaconda
    conda create --yes -p $CONDA_ENVS/my_proj_env star=2.5.4a subread=1.5.2 asciigenome=1.12.0

    # To use the environment, activate it
    conda activate $CONDA_ENVS/my_proj_env

    # To leave the environment, deactivate it
    conda deactivate

You can search for packages on the command line with ``conda search <package_name>``, or on the web using the Bioconda recipes list. For further details, see the official `Bioconda documentation <https://bioconda.github.io>`_.

Pipelines and workflow managers on M3
-------------------------------------

Running NextFlow on M3
^^^^^^^^^^^^^^^^^^^^^^

`NextFlow <https://www.nextflow.io>`_ is a popular workflow manager that helps create portable, reproducible and resumable pipelines. It is well documented and actively developed, and there are an increasing number of great example pipelines available, making it relatively easy to adopt a community-supported one or write your own. NextFlow has good support for SLURM and runs on M3.
Workflows can be made more portable and reproducible by assigning Singularity containers or Conda environments to each task.

Installing NextFlow
"""""""""""""""""""

The official NextFlow installation instructions on the `NextFlow website <https://www.nextflow.io>`_ work on M3, and are a great way to get started quickly. NextFlow can also be installed into a Conda environment (``conda install -c bioconda nextflow`` - see the :ref:`conda_create` section above). Since Nextflow requires Java, this can be an easy way to ensure you are using a supported version of Java.

NextFlow configuration for M3
"""""""""""""""""""""""""""""

Create a ``nextflow.config`` file in the directory containing your ``.nf`` workflow (or create ``~/.nextflow/config`` to make it global for all workflows). Here is a good starting point for M3:

.. code-block:: groovy

    /*
    // If you have access to the genomics partition, this function allows you to try jobs there first,
    // and retries jobs that exceed the 4 hour walltime limit on 'comp'.
    def partition_switch(attempt) {(attempt == 1) ? '--partition=genomics --qos=genomics --time=0-4:00:00' : '--partition=comp --time=7-00:00:00'}
    */

    executor {
        name = 'slurm'
        queueSize = 200
        pollInterval = '10 sec'
        queueStatInterval = '10m'
    }

    singularity {
        enabled = true
        runOptions = '-B /scratch -B /projects -B /fs02 -B /fs03 -B /fs04'
        autoMounts = true
    }

    process {
        executor = 'slurm'
        stageInMode = 'symlink'
        errorStrategy = 'retry'
        maxRetries = 3
        cache = 'lenient'
        beforeScript = 'module load singularity'

        /*
        You may wish to customise `clusterOptions` for your situation.
        `clusterOptions` are like the options you might usually use with #SBATCH -
        but **don't** specify time, memory or cpus / tasks in `clusterOptions` -
        Nextflow does that per-task.
        */

        // If you have a particular M3 project ('account') you want to use, specify it here
        // clusterOptions = '--partition comp --account example123'

        // If you are using the genomics partition, consider using
        // this version of clusterOptions instead
        // clusterOptions = { partition_switch(task.attempt) }
    }

    profiles {
        local {
            executor {
                name = 'local'
                queueSize = 32
                pollInterval = '30 sec'
            }
            process {
                executor = 'local'
                stageInMode = 'symlink'
                errorStrategy = 'retry'
                maxRetries = 5
                cache = 'lenient'
                beforeScript = 'module load singularity/3.5.3'
            }
        }
    }

This config defaults to submitting tasks to the queue using your default SLURM account. For testing, you may wish to use the `local` profile via the NextFlow command-line option ``-profile local`` - this is helpful for interactive testing in an ``smux`` session, but be careful not to accidentally run heavy tasks on the login node! For more detail on tweaking the configuration to your needs, see the `Nextflow configuration docs <https://www.nextflow.io/docs/latest/config.html>`_ and the many examples at `nf-core/configs <https://github.com/nf-core/configs>`_.

*(Thanks to Jason Steen for providing the starting point for this M3-specific config)*

If you'd like to run a quick test workflow, try the simple workflow below. Save it as ``example.nf`` in the same directory as your ``nextflow.config``.

.. code-block:: groovy

    #!/usr/bin/env nextflow

    /*
     * A simple Nextflow FastQC example
     */

    // Declare command-line options
    params.inputPath = file('fastqs')
    params.outputPath = file('output')

    process fastqc {
        label 'fastqc'
        publishDir "${params.outputPath}", mode: 'copy'
        container "https://depot.galaxyproject.org/singularity/fastqc:0.12.1--hdfd78af_0"
        cpus = 2
        memory = 8.GB

        input:
        tuple val(sample_name), path(fastqs)

        output:
        tuple val(sample_name), path("*.zip"), emit: zips
        tuple val(sample_name), path("*.html"), emit: html_reports

        script:
        """
        fastqc -t ${task.cpus} ${fastqs[0]} ${fastqs[1]}
        """
    }

    workflow {
        // Create a channel of input FASTQ files (catching the most common filename variations)
        reads_ch = Channel.fromFilePairs(["${params.inputPath}/*_{1,2}.f*q.gz",
                                          "${params.inputPath}/*_{R1,R2}.f*q.gz",
                                          "${params.inputPath}/*_{R1_001,R2_001}.f*q.gz"])

        fastqc(reads_ch)
    }

Then run it like:

.. code-block:: bash

    # Get some example reads
    mkdir -p fastqs output
    curl https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR550/003/SRR5507343/SRR5507343_1.fastq.gz > fastqs/SRR5507343_1.fastq.gz
    curl https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR550/003/SRR5507343/SRR5507343_2.fastq.gz > fastqs/SRR5507343_2.fastq.gz

    # Load the Singularity module so container(s) can be pulled
    module load singularity

    # Run Nextflow
    nextflow run example.nf --inputPath=fastqs --outputPath=output

If you are using the ``nextflow.config`` above, jobs will automatically be submitted to the queue (check ``squeue -u $USER`` while it's running).

The 'work' directory
""""""""""""""""""""

Nextflow puts intermediate files in a directory named ``work`` by default. While this directory is required to resume failed pipeline runs, be aware that ``work`` can become quite large. Once your pipeline has finished, you should generally delete the ``work`` directory to free up space.
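For example, once you have confirmed that the published outputs are complete, cleanup might look like the sketch below (``nextflow clean`` is Nextflow's built-in alternative, which also tidies the cached run metadata):

.. code-block:: bash

    # Remove the intermediate files once the pipeline has finished successfully
    rm -rf work

    # Alternatively, use Nextflow's built-in cleaner, which also removes the
    # cached metadata for the run (use `nextflow log` to list previous runs):
    # nextflow clean -f

Note that after either form of cleanup, ``nextflow run -resume`` can no longer skip completed tasks, so only clean up once you are sure you will not need to resume.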
nf-core pipelines
"""""""""""""""""

If you are using one of the popular `nf-core <https://nf-co.re>`_ pipelines, you may need to change the resources allocated to specific tasks to suit your data or the available compute resources. This is done using ``withName`` and ``withLabel`` inside the ``process {}`` section of the config. For example:

.. code-block:: groovy

    process {
        withName: 'NFCORE_MAG:MAG:SPADES' {
            cpus = 16
            time = 7.d
            memory = 256.GB
            clusterOptions = '--partition=comp --time=7-00:00:00'
            // Ignore failed tasks so they don't stop the whole pipeline
            //errorStrategy = 'ignore'
        }
    }

It's a good idea to examine the execution report to determine whether your workflow steps are using CPU cores and memory efficiently - if you see that a process typically under-utilises memory or CPU, you can reduce its resource allocation as above. (Why? Your jobs will start faster and the shared resources will be used more efficiently.) See the nf-core "Tuning workflow resources" documentation for examples of using ``withName`` and ``withLabel`` in a ``nextflow.config``.

FAQ
---

Q: You have version `xx` and I need version `YY`; how do I get the software?

A: Consider installing the software in your home directory with Conda. The `Bioconda <https://bioconda.github.io>`_ project helps streamline this process with many pre-packaged tools for bioinformatics. If you are unable to install the version you need, please contact the helpdesk at ``help@massive.org.au``.
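As a sketch of this Conda approach (the tool, environment path and version number below are illustrative assumptions, not recommendations):

.. code-block:: bash

    module load anaconda

    # Pin the exact version you need (an older samtools, in this hypothetical example)
    conda create --yes -p ~/conda_envs/samtools_old samtools=1.9

    conda activate ~/conda_envs/samtools_old
    samtools --version
    conda deactivate

Pinning an explicit version in ``conda create`` also makes the environment easier to reproduce later.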