Attention

This documentation is under active development, meaning that it can change over time as we refine it. Please email help@massive.org.au if you require assistance, or have suggestions to improve this documentation.

Data Collections on M3#

MASSIVE hosts copies of the following reference data sets and large data collections, and is interested in hosting more data sets that are valuable to the community. You can learn more about how these data collections have been downloaded in the CVL Github Repository: https://github.com/Characterisation-Virtual-Laboratory/Data_Collections

Machine learning#

ImageNet 2012 (ILSVRC2012)#

Brief description:
ImageNet is an image database organised according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. This is the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) version of ImageNet.
Date of data release:
May 2012
Date of data download on M3:
2021-03-03
Data access process:
M3 users can access the ILSVRC2012 version of the ImageNet database by registering their acceptance of the terms of access.
Location on M3:

/mnt/reference/imagenet

Link to the source:
ImageNet website

ImageNet 2015 Object Detection Data (ILSVRC2015 DET)#

Brief description:
You can find more details in the development kit readme.txt file located at:

/mnt/reference/imagenet-2015-det/imagenet-2015-det-20211001/development_kit/ILSVRC2015/devkit/readme.txt

“There are 200 synsets in the DET dataset, and the validation and test results are evaluated on these synsets.

The 200 synsets in the DET dataset are part of the larger ImageNet hierarchy. In the training set some object instances have been further labeled as belonging to a particular subcategory – for example, some instances of ‘dog’ in the training set may actually be associated with a more specific ‘fox terrier’ breed label. This is the ‘subcategory’ label of the ‘object’ element in the XML annotation.”
Date of data release:
September 2014
Date of data download on M3:
2021-10-01
Data access process:
M3 users can access the ImageNet 2015 Object Detection Dataset by registering their acceptance of the ImageNet DET terms of access.
Location on M3:

/mnt/reference/imagenet-2015-det

Link to the source:
ImageNet 2015 Website
Data organisation on M3:

.
`-- imagenet-2015-det-20211001
    |-- development_kit
    |   `-- ILSVRC2015
    |       `-- devkit
    |           |-- data
    |           `-- evaluation
    |-- ILSVRC2015_DET
    |   |-- Annotations
    |   |   |-- train
    |   |   |   |-- ILSVRC2013_train
    |   |   |   |-- ILSVRC2014_train_0000
    |   |   |   |-- ILSVRC2014_train_0001
    |   |   |   |-- ILSVRC2014_train_0002
    |   |   |   |-- ILSVRC2014_train_0003
    |   |   |   |-- ILSVRC2014_train_0004
    |   |   |   |-- ILSVRC2014_train_0005
    |   |   |   `-- ILSVRC2014_train_0006
    |   |   `-- val
    |   |-- Data
    |   |   |-- train
    |   |   `-- val
    |   `-- ImageSets
    |-- test
    |   |-- Data
    |   `-- ImageSets
    `-- test-new
        |-- Data
        `-- ImageSets
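
The 'subcategory' label mentioned in the readme excerpt above can be read with a few lines of Python. This is a minimal sketch, assuming the annotation files follow a PASCAL VOC-style layout with `name`, `subcategory` and `bndbox` elements inside each `object`. The sample XML and its synset IDs are placeholders made up for illustration, so please check a real file under the Annotations directory before relying on the field names.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for illustration only; the synset IDs are
# placeholders and real files may carry additional fields.
SAMPLE_XML = """
<annotation>
  <folder>ILSVRC2014_train_0000</folder>
  <filename>example_image</filename>
  <object>
    <name>n00000001</name>
    <subcategory>n00000002</subcategory>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>n00000003</name>
    <bndbox><xmin>5</xmin><ymin>6</ymin><xmax>50</xmax><ymax>60</ymax></bndbox>
  </object>
</annotation>
"""


def read_objects(xml_text):
    """Return one dict per <object>, with the subcategory if present."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append({
            "synset": obj.findtext("name"),
            # findtext returns None when the element is missing
            "subcategory": obj.findtext("subcategory"),
            "box": tuple(int(box.findtext(key))
                         for key in ("xmin", "ymin", "xmax", "ymax")),
        })
    return objects


for item in read_objects(SAMPLE_XML):
    print(item)
```

To use it on the M3 copy, read an annotation file into a string and pass it to `read_objects`. Objects without a subcategory come back with `None` in that field.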

International Skin Imaging Collaboration 2019 (ISIC 2019)#

Brief description:
A repository of dermoscopic images developed by the International Skin Imaging Collaboration (ISIC), both for clinical training and for supporting technical research toward automated algorithmic analysis. 25,331 images are available for training across 8 different categories:
Melanoma

Melanocytic nevus

Basal cell carcinoma

Actinic keratosis

Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis)

Dermatofibroma

Vascular lesion

Squamous cell carcinoma
Date of data release:
August 2019
Date of data download on M3:
2020-10-02
Data access process:
As advised on the ISIC webpage, ISIC 2019 is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit the HPC ID portal page for ISIC 2019 to register your data access request.
Location on M3:

/mnt/reference/isic-2019/

Link to the source:
ISIC 2019 website

NIH Chest X-ray Dataset (NIH CXR-14)#

Brief description:
The NIH Chest X-ray Dataset includes 112,120 frontal-view X-ray images from 30,805 unique patients, with text-mined image labels gathered from radiological reports using natural language processing. There are 14 labels, and images may contain multiple labels. The 14 common thoracic pathologies are:
Atelectasis

Consolidation

Infiltration

Pneumothorax

Edema

Emphysema

Fibrosis

Effusion

Pneumonia

Pleural thickening

Cardiomegaly

Nodule

Mass

Hernia
Date of data release:
2017-09-26
Date of data download on M3:
2020-10-29
Data access process:
As advised on the NIH webpage, the NIH Chest X-ray dataset is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit this link.
Location on M3:

/mnt/reference/nih-cxr-14/

Link to the source:
- The NIH Chest X-ray Dataset homepage
- The NIH Chest X-ray Dataset storage
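
Because an image may carry several of the 14 labels, a common first step is to turn each image's label string into a multi-hot vector. The sketch below assumes the labels for one image arrive as a single string joined by `|`, with `No Finding` for images without any of the 14 pathologies. That matches our understanding of the upstream metadata file, but it is an assumption, so please check the files in the M3 copy. The `normalise` and `encode` helpers are hypothetical names used only for this example.

```python
# The 14 pathologies, in the order listed in this section.
LABELS = [
    "Atelectasis", "Consolidation", "Infiltration", "Pneumothorax",
    "Edema", "Emphysema", "Fibrosis", "Effusion", "Pneumonia",
    "Pleural thickening", "Cardiomegaly", "Nodule", "Mass", "Hernia",
]


def normalise(label):
    """Lower-case and treat underscores as spaces so spellings compare equal."""
    return label.strip().replace("_", " ").lower()


def encode(finding_labels, separator="|"):
    """Return a 14-element list of 0/1 flags for one image's label string.

    The separator is an assumption about the metadata format.
    """
    present = {normalise(part) for part in finding_labels.split(separator)}
    return [1 if normalise(label) in present else 0 for label in LABELS]


print(encode("Cardiomegaly|Effusion"))
print(encode("No Finding"))
```

A string that names none of the 14 pathologies, such as `No Finding`, encodes to all zeros.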

Stanford Natural Language Inference (SNLI) Corpus#

Brief description:
The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.
Date of data release:
August 2015
Date of data download on M3:
2021-04-21
Data access process:
M3 users can access the Stanford Natural Language Inference Corpus by registering their acceptance of the SNLI terms of access.
Location on M3:

/mnt/reference/snli-corpus

Link to the source:
Stanford Natural Language Inference (SNLI) Corpus website
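
As a quick check of the three-way labelling described above, the following sketch counts labels in JSON Lines records. It assumes each line is a JSON object with `gold_label`, `sentence1` and `sentence2` fields, which is how we understand the upstream JSONL distribution to be laid out; the file names and layout of the M3 copy are not confirmed here. The sample sentences are made up for illustration.

```python
import json
from collections import Counter

VALID_LABELS = {"entailment", "contradiction", "neutral"}

# Made-up records for illustration; the last one has no consensus label.
SAMPLE_LINES = [
    '{"gold_label": "entailment", "sentence1": "A dog runs.", "sentence2": "An animal moves."}',
    '{"gold_label": "contradiction", "sentence1": "A dog runs.", "sentence2": "The dog sleeps."}',
    '{"gold_label": "neutral", "sentence1": "A dog runs.", "sentence2": "The dog is brown."}',
    '{"gold_label": "-", "sentence1": "A dog runs.", "sentence2": "It is a pet."}',
]


def label_counts(lines):
    """Count the three NLI labels, skipping blank lines and other labels."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("gold_label") in VALID_LABELS:
            counts[record["gold_label"]] += 1
    return counts


print(label_counts(SAMPLE_LINES))
```

Passing an open file object instead of the sample list works the same way, because the function only iterates over lines.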

COCO (Common Objects in Context) 2017#

Brief description:
COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:
Object segmentation

Recognition in context

Superpixel stuff segmentation

330K images (>220K labeled)

1.5 million object instances

80 object categories

91 stuff categories

5 captions per image

250,000 people with keypoints
Date of data release:
September 2017
Date of data download on M3:
2021-07-13
Data access process:
M3 users can access the COCO 2017 dataset by registering their acceptance of the MS COCO terms of access.
Location on M3:

/mnt/reference/coco-2017

Data organisation on M3:

COCO includes directories containing thousands of images. Directories which contain long lists of images will contain a file called imagelist.txt with a list of images in the directory. To reduce strain on the filesystem when navigating this data, we have organised large directories into subdirectories of around 40,000 files as follows:

/mnt/reference/coco-2017/
`-- coco-2017_20210713
    |-- coco_data_use_terms_and_conditions.txt
    |-- README
    `-- data
        |-- annotations
        |   |-- image_info_test_2017
        |   |-- image_info_unlabeled
        |   |-- panoptic_annotations_trainval2017
        |   |-- stuff_annotations_trainval2017
        |   `-- trainval2017
    `-- images
        |-- test2017
        |-- train2017
        |   |-- train2017-1
        |   |-- train2017-2
        |   `-- train2017-3
        |-- unlabeled2017
        |   |-- unlabeled2017-1
        |   |-- unlabeled2017-2
        |   `-- unlabeled2017-3
        `-- val2017

Link to the source:
COCO website
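
The imagelist.txt files described under "Data organisation on M3" let you build a list of image paths without asking the filesystem to enumerate a directory of tens of thousands of files. The sketch below assumes each imagelist.txt holds one file name per line, relative to the directory that contains it. That format is an assumption, so please check one of the files first. The demo builds a small temporary tree that mimics the train2017 subdirectories rather than touching /mnt/reference.

```python
import tempfile
from pathlib import Path


def images_from_lists(split_dir, parts):
    """Collect image paths from imagelist.txt in each named subdirectory.

    Opening the list files directly avoids listing the large directories.
    """
    paths = []
    for part in parts:
        part_dir = Path(split_dir) / part
        with open(part_dir / "imagelist.txt") as handle:
            for line in handle:
                name = line.strip()
                if name:
                    paths.append(part_dir / name)
    return paths


# Demo on a temporary tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"train2017-1": ["a.jpg", "b.jpg"], "train2017-2": ["c.jpg"]}
    for part, names in layout.items():
        part_dir = Path(tmp) / "train2017" / part
        part_dir.mkdir(parents=True)
        (part_dir / "imagelist.txt").write_text("\n".join(names) + "\n")
    found = images_from_lists(Path(tmp) / "train2017", list(layout))
    found_names = [path.name for path in found]
    print(found_names)
```

On M3 you would pass the real split directory and its subdirectory names, for example train2017-1 to train2017-3 as shown in the tree above.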

AlphaFold#

Note

The AlphaFold dataset also includes model parameters. When you accept the terms and conditions of use in the HPC ID portal, take note that the AlphaFold parameters are made available for non-commercial use only, under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode. It is unclear if this license extends to cover outputs generated by AlphaFold.

Brief description:
AlphaFold is an inference pipeline, and the CASP14 implementation was recently submitted to Nature. We have downloaded the datasets required to run the implementation hosted in the DeepMind AlphaFold Github repository. This includes multiple genetic sequence databases, including:
UniRef90

MGnify

BFD

Uniclust30

PDB70

PDB

The parameters required to run the AlphaFold model.
You can run the AlphaFold model on this data by using:

module load alphafold/0bab1bf

Date of data release:
This varies depending on each individual data set. We used the AlphaFold data download script scripts/download_all_data.sh from their Github repository.
Date of data download on M3:
2021-07-26
Data access process:
M3 users can access the AlphaFold dataset by registering their acceptance of the AlphaFold terms of access.
Location on M3:

/mnt/reference/alphafold/alphafold_20210726

Link to the source:
https://github.com/deepmind/alphafold

AlphaFold v2 - AlphaFold-Multimer release#

Brief description:
AlphaFold v2.0 is an inference pipeline. This is a completely new model that was entered in CASP14 and published in Nature. We have downloaded the datasets required to run the implementation hosted in the DeepMind AlphaFold Github repository. This includes multiple genetic sequence databases, including:
UniRef90

MGnify

BFD

Uniclust30

PDB70

PDB

The parameters required to run the AlphaFold model.
You can run the AlphaFold model on this data by using the currently hidden module on M3:

module load alphafold/.2.1.1

Date of data release:
This varies depending on each individual data set. We used the AlphaFold data download script scripts/download_all_data.sh from their Github repository, and then followed the instructions under Updating existing AlphaFold installation to include AlphaFold-Multimers.
Date of data download on M3:
2021-11-29
Data access process:
M3 users can access the AlphaFold v2 dataset by registering their acceptance of the AlphaFold v2 terms of access.
Location on M3:

/mnt/reference/alphafold/alphafold_20211129

Link to the source:
https://github.com/deepmind/alphafold

Neuroimaging#

Human Connectome Project Dataset (HCP): HCP-1200#

Brief description:
A copy of the following HCP data release is available on M3:
The 1200 Subjects Release (S1200) that includes behavioural and 3T MR imaging data from 1206 healthy young adult participants (1113 with structural MR scans) collected in 2012-2015. In addition to 3T MR scans, 184 subjects have multimodal 7T MRI scan data and 95 subjects also have some resting-state MEG (rMEG) and/or task MEG (tMEG) data available. For the first time, 3T MRI and behavioural retest data for 46 subjects is also available.
NOTE: MASSIVE centrally manages DTI preprocessed data (~38TB) for the HCP1200 dataset. This data was processed on MASSIVE and published by researchers at Monash University. If you use this data, please make sure to cite (Oldham et al., 2019) and (Arnatkevičiūtė et al., 2021). We encourage M3 users to reuse the processed data rather than regenerating the same dataset, to save processing time and storage space. A document describing the processing steps, along with the related scripts used to generate this dataset, can be found here.
Date of data release:
2017-03-01
Data access process:
This data is restricted to M3 users that have their email registered with HCP (i.e. an account) AND have accepted the HCP terms and conditions associated with the datasets. The following outlines the process of getting access to this data:
- Create an Account at HCP
  
  Follow the “Get Data” link to: http://www.humanconnectome.org/data/
  
  Select “Log in to ConnectomeDB”
  
  Select “Register”
  
  Create an account and make sure that your email address matches the email associated with your MASSIVE account
  
  You will receive an email to validate your account
- Accept HCP data use Terms and Conditions
  
  Login to https://db.humanconnectome.org/
  
  Accept terms and conditions using the “Data Use Terms Required” button on all 3 datasets (WU-Minn, WU-Minn HCP Lifespan Pilot and MGH HCP Adult Diffusion)
- Request Access to HCPdata at MASSIVE
  If you have completed the Human Connectome Project steps above you can request access with this link: HCPData. We will verify your MASSIVE email against the HCP site and grant access.
Location on M3:

/mnt/reference2/hcp1200
/projects/hcp1200_processed/2021

Tips on using data on M3:
- HCP-1200 data is provided in a compressed format. We have put together an example script which is reasonably optimised to work with compressed data on High-Performance Computing hardware. Please see the script in the following CVL GitHub repository, and read the relevant notes and comments in the README.md file. Using this script and the provided hints, HCP data can be uncompressed in a reasonable time without storing files in your project’s /scratch directory.
- You can create symbolic links to the data locations /mnt/reference2/hcp1200 and /projects/hcp1200_processed/2021 in your home directory using:
  ln -s /mnt/reference2/hcp1200 ~/hcp1200
  ln -s /projects/hcp1200_processed/2021 ~/hcp1200_processed_2021
Link to the source:
- Human Connectome Project Data Releases
- Link to the 1200 Subjects Data Release reference manual.
Notes to consider:
- The HCP-1200 data version that is on M3 has major issues with the preprocessed 7T fMRI data, as recognised and reported by the HCP community in this HCP issue. Hence, we strongly recommend not using the 7T fMRI data in this data collection. We will remove the questionable data from this data collection and replace it with the reprocessed 7T fMRI data that was released in 2018, which fixes the issues as described in this link. We will update this document when the update is complete.
- The Connectome Workbench is available via the MASSIVE Desktop. Requests for updates to this or other software can be directed to help@massive.org.au

Lifespan Human Connectome Project Development#

Brief description:
Early lifespan data from ages 5-21, including behavioural and neuroimaging data. The following subset of HCP Development data is available to M3 users with permission to access the Lifespan HCP Development data collection:
- Structural Preprocessed FreeSurfer
- Structural Preprocessed Extended
- Resting State fMRI Preprocessed Extended
- Behavioral Data
Data version:
2.0
Date of data release:
2021-02-26
Date of data download on M3:
May 2021
Data access process:
- This data collection is restricted to M3 users who provide a valid Data Use Certificate (DUC) from the National Data Archive (NDA).
- Access instructions, including how to create an account with NIH Data Archive, how to submit a request form including project details, and waiting for NDA approval, are described by the data custodian here
- When you receive DUC from NDA, please forward it via email to help@massive.org.au.
- This data has an expiry date. To keep your access to this data collection on M3, apply for a data access extension from NDA.
- MASSIVE will keep this data on the storage as long as there is a user or a research group with “valid and active” access approval.
Location on M3:

/mnt/reference-restricted/hcp_lifespan/hcp_development_v2

Link to the source:
- HCP lifespan Development and Link to NDA page

Lifespan Human Connectome Project Aging#

Brief description:
Later lifespan data from ages 36-100+, including behavioural and neuroimaging data. The following subset of HCP Aging data is available to M3 users with permission to access the Lifespan HCP Aging data collection:
- Structural Preprocessed FreeSurfer
- Structural Preprocessed Extended
- Resting State fMRI Preprocessed Extended
- Diffusion Unprocessed
- Behavioral Data
Data version:
2.0
Date of data release:
2021-02-26
Date of data download on M3:
May 2021
Data access process:
- This data collection is restricted to M3 users who provide a valid Data Use Certificate (DUC) from the National Data Archive (NDA).
- Access instructions, including how to create an account with NIH Data Archive, how to submit a request form including project details, and waiting for NDA approval, are described by the data custodian here
- When you receive DUC from NDA, please forward it by email to help@massive.org.au.
- This data has an expiry date. To keep your access to this data collection on M3, apply for a data access extension from NDA.
- MASSIVE will keep this data on the storage as long as there is a user or a research group with “valid and active” access approval.
Location on M3:

/mnt/reference-restricted/hcp_lifespan/hcp_aging_v2

Link to the source:
- HCP lifespan Aging and Link to NDA page

Baby Connectome Project#

Brief description:
A four-year study of children from birth through five years of age, intended to provide a better understanding of how the brain develops from infancy through early childhood and the factors that contribute to healthy brain development. (T1, rs-fMRI, DTI)
Data version:
1.0 release
Date of data download on M3:
Dec 2021
Data access process:
- This data collection is restricted to M3 users who provide a valid Data Use Certificate (DUC) from the National Data Archive (NDA).
- Access instructions, including how to create an account with NIH Data Archive, how to submit a request form including project details, and waiting for NDA approval, are described by the data custodian here
- When you receive DUC from NDA, please forward it by email to help@massive.org.au.
- This data has an expiry date. To keep your access to this data collection on M3, apply for a data access extension from NDA.
- MASSIVE will keep this data on the storage as long as there is a user or a research group with “valid and active” access approval.
Location on M3:

/mnt/reference-restricted/hcp_baby/hcp_baby_20211210

Link to the source:
- Baby Connectome Project and Link to NDA page

Human Connectome Project for Early Psychosis#

Brief description:
Clinical data for individuals with early phase psychosis within 5 years of onset, including behavioural and neuroimaging data. The following data is available on M3 for M3 users with permission to access the HCP Early Psychosis data collection:
- Unprocessed data of all modalities (structural MRI, resting-state fMRI, and diffusion MRI) for 183 subjects
- Minimally preprocessed structural MRI data for 169 subjects
- Clinical and behavioral data for 251 subjects (68 more than were released previously)
Data version:
1.1 release
Date of data release:
2021-08-19
Date of data download on M3:
Aug 2021
Data access process:
- This data collection is restricted to M3 users who provide a valid Data Use Certificate (DUC) from the National Data Archive (NDA).
- Access instructions, including how to create an account with NIH Data Archive, how to submit a request form including project details, and waiting for NDA approval, are described by the data custodian here
- When you receive DUC from NDA, please forward it by email to help@massive.org.au.
- This data has an expiry date. To keep your access to this data collection on M3, apply for a data access extension from NDA.
- MASSIVE will keep this data on the storage as long as there is a user or a research group with “valid and active” access approval.
Location on M3:

/mnt/reference-restricted/hcp_ep/hcp_ep_v1.1

Link to the source:
- HCP for Early Psychosis and Link to NDA page

Developing Human Connectome Project (dHCP)#

Brief description:
This data collection consists of images of 783 neonatal subjects (886 datasets). The imaging data includes structural imaging, structural connectivity data (diffusion MRI) and functional connectivity data (resting-state fMRI). This data release comes with minimal accompanying metadata: sex, age at birth, age at scan, birthweight, head circumference and radiology score. More information about available data in this collection can be found here.
Data version:
3.1 release; including patch1 with previously missing data
Date of data release:
June 2021
Date of data download on M3:
Aug 2021
Data access process:
- To access this data, you will need to register at https://data.developingconnectome.org/app/template/Login.vm. On the dHCP data login page, select ‘Register’ and fill out the mandatory fields including Lab/Department, Institute Full Name, Country and your institutional email address.
- You will receive a verification email from the data custodian. After you verify your email address, you will be able to log in to sign an agreement and access the public face of the dHCP database (db).
- Please forward the access approval email to help@massive.org.au.
- If you have completed the above steps you can request access to this data on M3 with this link: Developing Human Connectome Project (dHCP).
Location on M3:

/mnt/reference2/developing_hcp

Link to the source:
- Developing HCP and Link to the third data release

Brain Genomics Superstruct Project (GSP)#

Brief description:
Large scale imaging data sets are necessary to address complex questions regarding the relationship between brain and behavior. The Brain Genomics Superstruct Project Open Access Data Release exposes a carefully vetted collection of neuroimaging, behavior, cognitive, and personality data for over 1,500 human participants. Each neuroimaging data set includes one high-resolution Magnetic Resonance Imaging (MRI) acquisition and one or more resting-state functional MRI acquisitions. Each functional acquisition is accompanied by a fully-automated quality assessment, and pre-computed brain morphometrics are also provided. The imaging data are stored in 10 separate tar files, each containing 157 subjects. There is a single description .csv file that contains the demographic and phenotype data for all 1570 unique subjects. All 10 tar files have been downloaded to obtain the full n=1570 dataset. The tar files have also been extracted to help users with data processing.
Data version:
10.5
Date of data release:
2020-03-09
Date of data download on M3:
2020-09-10
Data access process:
- Be sure to read through and accept the GSP terms and conditions.
- Please follow the instructions on the GSP website to create a Dataverse account, go to the GSP Dataverse page (data version = 10.5), and click on request access next to each restricted file to get approval from the source to access this data set. It may take a few days for the source to grant access.
- Please forward the access approval email to help@massive.org.au.
- If you have completed the above steps you can request access to this data on M3 with this link: Brain Genomics Superstruct Project (GSP).
- We will review your approval email and grant access.
Location on M3:

/scratch/gsp/gsp-20200910

Tips on using data on M3:
You can create a symbolic link to /scratch/gsp/gsp-20200910 in your home directory using:
```
ln -s /scratch/gsp/gsp-20200910 ~/gsp-20200910
```
Link to the source:
- GSP at Harvard University, Neuroinformatics Research Group
- GSP data in the Dataverse Harvard. Metadata can be downloaded from the same link.

Nathan Kline Institute Rockland Sample (NKI-RS): Neuroimaging Release#

Brief description:
The Rockland Sample currently comprises data from four studies; please see the NITRC website for information about those studies. The NKI-RS Neuroimaging Release contains imaging data, physiological data acquired during scan acquisition (cardiac and respiratory), and limited phenotyping (age, sex, and handedness). No psychiatric, cognitive, or behavioral information is included. On this webpage you can find more information about the scans that are included for subjects in the Cross-Sectional Lifespan Connectomics Study, Longitudinal Developmental Connectomics Study, and Mapping Interindividual Variation In The Aging Connectome studies. This is the latest data release of raw data organized in the BIDS format. This folder includes data from all the releases. Since BIDS makes provisions for phenotypic data and data collected during scanning (physiological, event-related), this data is also included in this folder in addition to the MRI series NIfTIs. DICOMs are not included.
Data version:
RawDataBidsLatest
Date of data release:
December 2019
Date of data download on M3:
2020-08-19
Data access process:
As advised on the NITRC website, Neuroimaging Release is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit the HPC ID portal to register your data access request.
Location on M3:

/scratch/nkirs-ndr/RawDataBidsLatest_20200819

Tips on using data on M3:
You can create a symbolic link to /scratch/nkirs-ndr/RawDataBidsLatest_20200819 in your home directory using:

ln -s /scratch/nkirs-ndr/RawDataBidsLatest_20200819 ~/RawDataBidsLatest_20200819

Link to the source:
- Enhanced Nathan Kline Institute - Rockland Sample
- Link to data documentation can be found on the NITRC documentation page.
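
Since the release is organised in BIDS, file names encode their metadata as `key-value` pairs separated by underscores, followed by a suffix such as `bold` or `T1w`. The sketch below splits a name into those parts. The example file name is hypothetical and the `parse_bids_name` helper is our own illustration, not part of any BIDS tool. It assumes well-formed names and does no validation.

```python
def parse_bids_name(filename):
    """Split a BIDS-style file name into (entities, suffix).

    Everything after the first dot is treated as the extension and dropped.
    """
    stem = filename.split(".", 1)[0]
    *pairs, suffix = stem.split("_")
    # Each pair looks like "sub-01"; split on the first hyphen only.
    entities = dict(pair.split("-", 1) for pair in pairs)
    return entities, suffix


# Hypothetical file name used only to show the output shape.
entities, suffix = parse_bids_name("sub-01_ses-BAS1_task-rest_bold.nii.gz")
print(entities, suffix)
```

For real projects a dedicated BIDS library is usually a better choice, because it also handles the dataset-level metadata files.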

Genomes#

BlastDB#

Brief description:
BLAST search pages under the Basic BLAST section of the NCBI BLAST home page use a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases are made available as compressed archives in pre-formatted form. The FASTA files reside under the /FASTA subdirectory.
Data access process:
There are no restrictions on access to this database. M3 users can access this data by registering their acceptance of the BlastDB terms of access on M3.
Location on M3:

/scratch/blastdb/FASTA

Link to the source:
BLAST software and databases

Requesting a data collection#

MASSIVE is interested in hosting reference data or collections that are valuable to the community. If you would like us to consider hosting a data collection please email help@massive.org.au with the following details:

data collection name

URL for download

size of data

urgency

if there are any ethics or licensing restrictions we need to impose

details of the community or other users who would find this collection useful