Attention

This documentation is under active development, meaning that it can change over time as we refine it. Please email help@massive.org.au if you require assistance, or have suggestions to improve this documentation.

Data Collections on M3

MASSIVE hosts copies of the following reference data sets and large data collections, and is interested in hosting more data sets that are valuable to the community. You can learn more about how these data collections were downloaded in the CVL GitHub repository: https://github.com/Characterisation-Virtual-Laboratory/Data_Collections

Machine learning

ImageNet 2012 (ILSVRC2012)

  • Brief description:

    ImageNet is an image database organised according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. This is the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) version of ImageNet.

  • Date of data release:

    May 2012

  • Date of data download on M3:

    2021-03-03

  • Data access process:

    M3 users can access the ILSVRC2012 version of the ImageNet database by registering their acceptance of the terms of access.

  • Location on M3:

/mnt/reference/imagenet

ImageNet 2015 Object Detection Data (ILSVRC2015 DET)

  • Brief description:

    You can find more details in the development kit readme.txt file located at:

    /mnt/reference/imagenet-2015-det/imagenet-2015-det-20211001/development_kit/ILSVRC2015/devkit/readme.txt

    “There are 200 synsets in the DET dataset, and the validation and test results are evaluated on these synsets.

    The 200 synsets in the DET dataset are part of the larger ImageNet hierarchy. In the training set some object instances have been further labeled as belonging to a particular subcategory – for example, some instances of ‘dog’ in the training set may actually be associated with a more specific ‘fox terrier’ breed label. This is the ‘subcategory’ label of the ‘object’ element in the XML annotation.”

  • Date of data release:

    September 2014

  • Date of data download on M3:

    2021-10-01

  • Data access process:

    M3 users can access the ImageNet 2015 Object Detection Dataset by registering their acceptance of the ImageNet DET terms of access.

  • Location on M3:

/mnt/reference/imagenet-2015-det
.
`-- imagenet-2015-det-20211001
    |-- development_kit
    |   `-- ILSVRC2015
    |       `-- devkit
    |           |-- data
    |           `-- evaluation
    |-- ILSVRC2015_DET
    |   |-- Annotations
    |   |   |-- train
    |   |   |   |-- ILSVRC2013_train
    |   |   |   |-- ILSVRC2014_train_0000
    |   |   |   |-- ILSVRC2014_train_0001
    |   |   |   |-- ILSVRC2014_train_0002
    |   |   |   |-- ILSVRC2014_train_0003
    |   |   |   |-- ILSVRC2014_train_0004
    |   |   |   |-- ILSVRC2014_train_0005
    |   |   |   `-- ILSVRC2014_train_0006
    |   |   `-- val
    |   |-- Data
    |   |   |-- train
    |   |   `-- val
    |   `-- ImageSets
    |-- test
    |   |-- Data
    |   `-- ImageSets
    `-- test-new
        |-- Data
        `-- ImageSets

International Skin Imaging Collaboration 2019 (ISIC 2019)

  • Brief description:
    A repository of dermoscopic images developed by the International Skin Imaging Collaboration (ISIC), intended both for clinical training and for supporting technical research toward automated algorithmic analysis. 25,331 images are available for training across 8 different categories:
    • Melanoma

    • Melanocytic nevus

    • Basal cell carcinoma

    • Actinic keratosis

    • Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis)

    • Dermatofibroma

    • Vascular lesion

    • Squamous cell carcinoma

  • Date of data release:

    August 2019

  • Date of data download on M3:

    2020-10-02

  • Data access process:

    As advised on the ISIC webpage, ISIC 2019 is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit the HPC ID portal page for ISIC 2019 to register your data access request.

  • Location on M3:

/mnt/reference/isic-2019/

NIH Chest X-ray Dataset (NIH CXR-14)

  • Brief description:
    The NIH Chest X-ray Dataset includes 112,120 frontal-view X-ray images from 30,805 unique patients, with text-mined image labels gathered from radiological reports using natural language processing. There are 14 labels, and an image may carry multiple labels. The 14 common thoracic pathologies are:
    • Atelectasis

    • Consolidation

    • Infiltration

    • Pneumothorax

    • Edema

    • Emphysema

    • Fibrosis

    • Effusion

    • Pneumonia

    • Pleural thickening

    • Cardiomegaly

    • Nodule

    • Mass

    • Hernia

  • Date of data release:

    2017-09-26

  • Date of data download on M3:

    2020-10-29

  • Data access process:

    As advised on the NIH webpage, the NIH Chest X-ray dataset is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit this link.

  • Location on M3:

/mnt/reference/nih-cxr-14/

Stanford Natural Language Inference (SNLI) Corpus

  • Brief description:

    The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.

  • Date of data release:

    August 2015

  • Date of data download on M3:

    2021-04-21

  • Data access process:

    M3 users can access the Stanford Natural Language Inference Corpus by registering their acceptance of the SNLI terms of access.

  • Location on M3:

/mnt/reference/snli-corpus

COCO (Common Objects in Context) 2017

  • Brief description:
    COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:
    • Object segmentation

    • Recognition in context

    • Superpixel stuff segmentation

    • 330K images (>220K labeled)

    • 1.5 million object instances

    • 80 object categories

    • 91 stuff categories

    • 5 captions per image

    • 250,000 people with keypoints

  • Date of data release:

    September 2017

  • Date of data download on M3:

    2021-07-13

  • Data access process:

    M3 users can access the COCO 2017 dataset by registering their acceptance of the MS COCO terms of access.

  • Location on M3:

/mnt/reference/coco-2017
  • Data organisation on M3:

    COCO includes directories containing thousands of images. Directories which contain long lists of images will contain a file called imagelist.txt with a list of images in the directory. To reduce strain on the filesystem when navigating this data, we have organised large directories into subdirectories of around 40,000 files as follows:

    /mnt/reference/coco-2017/
    `-- coco-2017_20210713
        |-- coco_data_use_terms_and_conditions.txt
        |-- README
        `-- data
            |-- annotations
            |   |-- image_info_test_2017
            |   |-- image_info_unlabeled
            |   |-- panoptic_annotations_trainval2017
            |   |-- stuff_annotations_trainval2017
            |   `-- trainval2017
        `-- images
            |-- test2017
            |-- train2017
            |   |-- train2017-1
            |   |-- train2017-2
            |   `-- train2017-3
            |-- unlabeled2017
            |   |-- unlabeled2017-1
            |   |-- unlabeled2017-2
            |   `-- unlabeled2017-3
            `-- val2017
    
  • Link to the source:

    COCO website
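The imagelist.txt files described under "Data organisation on M3" can be used to build a list of image paths without listing the large image directories themselves. The following is a minimal sketch, not an official M3 tool: the function name paths_from_imagelists is hypothetical, and the sketch assumes that each imagelist.txt holds one file name per line, relative to the directory that contains it (the format of these files is not documented here, so please check one before relying on this).

```shell
# Print one full path per image by reading the imagelist.txt files found in a
# directory and in its immediate subdirectories.
# ASSUMPTION: each imagelist.txt holds one file name per line, relative to
# the directory that contains it.
paths_from_imagelists() {
    local root="$1" list dir name
    for list in "$root"/imagelist.txt "$root"/*/imagelist.txt; do
        # Skip locations where no imagelist.txt exists
        [ -f "$list" ] || continue
        dir=$(dirname "$list")
        # The "|| [ -n ... ]" part keeps a final line that lacks a newline
        while IFS= read -r name || [ -n "$name" ]; do
            if [ -n "$name" ]; then
                printf '%s/%s\n' "$dir" "$name"
            fi
        done < "$list"
    done
}
```

For example, calling paths_from_imagelists on the train2017 directory shown in the tree above and redirecting the output to a file in your own project space would give a single path list to feed to a data loader, assuming the imagelist.txt files are laid out as described.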

AlphaFold

Note

The AlphaFold dataset also includes model parameters. When you accept the terms and conditions of use in the HPC ID portal, take note that the AlphaFold parameters are made available for non-commercial use only, under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode. It is unclear if this license extends to cover outputs generated by AlphaFold.

  • Brief description:
    AlphaFold is an inference pipeline, and the CASP14 implementation was recently submitted to Nature. We have downloaded the datasets required to run the implementation hosted in the DeepMind AlphaFold GitHub repository. These include multiple genetic sequence databases and the model parameters:
    • UniRef90

    • MGnify

    • BFD

    • Uniclust30

    • PDB70

    • PDB

    • The parameters required to run the AlphaFold model.

    You can run the AlphaFold model on this data by using:

module load alphafold/0bab1bf
/mnt/reference/alphafold/alphafold_20210726

AlphaFold v2 - AlphaFold-Multimer release

Note

The AlphaFold dataset also includes model parameters. When you accept the terms and conditions of use in the HPC ID portal, take note that the AlphaFold parameters are made available for non-commercial use only, under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode. It is unclear if this license extends to cover outputs generated by AlphaFold.

  • Brief description:
    AlphaFold v2.0 is an inference pipeline. This is a completely new model that was entered in CASP14 and published in Nature. We have downloaded the datasets required to run the implementation hosted in the DeepMind AlphaFold GitHub repository. These include multiple genetic sequence databases and the model parameters:
    • UniRef90

    • MGnify

    • BFD

    • Uniclust30

    • PDB70

    • PDB

    • The parameters required to run the AlphaFold model.

    You can run the AlphaFold model on this data by using the currently hidden module on M3:

module load alphafold/.2.1.1
/mnt/reference/alphafold/alphafold_20211129
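Before starting a long AlphaFold job, it can help to confirm that the databases listed above are visible from your session. The following is a minimal sketch under stated assumptions: the function name missing_subdirs is hypothetical, and the subdirectory names in the commented usage line follow the download scripts in the upstream AlphaFold repository; the layout of the copy on M3 is not documented here and may differ.

```shell
# Print the name of each expected subdirectory that is missing under BASE.
# Usage: missing_subdirs BASE NAME...
missing_subdirs() {
    local base="$1" name
    shift
    for name in "$@"; do
        if [ ! -d "$base/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Example call (commented out so the snippet can be sourced anywhere).
# ASSUMPTION: these names match the upstream download scripts, not
# necessarily the layout on M3.
# missing_subdirs /mnt/reference/alphafold/alphafold_20211129 \
#     bfd mgnify params pdb70 pdb_mmcif uniclust30 uniref90
```

No output means every named subdirectory was found. Any name printed is either absent or stored under a different name in the M3 copy.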

Neuroimaging

Human Connectome Project Dataset (HCP): HCP-1200

  • Brief description:
    A copy of the following HCP data release is available on M3:
    • The 1200 Subjects Release (S1200) that includes behavioural and 3T MR imaging data from 1206 healthy young adult participants (1113 with structural MR scans) collected in 2012-2015. In addition to 3T MR scans, 184 subjects have multimodal 7T MRI scan data and 95 subjects also have some resting-state MEG (rMEG) and/or task MEG (tMEG) data available. For the first time, 3T MRI and behavioural retest data for 46 subjects is also available.

    NOTE: MASSIVE centrally manages a DTI-preprocessed version of the HCP1200 dataset, which is ~38 TB in size. This data was processed on MASSIVE and published by researchers at Monash University. If you use this data, please make sure to cite (Oldham et al., 2019) and (Arnatkevičiūtė et al., 2021). We encourage M3 users to reuse this processed data rather than reproducing the same dataset, to save both processing time and storage space. Here you can find a document that describes the processing steps, along with the related scripts used to generate this dataset.

  • Date of data release:

    2017-03-01

  • Data access process:

    This data is restricted to M3 users who have their email registered with HCP (i.e. an account) AND have accepted the HCP terms and conditions associated with the datasets. The following outlines the process of getting access to this data:

    • Create an Account at HCP
      1. Follow the “Get Data” link to: http://www.humanconnectome.org/data/

      2. Select “Log in to ConnectomeDB”

      3. Select “Register”

      4. Create an account and make sure that your email address matches the email associated with your MASSIVE account

      5. You will receive an email to validate your account

    • Accept HCP data use Terms and Conditions
      1. Login to https://db.humanconnectome.org/

      2. Accept terms and conditions using the “Data Use Terms Required” button on all 3 datasets (WU-Minn, WU-Minn HCP Lifespan Pilot and MGH HCP Adult Diffusion)

    • Request Access to HCPdata at MASSIVE

      If you have completed the Human Connectome Project steps above you can request access with this link: HCPData. We will verify your MASSIVE email against the HCP site and grant access.

  • Location on M3:

/mnt/reference2/hcp1200
/projects/hcp1200_processed/2021
  • Tips on using data on M3:
    • HCP-1200 data is provided in a compressed format. We have put together an example script that is reasonably optimised to work with compressed data on High-Performance Computing hardware. Please see the script in the following CVL GitHub repository, and consider the relevant notes and comments in its README.md file. Using this script and the provided hints, HCP data can be uncompressed in a reasonable time without storing files in your project’s /scratch directory.

    • You can create symbolic links to the data locations /mnt/reference2/hcp1200 and /projects/hcp1200_processed/2021 in your home directory using:

      ln -s /mnt/reference2/hcp1200 ~/hcp1200
      ln -s /projects/hcp1200_processed/2021 ~/hcp1200_processed_2021
      
  • Link to the source:
  • Notes to consider:
    • The HCP-1200 data version on M3 has major issues with the preprocessed 7T fMRI data, as recognised and reported by the HCP community in this HCP issue. Hence, we strongly recommend not using the 7T fMRI data in this data collection. We will remove the questionable data from this data collection and replace it with the reprocessed 7T fMRI data that was released in 2018, a version that fixes the issues as described in this link. We will update this document when the update is complete.

    • The Connectome Workbench is available via the MASSIVE Desktop. Requests for updates to this or other software can be directed to help@massive.org.au.
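Once your access request has been processed, one way to confirm from a shell on M3 that the HCP locations are usable is to test whether each directory can be listed and entered. This is a minimal sketch: the function name check_readable is hypothetical, and the check only reports what your current session can see. It does not show how access is granted on M3.

```shell
# Report, for each directory given, whether the current user can list it
# (read permission) and enter it (execute permission).
check_readable() {
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -r "$dir" ] && [ -x "$dir" ]; then
            printf 'ok: %s\n' "$dir"
        else
            printf 'no access: %s\n' "$dir"
        fi
    done
}

# Example call (commented out so the snippet can be sourced anywhere):
# check_readable /mnt/reference2/hcp1200 /projects/hcp1200_processed/2021
```

A "no access" line can mean either that the directory does not exist on the node you are using or that your account has not yet been granted access to it.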

Lifespan Human Connectome Project Development

  • Brief description:

    Early lifespan data from ages 5-21, including behavioural and neuroimaging data. The following subset of the HCP Development data is available to M3 users with permission to access the Lifespan HCP Development data collection:

    • Structural Preprocessed FreeSurfer

    • Structural Preprocessed Extended

    • Resting State fMRI Preprocessed Extended

    • Behavioral Data

  • Data version:

    2.0

  • Date of data release:

    2021-02-26

  • Date of data download on M3:

    May 2021

  • Data access process:
    • This data collection is restricted to M3 users who provide a valid Data Use Certificate (DUC) from the NIMH Data Archive (NDA).

    • Access instructions, including how to create an account with the NDA, how to submit a request form including project details, and how to wait for NDA approval, are described by the data custodian here.

    • When you receive your DUC from the NDA, please forward it by email to help@massive.org.au.

    • Access to this data has an expiry date. To keep your access to this data collection on M3, apply for a data access extension from the NDA.

    • MASSIVE will keep this data on the storage as long as there is a user or a research group with “valid and active” access approval.

  • Location on M3:

/mnt/reference-restricted/hcp_lifespan/hcp_development_v2

Lifespan Human Connectome Project Aging

  • Brief description:

    Later lifespan data from ages 36-100+, including behavioural and neuroimaging data. The following subset of the HCP Aging data is available to M3 users with permission to access the Lifespan HCP Aging data collection:

    • Structural Preprocessed FreeSurfer

    • Structural Preprocessed Extended

    • Resting State fMRI Preprocessed Extended

    • Diffusion Unprocessed

    • Behavioral Data

  • Data version:

    2.0

  • Date of data release:

    2021-02-26

  • Date of data download on M3:

    May 2021

  • Data access process:
    • This data collection is restricted to M3 users who provide a valid Data Use Certificate (DUC) from the NIMH Data Archive (NDA).

    • Access instructions, including how to create an account with the NDA, how to submit a request form including project details, and how to wait for NDA approval, are described by the data custodian here.

    • When you receive your DUC from the NDA, please forward it by email to help@massive.org.au.

    • Access to this data has an expiry date. To keep your access to this data collection on M3, apply for a data access extension from the NDA.

    • MASSIVE will keep this data on the storage as long as there is a user or a research group with “valid and active” access approval.

  • Location on M3:

/mnt/reference-restricted/hcp_lifespan/hcp_aging_v2

Baby Connectome Project

  • Brief description:

    A four-year study of children from birth through five years of age, intended to provide a better understanding of how the brain develops from infancy through early childhood and the factors that contribute to healthy brain development. (T1, rs-fMRI, DTI)

  • Data version:

    1.0 release

  • Date of data download on M3:

    Dec 2021

  • Data access process:
    • This data collection is restricted to M3 users who provide a valid Data Use Certificate (DUC) from the NIMH Data Archive (NDA).

    • Access instructions, including how to create an account with the NDA, how to submit a request form including project details, and how to wait for NDA approval, are described by the data custodian here.

    • When you receive your DUC from the NDA, please forward it by email to help@massive.org.au.

    • Access to this data has an expiry date. To keep your access to this data collection on M3, apply for a data access extension from the NDA.

    • MASSIVE will keep this data on the storage as long as there is a user or a research group with “valid and active” access approval.

  • Location on M3:

/mnt/reference-restricted/hcp_baby/hcp_baby_20211210

Human Connectome Project for Early Psychosis

  • Brief description:

    Clinical data for individuals with early phase psychosis within 5 years of onset, including behavioural and neuroimaging data. The following data is available on M3 to users with permission to access the HCP Early Psychosis data collection:

    • Unprocessed data of all modalities (structural MRI, resting-state fMRI, and diffusion MRI) for 183 subjects

    • Minimally preprocessed structural MRI data for 169 subjects

    • Clinical and behavioral data for 251 subjects (68 more than were released previously)

  • Data version:

    1.1 release

  • Date of data release:

    2021-08-19

  • Date of data download on M3:

    Aug 2021

  • Data access process:
    • This data collection is restricted to M3 users who provide a valid Data Use Certificate (DUC) from the NIMH Data Archive (NDA).

    • Access instructions, including how to create an account with the NDA, how to submit a request form including project details, and how to wait for NDA approval, are described by the data custodian here.

    • When you receive your DUC from the NDA, please forward it by email to help@massive.org.au.

    • Access to this data has an expiry date. To keep your access to this data collection on M3, apply for a data access extension from the NDA.

    • MASSIVE will keep this data on the storage as long as there is a user or a research group with “valid and active” access approval.

  • Location on M3:

/mnt/reference-restricted/hcp_ep/hcp_ep_v1.1

Developing Human Connectome Project (dHCP)

  • Brief description:

    This data collection consists of images of 783 neonatal subjects (886 datasets). The imaging data includes structural imaging, structural connectivity data (diffusion MRI) and functional connectivity data (resting-state fMRI). This data release comes with minimal accompanying metadata: sex, age at birth, age at scan, birthweight, head circumference and radiology score. More information about available data in this collection can be found here.

  • Data version:

    3.1 release; including patch1 with previously missing data

  • Date of data release:

    June 2021

  • Date of data download on M3:

    Aug 2021

  • Data access process:
    • To access this data, you will need to register at https://data.developingconnectome.org/app/template/Login.vm. On the dHCP data login page, select ‘Register’ and fill out the mandatory fields, including Lab/Department, Institute Full Name, Country and your institutional email address.

    • You will receive a verification email from the data custodian. After you verify your email address, you will be able to log in to sign an agreement and access the public face of the dHCP database (db).

    • Please forward the access approval email to help@massive.org.au.

    • If you have completed the above steps, you can request access to this data on M3 with this link: Developing Human Connectome Project (dHCP).

  • Location on M3:

/mnt/reference2/developing_hcp

Brain Genomics Superstruct Project (GSP)

  • Brief description:

    Large scale imaging data sets are necessary to address complex questions regarding the relationship between brain and behavior. The Brain Genomics Superstruct Project Open Access Data Release exposes a carefully vetted collection of neuroimaging, behavior, cognitive, and personality data for over 1,500 human participants. Each neuroimaging data set includes one high-resolution Magnetic Resonance Imaging (MRI) acquisition and one or more resting-state functional MRI acquisitions. Each functional acquisition is accompanied by a fully-automated quality assessment, and pre-computed brain morphometrics are also provided. The imaging data are stored in 10 separate tar files, each containing 157 subjects. There is a single description .csv file that contains the demographic and phenotype data for all 1570 unique subjects. All 10 tar files have been downloaded to obtain the full n=1570 dataset. The tar files have also been uncompressed to help users with data processing.

  • Data version:

    10.5

  • Date of data release:

    2020-03-09

  • Date of data download on M3:

    2020-09-10

  • Data access process:
  • Location on M3:

/scratch/gsp/gsp-20200910

Nathan Kline Institute Rockland Sample (NKI-RS): Neuroimaging Release

  • Brief description:

    The Rockland Sample currently comprises data from four studies; please see the NITRC website for information about those studies. The NKI-RS Neuroimaging Release contains imaging data, physiological data acquired during scan acquisition (cardiac and respiratory), and limited phenotyping (age, sex, and handedness). No psychiatric, cognitive, or behavioral information is included. On this webpage you can find more information about the scans that are included for subjects in the Cross-Sectional Lifespan Connectomics Study, Longitudinal Developmental Connectomics Study, and Mapping Interindividual Variation In The Aging Connectome studies. This is the latest data release of the raw data, organized in the BIDS format. This folder includes data from all the releases. Since BIDS makes provisions for phenotypic data and data collected during scanning (physiological, event-related), this data is also included in this folder in addition to the MRI series NIfTIs. DICOMs are not included.

  • Data version:

    RawDataBidsLatest

  • Date of data release:

    December 2019

  • Date of data download on M3:

    2020-08-19

  • Data access process:

    As advised on the NITRC website, the Neuroimaging Release is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit the HPC ID portal to register your data access request.

  • Location on M3:

/scratch/nkirs-ndr/RawDataBidsLatest_20200819
  • Tips on using data on M3:

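    You can create a symbolic link to /scratch/nkirs-ndr/RawDataBidsLatest_20200819 in your home directory using the commands below. The first command creates the parent directory, which must exist before the link can be made:

mkdir -p ~/nkir-ndr
ln -s /scratch/nkirs-ndr/RawDataBidsLatest_20200819 ~/nkir-ndr/RawDataBidsLatest_20200819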
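Because this release is organised in the BIDS format, each subject's data lives in a directory whose name starts with sub-, which makes it easy to check how many subjects are visible. The following is a minimal sketch: the function name count_bids_subjects is hypothetical, and it assumes that the subject directories sit directly under the release directory (the exact layout on M3 is not described here).

```shell
# Count the BIDS subject directories (named sub-<label>) found directly
# under the given dataset root.
count_bids_subjects() {
    local root="$1" dir n=0
    for dir in "$root"/sub-*/; do
        # If nothing matches, the pattern stays literal and the -d test fails
        if [ -d "$dir" ]; then
            n=$((n + 1))
        fi
    done
    printf '%s\n' "$n"
}

# Example call (commented out so the snippet can be sourced anywhere).
# ASSUMPTION: the release directory is itself the BIDS root.
# count_bids_subjects /scratch/nkirs-ndr/RawDataBidsLatest_20200819
```

A result of 0 suggests that the BIDS root is one level further down, in which case the path passed to the function needs adjusting.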

Genomes

BlastDB

  • Brief description:

    BLAST search pages under the Basic BLAST section of the NCBI BLAST home page use a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases are made available as compressed archives in pre-formatted form. The FASTA files reside under the /FASTA subdirectory.

  • Data access process:

    There is no restriction on access to this database. M3 users can access this data by registering their acceptance of the BlastDB terms of access on M3.

  • Location on M3:

/scratch/blastdb/FASTA

Requesting a data collection

MASSIVE is interested in hosting reference data or collections that are valuable to the community. If you would like us to consider hosting a data collection, please email help@massive.org.au with the following details:

  • data collection name

  • URL for download

  • size of data

  • urgency

  • any ethics or licensing restrictions we need to impose

  • details of the community or other users who would find this collection useful