Attention
This documentation is under active development, meaning that it can change over time as we refine it. Please email help@massive.org.au if you require assistance, or have suggestions to improve this documentation.
Data Collections on M3¶
MASSIVE hosts copies of the following reference data sets and large data collections, and is interested in hosting further data sets that are valuable to the community.
Machine learning¶
ImageNet¶
- Brief description:
ImageNet is an image database organised according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.
- Data access process:
M3 users can access the Fall 2011 version of the ImageNet database by registering their acceptance of the terms of access.
Location on M3:
/scratch/imagenet
- Link to the source:
International Skin Imaging Collaboration 2019 (ISIC 2019)¶
- Brief description:
- A repository of dermoscopic images developed by the International Skin Imaging Collaboration (ISIC), intended both for clinical training and for supporting technical research toward automated algorithmic analysis. 25,331 images are available for training across 8 different categories:
Melanoma
Melanocytic nevus
Basal cell carcinoma
Actinic keratosis
Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis)
Dermatofibroma
Vascular lesion
Squamous cell carcinoma
- Date of data release:
August 2019
- Date of data download on M3:
2020-10-02
Location on M3:
/scratch/isic-2019/isic-2019-20201002
- Link to the source
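The 8 categories above can be tallied from a ground-truth table. The following is a minimal sketch only: it assumes a CSV whose first column is the image name and whose remaining columns are one-hot class indicators (one column per category, value 1 for the assigned class). The function name `class_counts` and the file layout are assumptions, so check the files under the M3 location before relying on it.

```shell
#!/usr/bin/env bash
# Count images per class in a one-hot ground-truth CSV.
# Assumption: row 1 is a header, column 1 is the image name,
# and every other column holds 1 (or 1.0) for the assigned class.
class_counts() {
    awk -F',' '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
        { for (i = 2; i <= ncol; i++) if ($i + 0 == 1) count[i]++ }
        END { for (i = 2; i <= ncol; i++) printf "%s\t%d\n", name[i], count[i] }
    ' "$1"
}

# Hypothetical usage (the file name is illustrative only):
#   class_counts /path/to/ground_truth.csv
```

The output is one tab-separated line per class column, in header order, which makes class imbalance easy to spot before training.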
NIH Chest X-ray Dataset (NIH CXR-14)¶
- Brief description:
- The NIH Chest X-ray Dataset includes 112,120 frontal-view X-ray images from 30,805 unique patients, with text-mined image labels gathered from radiological reports using natural language processing. There are 14 labels, and an image may carry more than one label. The 14 common thoracic pathologies are:
Atelectasis
Consolidation
Infiltration
Pneumothorax
Edema
Emphysema
Fibrosis
Effusion
Pneumonia
Pleural thickening
Cardiomegaly
Nodule
Mass
Hernia
- Date of data release:
2017-09-26
- Date of data download on M3:
2020-10-29
Location on M3:
/scratch/nih-cxr-14/nih-cxr-14_20201029
- Link to the source
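Because an image may carry more than one label, per-label counts need the label field to be split first. The sketch below assumes a metadata CSV in which the second column lists an image's labels separated by `|`, as in the upstream release's metadata table. The function name `count_labels` and the file path are assumptions and not part of the M3 copy.

```shell
#!/usr/bin/env bash
# Count how many images carry each label in a multi-label metadata CSV.
# Assumption: row 1 is a header and column 2 holds the labels,
# separated by "|" when an image has more than one.
count_labels() {
    awk -F',' '
        NR > 1 {
            n = split($2, labels, "|")
            for (i = 1; i <= n; i++) count[labels[i]]++
        }
        END { for (l in count) print count[l], l }
    ' "$1" | sort -k1,1nr -k2    # most frequent label first
}

# Hypothetical usage (the file name is illustrative only):
#   count_labels /path/to/metadata.csv
```

An image labelled with two pathologies contributes to both counts, so the totals add up to more than the number of images.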
Neuroimaging¶
Human Connectome Project Dataset (HCP): HCP-1200¶
- Brief description:
- A copy of the following HCP data release is available on M3:
The 1200 Subjects Release (S1200) that includes behavioural and 3T MR imaging data from 1206 healthy young adult participants (1113 with structural MR scans) collected in 2012-2015. In addition to 3T MR scans, 184 subjects have multimodal 7T MRI scan data and 95 subjects also have some resting-state MEG (rMEG) and/or task MEG (tMEG) data available. For the first time, 3T MRI and behavioural retest data for 46 subjects is also available.
- Date of data release:
2017-03-01
- Data access process:
Access to this data is restricted to M3 users who have an HCP account registered with their email address AND have accepted the HCP terms and conditions associated with the datasets. The process for getting access is as follows:
- Create an Account at HCP
Follow the “Get Data” link to: http://www.humanconnectome.org/data/
Select “Log in to ConnectomeDB”
Select “Register”
Create an account and make sure that your email address matches the email associated with your MASSIVE account
You will receive an email to validate your account
- Accept HCP data use Terms and Conditions
Log in to https://db.humanconnectome.org/
Accept terms and conditions using the “Data Use Terms Required” button on all 3 datasets (WU-Minn, WU-Minn HCP Lifespan Pilot and MGH HCP Adult Diffusion)
- Request Access to HCPdata at MASSIVE
If you have completed the Human Connectome Project steps above you can request access with this link: HCPData. We will verify your MASSIVE email against the HCP site and grant access.
Location on M3:
/scratch/hcp1200
- Tips on using data on M3:
HCP-1200 data is stored in a compressed format. We have put together an example script that is reasonably optimised for working with compressed data on high-performance computing (HPC) systems. Please see the script in the following CVL GitHub repository, and take note of the comments in its README.md file. Using this script and the hints provided, HCP data can be uncompressed in a reasonable time without storing files in your project's scratch directory.
You can create a symbolic link to the file location /scratch/hcp1200 in your home directory using:
ln -s /scratch/hcp1200 ~/hcp1200
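To illustrate the idea of working on compressed data without writing to project scratch, here is a minimal sketch. It is not the CVL script. It assumes the archives are `.zip` files, that `unzip` is available, and that `$TMPDIR` points at node-local storage inside a job. The function name `with_extracted` and the archive path are hypothetical.

```shell
#!/usr/bin/env bash
# Extract one archive into temporary storage, run a command inside it,
# then clean up, so nothing is left in the project's scratch directory.
# Assumptions: the archive is a .zip file and `unzip` is on the PATH.
with_extracted() {
    local archive="$1"; shift
    local workdir
    workdir="$(mktemp -d "${TMPDIR:-/tmp}/hcp.XXXXXX")" || return 1
    # Run in a subshell so the caller's working directory is untouched.
    (
        unzip -q "$archive" -d "$workdir" && cd "$workdir" && "$@"
    )
    local rc=$?
    rm -rf "$workdir"    # always remove the temporary copy
    return "$rc"
}

# Hypothetical usage (the archive path is illustrative only):
#   with_extracted /path/to/archive.zip ls -R
```

The command's exit status is passed back to the caller, so the helper can be used inside job scripts that stop on the first failure.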
- Link to the source:
Link to the 1200 Subjects Data Release reference manual.
- Notes to consider:
The version of the HCP-1200 data on M3 has major issues with the preprocessed 7T fMRI data, as recognised and reported by the HCP community here. We therefore strongly recommend against using the 7T fMRI data in this data collection. We will remove the affected data and replace it with the reprocessed 7T fMRI data released in 2018, which fixes the issues as described in this link. We will update this document when the replacement is complete.
The Connectome Workbench is available via the MASSIVE Desktop. Requests for updates to this or other software can be directed to help@massive.org.au.
Brain Genomics Superstruct Project (GSP)¶
- Brief description:
Large-scale imaging data sets are necessary to address complex questions regarding the relationship between brain and behavior. The Brain Genomics Superstruct Project Open Access Data Release exposes a carefully vetted collection of neuroimaging, behavior, cognitive, and personality data for over 1,500 human participants. Each neuroimaging data set includes one high-resolution Magnetic Resonance Imaging (MRI) acquisition and one or more resting-state functional MRI acquisitions. Each functional acquisition is accompanied by a fully automated quality assessment, and pre-computed brain morphometrics are also provided. The imaging data are distributed as 10 separate tar files, each containing 157 subjects. A single description .csv file contains the demographic and phenotype data for all 1570 unique subjects. All 10 tar files have been downloaded to obtain the full n=1570 dataset, and the tar files have been extracted to help users with data processing.
- Data version:
10.5
- Date of data release:
2020-03-09
- Date of data download on M3:
2020-09-10
- Data access process:
Be sure to read through and accept the GSP terms and conditions.
Please follow the instructions here to create a Dataverse account, go to the GSP Dataverse page (data version = 10.5), and click on request access next to each restricted file to get approval from the source to access this data set. It may take a few days for the source to grant access.
Please forward the access approval email to help@massive.org.au.
If you have completed the above steps, you can request access to this data on M3 with this link: Brain Genomics Superstruct Project (GSP).
We will review your approval email and grant access.
Location on M3:
/scratch/gsp/gsp-20200910
- Tips on using data on M3:
You can create a symbolic link to /scratch/gsp/gsp-20200910 in your home directory using:
ln -s /scratch/gsp/gsp-20200910 ~/gsp-20200910
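Running a plain `ln -s` a second time fails or tries to create a nested link once the link already exists. The following is a small sketch of a repeatable variant. The function name `link_collection` is hypothetical and not an M3 utility.

```shell
#!/usr/bin/env bash
# Create (or refresh) a symbolic link to a data collection.
# Safe to run repeatedly; creates the link's parent directory if needed.
link_collection() {
    local target="$1" link="$2"
    [ -e "$target" ] || { echo "missing: $target" >&2; return 1; }
    mkdir -p "$(dirname "$link")"
    # -f replaces an existing link; -n avoids following it into the target
    ln -sfn "$target" "$link"
}

# Usage with the location given above:
#   link_collection /scratch/gsp/gsp-20200910 ~/gsp-20200910
```

The check on the target means a typo in the path is reported immediately rather than leaving a dangling link in your home directory.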
- Link to the source:
GSP data in the Dataverse Harvard. Metadata can be downloaded from the same link.
Nathan Kline Institute Rockland Sample (NKI-RS): Neuroimaging Release¶
- Brief description:
The Rockland Sample currently comprises data from four studies; please see here for information about those studies. The NKI-RS Neuroimaging Release contains imaging data, physiological data acquired during scanning (cardiac and respiratory), and limited phenotyping (age, sex, and handedness). No psychiatric, cognitive, or behavioral information is included. Here you can find more information about the scans that are included for subjects in the Cross-Sectional Lifespan Connectomics Study, Longitudinal Developmental Connectomics Study, and Mapping Interindividual Variation In The Aging Connectome studies. The copy on M3 is the latest release of raw data organized in the BIDS format, and the folder includes data from all the releases. Since BIDS makes provisions for phenotypic data and data collected during scanning (physiological, event-related), this data is included in the folder in addition to the MRI series NIfTI files. DICOMs are not included.
- Data version:
RawDataBidsLatest
- Date of data release:
December 2019
- Date of data download on M3:
2020-08-19
Location on M3:
/scratch/nkirs-ndr/RawDataBidsLatest_20200819
- Tips on using data on M3:
You can create a symbolic link to /scratch/nkirs-ndr/RawDataBidsLatest_20200819 in your home directory using:
ln -s /scratch/nkirs-ndr/RawDataBidsLatest_20200819 ~/RawDataBidsLatest_20200819
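Since the data is organised in the BIDS format, a quick per-subject overview can be obtained by walking the `sub-*` directories. This is a sketch under the assumption that the subject directories sit directly under the folder you pass in, as the BIDS layout prescribes. The function name `bids_summary` is hypothetical.

```shell
#!/usr/bin/env bash
# Print one line per subject: subject ID, a tab, and the number of
# NIfTI files (.nii or .nii.gz) found anywhere below that subject.
# Assumption: subject directories (sub-<label>) sit directly under the root.
bids_summary() {
    local root="$1" sub
    for sub in "$root"/sub-*/; do
        [ -d "$sub" ] || continue    # skip the unexpanded glob if no subjects
        printf '%s\t%s\n' "$(basename "$sub")" \
            "$(find "$sub" -type f \( -name '*.nii' -o -name '*.nii.gz' \) | wc -l | tr -d ' ')"
    done
}

# Usage with the location given above:
#   bids_summary /scratch/nkirs-ndr/RawDataBidsLatest_20200819
```

On a collection of this size the walk may take a while, so consider running it inside a job rather than on a login node.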
- Link to the source:
Link to data documentation can be found here.
Genomes¶
BlastDB¶
- Brief description:
BLAST search pages under the Basic BLAST section of the NCBI BLAST home page use a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases are made available as compressed archives in pre-formatted form. The FASTA files reside under the /FASTA subdirectory.
- Data access process:
There are no restrictions on access to this database. M3 users can access this data by registering their acceptance of the terms of access on M3.
Location on M3:
/scratch/blastdb/FASTA
- Link to the source:
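As a quick sanity check on a FASTA file, you can count its sequences by counting header lines, which start with `>`. The sketch below assumes files may be either plain text or gzip-compressed, chosen by the `.gz` suffix. The function name `fasta_count` and the file path are hypothetical.

```shell
#!/usr/bin/env bash
# Count sequences in a FASTA file by counting header lines (">...").
# Handles plain files and, by suffix, gzip-compressed ones.
fasta_count() {
    local f="$1"
    case "$f" in
        *.gz) gzip -dc "$f" ;;
        *)    cat "$f" ;;
    esac | grep -c '^>'
}

# Hypothetical usage (the file name is illustrative only):
#   fasta_count /scratch/blastdb/FASTA/<file>
```

Compressed files are streamed through `gzip -dc`, so no uncompressed copy is written to disk.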
Requesting a data collection¶
MASSIVE is interested in hosting reference data or collections that are valuable to the community. If you would like us to consider hosting a data collection, please email help@massive.org.au with the following details:
data collection name
URL for download
size of data
urgency
any ethics or licensing restrictions we need to impose
details of the community or other users who would find this collection useful