This documentation is under active development and may change over time as we refine it. Please email if you require assistance or have suggestions to improve this documentation.

Data Collections on M3

MASSIVE hosts copies of the following reference data sets and large data collections, and is interested in hosting more data sets that are valuable to the community.

Machine learning


ImageNet

  • Brief description:

    ImageNet is an image database organised according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.

  • Data access process:

    M3 users can access the Fall 2011 version of the ImageNet database by registering their acceptance of the terms of access.

  • Location on M3:


International Skin Imaging Collaboration 2019 (ISIC 2019)

  • Brief description:
    A repository of dermoscopic images developed by the International Skin Imaging Collaboration (ISIC), intended both for clinical training and for supporting technical research toward automated algorithmic analysis. 25,331 images are available for training across 8 categories:
    • Melanoma

    • Melanocytic nevus

    • Basal cell carcinoma

    • Actinic keratosis

    • Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis)

    • Dermatofibroma

    • Vascular lesion

    • Squamous cell carcinoma

  • Date of data release:

    August 2019

  • Date of data download on M3:


  • Data access process:

    As advised here, ISIC 2019 is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit this link to register your data access request.

  • Location on M3:


NIH Chest X-ray Dataset (NIH CXR-14)

  • Brief description:
    The NIH Chest X-ray Dataset includes 112,120 frontal-view X-ray images from 30,805 unique patients, with text-mined image labels gathered from radiological reports using natural language processing. There are 14 labels, and a single image may carry more than one label. The 14 common thoracic pathologies are:
    • Atelectasis

    • Consolidation

    • Infiltration

    • Pneumothorax

    • Edema

    • Emphysema

    • Fibrosis

    • Effusion

    • Pneumonia

    • Pleural thickening

    • Cardiomegaly

    • Nodule

    • Mass

    • Hernia

  • Date of data release:


  • Date of data download on M3:


  • Data access process:

    As advised here, the NIH Chest X-ray dataset is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit this link.

  • Location on M3:



Human Connectome Project Dataset (HCP): HCP-1200

  • Brief description:
    A copy of the following HCP data release is available on M3:
    • The 1200 Subjects Release (S1200) that includes behavioural and 3T MR imaging data from 1206 healthy young adult participants (1113 with structural MR scans) collected in 2012-2015. In addition to 3T MR scans, 184 subjects have multimodal 7T MRI scan data and 95 subjects also have some resting-state MEG (rMEG) and/or task MEG (tMEG) data available. For the first time, 3T MRI and behavioural retest data for 46 subjects is also available.

  • Date of data release:


  • Data access process:

    This data is restricted to M3 users who have an HCP account registered with their email address AND have accepted the HCP terms and conditions associated with the datasets. The process for getting access to this data is outlined below:

    • Create an Account at HCP
      1. Follow the “Get Data” link to:

      2. Select “Log in to ConnectomeDB”

      3. Select “Register”

      4. Create an account and make sure that your email address matches the email associated with your MASSIVE account

      5. You will receive an email to validate your account

    • Accept HCP data use Terms and Conditions
      1. Log in to

      2. Accept terms and conditions using the “Data Use Terms Required” button on all 3 datasets (WU-Minn, WU-Minn HCP Lifespan Pilot and MGH HCP Adult Diffusion)

    • Request Access to HCPdata at MASSIVE

      If you have completed the Human Connectome Project steps above, you can request access with this link: HCPData. We will verify your MASSIVE email against the HCP site and grant access.

  • Location on M3:

  • Tips on using data on M3:
    • HCP-1200 data is stored in compressed format. We have put together an example script that is reasonably optimised for working with compressed data on high-performance computing systems. Please see the script in the following CVL GitHub repository, and read the notes and comments in the file. Using this script and the hints it provides, HCP data can be uncompressed in a reasonable time without storing files in your project’s scratch directory.

    • You can create a symbolic link to the file location /scratch/hcp1200 in your home directory using:

ln -s /scratch/hcp1200 ~/hcp1200
  • Link to the source:
  • Notes to consider:
    • The HCP-1200 data version on M3 has major issues with the preprocessed 7T fMRI data, as recognised and reported by the HCP community here. We therefore strongly recommend not using the 7T fMRI data in this data collection. We will remove the affected data from this collection and replace it with the reprocessed 7T fMRI data released in 2018, which fixes the issues described in this link. We will update this document when the replacement is complete.

    • The Connectome Workbench is available via the MASSIVE Desktop. Any updates to this or other software requirements can be directed to
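The HCP-1200 tip above, about working with compressed data without filling your project's scratch directory, can be sketched roughly as follows. This is not the CVL example script referred to above, only a minimal illustration of the idea. The function name run_on_extracted is hypothetical, the archive format on M3 is not stated here (so the sketch accepts both .zip and gzip-compressed tar files), and it assumes that TMPDIR points at temporary storage suitable for your job.

```shell
# Sketch: extract one archive into a temporary directory, run a command
# inside the extracted tree, then remove the temporary directory.
# run_on_extracted is a hypothetical helper, not part of any M3 tooling.
run_on_extracted() {
    local archive="$1"; shift
    local workdir status

    # Assumption: TMPDIR points at temporary storage; fall back to /tmp.
    workdir="$(mktemp -d "${TMPDIR:-/tmp}/hcp_extract.XXXXXX")" || return 1

    # Pick an extractor from the file extension.
    case "$archive" in
        *.zip)          unzip -q "$archive" -d "$workdir" ;;
        *.tar.gz|*.tgz) tar -xzf "$archive" -C "$workdir" ;;
        *)              echo "unsupported archive type: $archive" >&2; false ;;
    esac
    status=$?

    if [ "$status" -eq 0 ]; then
        # Run the caller's command with the extracted tree as working directory.
        ( cd "$workdir" && "$@" )
        status=$?
    fi

    # Always clean up, whether or not the command succeeded.
    rm -rf "$workdir"
    return "$status"
}

# Example call (the archive path is a placeholder, not a real M3 path):
#   run_on_extracted /path/to/one_subject_archive.zip ls -R
```

Extracting one archive at a time and deleting it straight after use keeps the temporary footprint to a single archive's worth of files, at the cost of re-extracting if the same data is needed again.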

Brain Genomics Superstruct Project (GSP)

  • Brief description:

    Large-scale imaging data sets are necessary to address complex questions regarding the relationship between brain and behavior. The Brain Genomics Superstruct Project Open Access Data Release exposes a carefully vetted collection of neuroimaging, behavioral, cognitive, and personality data for over 1,500 human participants. Each neuroimaging data set includes one high-resolution Magnetic Resonance Imaging (MRI) acquisition and one or more resting-state functional MRI acquisitions. Each functional acquisition is accompanied by a fully automated quality assessment, and pre-computed brain morphometrics are also provided. The imaging data are distributed as 10 separate tar files, each containing 157 subjects. A single description .csv file contains the demographic and phenotype data for all 1570 unique subjects. All 10 tar files have been downloaded to obtain the full n=1570 dataset, and they have been unpacked on M3 to help users with data processing.

  • Data version:


  • Date of data release:


  • Date of data download on M3:


  • Data access process:
  • Location on M3:

  • Tips on using data on M3:

    You can create a symbolic link to /scratch/gsp/gsp-20200910 in your home directory using:

ln -s /scratch/gsp/gsp-20200910 ~/gsp-20200910
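A plain ln -s will happily create a dangling link when the target is missing, and will place the link inside the destination when the link name is an existing directory. The following is a minimal sketch of a more defensive version of the command above. The function name link_collection is hypothetical and is not an M3-provided tool.

```shell
# Sketch: create a symbolic link to a data collection only when it is safe.
# link_collection is a hypothetical helper name.
link_collection() {
    local target="$1" link="$2"

    # The collection must exist and be visible to the current user.
    if [ ! -d "$target" ]; then
        echo "collection not found (has access been granted?): $target" >&2
        return 1
    fi

    # Never overwrite or nest inside something that already exists.
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "refusing to overwrite existing path: $link" >&2
        return 1
    fi

    # Create the parent directory of the link if needed, then link.
    mkdir -p "$(dirname "$link")" && ln -s "$target" "$link"
}

# Example call, using the GSP location given above:
#   link_collection /scratch/gsp/gsp-20200910 "$HOME/gsp-20200910"
```

The same helper applies unchanged to the other collections on this page that are accessed through a symbolic link.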

Nathan Kline Institute Rockland Sample (NKI-RS): Neuroimaging Release

  • Brief description:

    The Rockland Sample currently comprises data from four studies; please see here for information about those studies. The NKI-RS Neuroimaging Release contains imaging data, physiological data acquired during scanning (cardiac and respiratory), and limited phenotyping (age, sex, and handedness). No psychiatric, cognitive, or behavioral information is included. Here you can find more information about the scans that are included for subjects in the Cross-Sectional Lifespan Connectomics Study, the Longitudinal Developmental Connectomics Study, and the Mapping Interindividual Variation In The Aging Connectome study. This is the latest release of raw data, organized in the BIDS format. This folder includes data from all the releases. Since BIDS makes provision for phenotypic data and for data collected during scanning (physiological, event-related), these data are also included in this folder in addition to the MRI series NIfTI files. DICOMs are not included.

  • Data version:


  • Date of data release:

    December 2019

  • Date of data download on M3:


  • Data access process:

    As advised here, Neuroimaging Release is immediately available to users and does not require a data usage agreement. To access this data on M3, please visit this link to register your data access request.

  • Location on M3:

  • Tips on using data on M3:

    You can create a symbolic link to /scratch/nkirs-ndr/RawDataBidsLatest_20200819 in your home directory using:

mkdir -p ~/nkirs-ndr
ln -s /scratch/nkirs-ndr/RawDataBidsLatest_20200819 ~/nkirs-ndr/RawDataBidsLatest_20200819
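Because the NKI-RS release is organized in the BIDS format, subject folders follow the BIDS sub-<label> naming convention, which makes a quick inventory straightforward. The sketch below assumes that the folder given above is itself the top level of the BIDS dataset, which is not confirmed here. The function name list_bids_subjects is hypothetical.

```shell
# Sketch: print the subject directory names found at the top level of a
# BIDS dataset. list_bids_subjects is a hypothetical helper name.
list_bids_subjects() {
    local root="$1" dir
    for dir in "$root"/sub-*/; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}

# Example: count the subjects in the collection
# (assumes the folder below is the BIDS root):
#   list_bids_subjects /scratch/nkirs-ndr/RawDataBidsLatest_20200819 | wc -l
```

If the count is zero, the BIDS root is probably one level further down than assumed, and the function can be pointed at that subfolder instead.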



NCBI BLAST databases

  • Brief description:

    The BLAST search pages under the Basic BLAST section of the NCBI BLAST home page use a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases are made available as compressed archives in pre-formatted form. The FASTA files reside under the /FASTA subdirectory.

  • Data access process:

    There is no restriction on access to this database. M3 users can access this data by registering their acceptance of the terms of access on M3.

  • Location on M3:


Requesting a data collection

MASSIVE is interested in hosting reference data or collections that are valuable to the community. If you would like us to consider hosting a data collection, please email with the following details:

  • data collection name

  • URL for download

  • size of data

  • urgency

  • if there are any ethics or licensing restrictions we need to impose

  • details of the community or other users who would find this collection useful