CosmoFlow Datasets

N-body cosmological simulation data for machine learning


This page describes the latest CosmoFlow dataset, which consists of data from around 10,000 cosmological N-body dark matter simulations. The simulations are run using MUSIC to generate the initial conditions and are evolved with pyCOLA, a multithreaded Python/Cython N-body code. The output of each simulation is then binned into a 3D histogram of particle counts on a 512x512x512 grid, sampled at 4 different redshifts. More details on the process of generating these datasets can be found in the original CosmoFlow paper.
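The binning step described above can be sketched with NumPy's histogramdd. This is an illustrative stand-in only (random particle positions and a small grid instead of the dataset's 512^3), not the actual CosmoFlow pipeline:

```python
import numpy as np

# Illustrative stand-in for the binning step: particle positions from an
# N-body snapshot are counted into a 3D histogram. The real dataset uses a
# 512^3 grid; a small grid keeps this example lightweight.
rng = np.random.default_rng(0)
box_size = 512.0                                           # simulation box side length
positions = rng.uniform(0.0, box_size, size=(10_000, 3))   # stand-in particle positions

grid = 8  # 512 in the actual dataset
hist, edges = np.histogramdd(positions, bins=(grid, grid, grid),
                             range=[(0.0, box_size)] * 3)

print(hist.shape)       # (8, 8, 8)
print(int(hist.sum()))  # 10000 -- every particle lands in exactly one bin
```

In the real dataset the same counting is repeated for each of the 4 redshift snapshots, producing one 3D histogram per snapshot.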

The governing cosmological parameters of interest in this dataset are varied uniformly around a mean value with a 30% spread. For the purposes of machine learning it is convenient to have normalized data and labels, so the labels corresponding to these cosmological parameters are stored both as normalized unit values in the range [-1, 1] and as physical values, i.e. the actual numbers fed into the cosmological simulations. The mapping from unit labels to physical parameter values is P = m + U*h, where P is the physical parameter value, m is the mean physical value being varied around, U is the unit label, and h is the half-width of the spread.
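As a concrete illustration (not part of the dataset's own tooling), the mapping can be applied in a few lines of Python; since the spread is 30% of the mean, the half-width is h = 0.3*m:

```python
# Sketch of the unit-label -> physical-parameter mapping P = m + U*h,
# using Omega_m (mean 0.30, 30% spread) as an example.
def to_physical(unit_value, mean, spread=0.30):
    """Map a normalized label U in [-1, 1] to a physical value P = m + U*h."""
    half_width = mean * spread  # h: half-width of the 30% spread
    return mean + unit_value * half_width

print(round(to_physical(-1.0, 0.30), 2))  # 0.21 (lower edge of the Omega_m range)
print(round(to_physical(0.0, 0.30), 2))   # 0.3  (mean value)
print(round(to_physical(1.0, 0.30), 2))   # 0.39 (upper edge of the Omega_m range)
```

The inverse mapping, U = (P - m)/h, recovers the normalized label from a physical value.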

All data for this set is stored in the HDF5 format, with one file per universe/simulation (1GB per file). The main directory for this dataset contains several subdirectories, each holding its own subset of HDF5 files to reduce access times. Each HDF5 file stores the binned simulation data and the corresponding labels as HDF5 attributes and keys.

There are a total of 10017 universes, with four parameters varied over a 30% spread, and four redshift snapshots per universe:
Omega_m = 0.30 +/- 30%
Omega_L = 1 - Omega_m (flat Universe)
sigma_8 = 0.80 +/- 30%
N_spec = 0.96 +/- 30%
Omega_b = 0.045
H_0 = 70 +/- 30%
boxsize = 512*H_0/70
4 redshifts: [0.0, 0.5, 1.5, 3.0]
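To make the ranges above concrete, here is a hypothetical sketch of drawing one universe's parameters, assuming each varied parameter is sampled uniformly within +/- 30% of its mean as described; the sampling code itself is an illustrative assumption, not the dataset's actual generator:

```python
import numpy as np

# Hypothetical sampler for one universe's parameters (illustration only).
# Omega_m, sigma_8, N_spec, and H_0 are drawn uniformly within +/- 30% of
# their means; Omega_L is fixed by flatness and Omega_b is held constant.
rng = np.random.default_rng(42)
SPREAD = 0.30
means = {"Omega_m": 0.30, "sigma_8": 0.80, "N_spec": 0.96, "H_0": 70.0}

params = {name: rng.uniform(m * (1 - SPREAD), m * (1 + SPREAD))
          for name, m in means.items()}
params["Omega_L"] = 1.0 - params["Omega_m"]    # flat Universe
params["Omega_b"] = 0.045                      # held fixed
params["boxsize"] = 512.0 * params["H_0"] / 70.0

for name, value in params.items():
    print(f"{name}: {value:.4f}")
```

Note that the box size scales with the drawn H_0, matching the boxsize = 512*H_0/70 relation above.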

Recommended Download Instructions

The recommended way to download data is through the Globus interface.

One can also download the data with wget from the command line, although this is considerably slower.