SDSS Galaxy Datasets



Data Availability

We are releasing the data used in our work so that others may perform similar investigations and improve on our models. The data are currently hosted on the NERSC web portal, and the available downloads are listed below:

Full unlabeled dataset:
File path: https://portal.nersc.gov/project/dasrepo/self-supervised-learning-sdss/datasets/sdss_combined_train.h5
This 266 GB dataset contains 1,245,254 images in an HDF5 file, under the following key (galaxies are indexed along the first dimension of the array):

  • images: 5-band photometric image, with 107x107 pixels
Due to the size, we recommend using a tool like wget for downloading data. Usage is:
wget https://portal.nersc.gov/project/dasrepo/self-supervised-learning-sdss/datasets/sdss_combined_train.h5
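
Because the file is much larger than typical memory, it can help to read images in slices rather than all at once. The following is a minimal sketch using h5py, assuming the file has already been downloaded to a local path. The function name iter_image_batches and the batch size are illustrative choices, not part of the release. Only the first (galaxy) axis is documented above, so the code makes no assumption about the order of the remaining axes.

```python
import h5py


def iter_image_batches(path, batch_size=256, key="images"):
    """Yield consecutive slices of the image array without loading the whole file."""
    with h5py.File(path, "r") as f:
        images = f[key]      # Dataset handle; no pixel data is read yet
        n = images.shape[0]  # galaxies are indexed along the first dimension
        for start in range(0, n, batch_size):
            # Slicing the Dataset reads only this range from disk into a NumPy array.
            yield images[start:start + batch_size]


# Example (hypothetical local path):
# for batch in iter_image_batches("sdss_combined_train.h5"):
#     print(batch.shape, batch.dtype)
#     break
```

Printing the shape of the first batch is a quick way to confirm how the pixel and band axes are ordered before building a data pipeline on top of the file.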

Training dataset, redshift estimation:
File path: https://portal.nersc.gov/project/dasrepo/self-supervised-learning-sdss/datasets/sdss_w_specz_train.h5
This 85 GB dataset contains 399,982 samples in an HDF5 file, with the following attributes stored as dataset keys (galaxies are indexed along the first dimension of each array):

  • ObjID: Object ID from SDSS database
  • ra: Right ascension of galaxy
  • dec: Declination of galaxy
  • e_bv: Galactic reddening in SFD E(B-V) for the galaxy
  • specObjID: The SpecObjID for the galaxy, from SpecObj table in SDSS
  • specz_redshift: Spectroscopic redshift measurement for the galaxy
  • specz_redshift_err: Uncertainty on the spectroscopic redshift measurement for the galaxy
  • images: 5-band photometric image, with 107x107 pixels
Due to the size, we recommend using a tool like wget for downloading data. Usage is:
wget https://portal.nersc.gov/project/dasrepo/self-supervised-learning-sdss/datasets/sdss_w_specz_train.h5

Validation dataset, redshift estimation:
File path: https://portal.nersc.gov/project/dasrepo/self-supervised-learning-sdss/datasets/sdss_w_specz_valid.h5
This 22 GB dataset contains 102,993 samples in an HDF5 file, with the following attributes stored as dataset keys (galaxies are indexed along the first dimension of each array):

  • ObjID: Object ID from SDSS database
  • ra: Right ascension of galaxy
  • dec: Declination of galaxy
  • e_bv: Galactic reddening in SFD E(B-V) for the galaxy
  • specObjID: The SpecObjID for the galaxy, from SpecObj table in SDSS
  • specz_redshift: Spectroscopic redshift measurement for the galaxy
  • specz_redshift_err: Uncertainty on the spectroscopic redshift measurement for the galaxy
  • images: 5-band photometric image, with 107x107 pixels
Due to the size, we recommend using a tool like wget for downloading data. Usage is:
wget https://portal.nersc.gov/project/dasrepo/self-supervised-learning-sdss/datasets/sdss_w_specz_valid.h5
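
The training and validation files for redshift estimation list the same keys, so one loader can serve both. The sketch below, again assuming h5py and a locally downloaded file, reads only the small per-galaxy arrays into memory and leaves the large images array on disk. The function name load_labels and the consistency check are illustrative additions; the key names are taken from the lists above.

```python
import h5py

# Per-galaxy keys listed above; "images" is deliberately left on disk.
LABEL_KEYS = ("ObjID", "ra", "dec", "e_bv", "specObjID",
              "specz_redshift", "specz_redshift_err")


def load_labels(path, keys=LABEL_KEYS):
    """Read the per-galaxy metadata arrays and check they align with the images."""
    with h5py.File(path, "r") as f:
        labels = {k: f[k][...] for k in keys}  # [...] reads the full array
        n_images = f["images"].shape[0]        # shape lookup reads no pixel data
    for name, values in labels.items():
        if values.shape[0] != n_images:
            raise ValueError(f"{name} has {values.shape[0]} rows, expected {n_images}")
    return labels


# Example (hypothetical local path):
# labels = load_labels("sdss_w_specz_valid.h5")
# print(labels["specz_redshift"][:5])
```

Since every array is indexed by galaxy along its first dimension, row i of each label array corresponds to image i in the same file.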


Data Selection Overview

Our database of galaxies is assembled from Data Release 12 of the SDSS. To pull samples with spectroscopic redshift labels, we follow the procedure of Pasquet et al. (2019) and draw from the Main Galaxy Sample, which enables direct comparison with their results. Their SQL query filters for objects classified as 'GALAXY' with dereddened Petrosian magnitudes less than 17.8 and spectroscopic redshifts below 0.4. For us, the query returns 547,224 objects; after removing duplicates, we are left with 517,190 to use as labeled training examples. When fine-tuning our image representations for the photo-z estimation task, we use roughly 400,000 images for training and 103,000 for validation.

To build our larger set of galaxies with no spectroscopic labels, we filter the 'PhotoObjAll' full photometric catalog of the SDSS for objects with dereddened Petrosian magnitudes less than 17.8. From the resulting set of galaxies, we remove duplicates that were already included in our spectroscopic sample, and exclude samples with an estimated photometric redshift (as tabulated in the SDSS database) above 0.8. This cut eliminates objects that are too distant compared to the spectroscopic sample, while its generous threshold reduces the chance that we unnecessarily exclude samples whose true redshift is less than 0.4 (the cutoff for our spectroscopic sample) because of inaccurate photo-z estimates. After imposing these cuts, we were able to successfully pull 845,254 objects. The spatial distributions of our labeled and unlabeled galaxy datasets are shown below.
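
The selection logic for the unlabeled set can be sketched as a simple filter. This is an illustration only, not the query used to build the release: the row field names (objid, dered_petro_mag, photoz) are hypothetical, and the treatment of values exactly at a threshold is an assumption. Following the stated rationale that the photometric redshift cut removes objects too distant relative to the spectroscopic sample, the sketch drops objects whose photo-z is 0.8 or higher.

```python
def select_unlabeled(photo_rows, spec_obj_ids, mag_limit=17.8, photoz_limit=0.8):
    """Keep bright objects that are not in the spectroscopic sample and not too distant.

    photo_rows:   iterable of dicts with hypothetical keys
                  "objid", "dered_petro_mag", and "photoz".
    spec_obj_ids: set of object IDs already in the spectroscopic sample.
    """
    selected = []
    for row in photo_rows:
        if row["dered_petro_mag"] >= mag_limit:
            continue  # fainter than the magnitude cut
        if row["objid"] in spec_obj_ids:
            continue  # duplicate of a labeled spectroscopic object
        if row["photoz"] >= photoz_limit:
            continue  # estimated to be too distant (boundary handling assumed)
        selected.append(row)
    return selected


rows = [
    {"objid": 1, "dered_petro_mag": 17.0, "photoz": 0.1},
    {"objid": 2, "dered_petro_mag": 18.0, "photoz": 0.1},
    {"objid": 3, "dered_petro_mag": 17.0, "photoz": 0.1},
    {"objid": 4, "dered_petro_mag": 17.0, "photoz": 0.9},
    {"objid": 5, "dered_petro_mag": 17.5, "photoz": 0.5},
]
kept = select_unlabeled(rows, spec_obj_ids={3})
print([r["objid"] for r in kept])
```

Object 5 in the toy input has a photo-z above 0.4 but is still kept, which reflects the point made above: the looser photo-z threshold avoids discarding galaxies whose true redshift may lie below 0.4.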


SDSS photometric images contain data in 5 passbands (ugriz); they come background-subtracted but are not de-reddened to account for Galactic extinction. To pull images for our datasets, we use the Montage tool to query the imagery catalog in SDSS Data Release 9 (DR9), based on the tabulated equatorial coordinates for each object in our dataset. For each set of object coordinates, we sample a patch of sky 0.012 degrees on a side, centered on the object, and project it onto a 2D image with 107x107 pixels (this ensures the resulting pixel scale is as close as possible to the native pixel scale in the SDSS, 0.396 arcsec). In each image, we store the u, g, r, i, and z passbands as 5 color channels.
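
The relation between image size and pixel scale can be checked with a short calculation. The sketch below reads the quoted 0.012 as the angular side length of the patch in degrees, which is an assumption, and shows that 107 pixels at the native 0.396 arcsec scale span an angle that rounds to this value. The function name is illustrative.

```python
ARCSEC_PER_DEGREE = 3600.0


def patch_side_degrees(n_pixels=107, pixel_scale_arcsec=0.396):
    """Angular side length, in degrees, of a square image at the given pixel scale."""
    return n_pixels * pixel_scale_arcsec / ARCSEC_PER_DEGREE


side = patch_side_degrees()
print(f"{side:.5f} degrees per side")  # about 0.0118 degrees
print(round(side, 3))                  # rounds to the quoted 0.012
```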

Note that during training of the self-supervised model, we apply random rotations and random jitter to each image before cropping out the central portion, as a form of data augmentation. So while our photometric images contain 107 pixels per side, the CNNs in this work only view samples of size 64x64 pixels. This input size is consistent with the photo-z CNN model of Pasquet et al. (2019).
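
To make the rotate, jitter, and crop sequence concrete, here is a minimal NumPy sketch. It is not the augmentation code used in this work: the uniform rotation angle, the jitter range of 7 pixels, nearest-neighbour sampling, and the channels-last layout (height, width, band) are all assumptions chosen for illustration. The three steps are folded into a single inverse mapping from each output pixel back to a source pixel.

```python
import numpy as np


def augment(image, rng, crop=64, max_jitter=7):
    """Randomly rotate and shift an image, then return the central crop.

    image: array of shape (H, W, C); the channels-last layout is an assumption.
    rng:   a numpy.random.Generator.
    """
    h, w = image.shape[:2]
    angle = rng.uniform(0.0, 2.0 * np.pi)                       # rotation angle
    dy, dx = rng.integers(-max_jitter, max_jitter + 1, size=2)  # pixel jitter

    # Offsets of each output pixel relative to the crop centre.
    offsets = np.arange(crop) - (crop - 1) / 2.0
    oy, ox = np.meshgrid(offsets, offsets, indexing="ij")

    # Inverse mapping: rotate the output offsets and shift them into the source frame.
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    src_y = (h - 1) / 2.0 + dy + cos_a * oy - sin_a * ox
    src_x = (w - 1) / 2.0 + dx + sin_a * oy + cos_a * ox

    # Nearest-neighbour lookup; clipping guards against indices just off the edge.
    iy = np.clip(np.floor(src_y + 0.5).astype(int), 0, h - 1)
    ix = np.clip(np.floor(src_x + 0.5).astype(int), 0, w - 1)
    return image[iy, ix]


rng = np.random.default_rng(0)
example = rng.random((107, 107, 5))
print(augment(example, rng).shape)
```

With a 107-pixel source and a 64-pixel crop, the rotated crop window stays inside the source image for small jitter, which is one practical reason to store images larger than the network input.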

Acknowledgements

Funding for SDSS-III has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, and the U.S. Department of Energy Office of Science. The SDSS-III web site is http://www.sdss3.org/. SDSS-III is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS-III Collaboration including the University of Arizona, the Brazilian Participation Group, Brookhaven National Laboratory, Carnegie Mellon University, University of Florida, the French Participation Group, the German Participation Group, Harvard University, the Instituto de Astrofisica de Canarias, the Michigan State/Notre Dame/JINA Participation Group, Johns Hopkins University, Lawrence Berkeley National Laboratory, Max Planck Institute for Astrophysics, Max Planck Institute for Extraterrestrial Physics, New Mexico State University, New York University, Ohio State University, Pennsylvania State University, University of Portsmouth, Princeton University, the Spanish Participation Group, University of Tokyo, University of Utah, Vanderbilt University, University of Virginia, University of Washington, and Yale University.