Name                                              Modified    Size
galaxy_utils.py                                   2022-02-05  11.2 kB
catastrophic_outlier_binary_classification.ipynb  2022-01-22  10.5 kB
process_data_for_binary_classifier.ipynb          2022-01-21   6.7 kB
photoz_regression_mlp.ipynb                       2022-01-21   8.3 kB
readme.txt                                        2022-01-15   5.4 kB
models.py                                         2022-01-15   2.1 kB
Totals: 6 items                                               44.2 kB

This package contains Jupyter notebooks and supporting files which do the following:

 - Perform a neural network regression to estimate photo-zs
(photoz_regression_mlp.ipynb)

 - Take a data set with estimated photo-zs, set aside 30% of the galaxies as a base evaluation set, and output training sets for a binary classifier with varying portions of catastrophic outliers using the remaining 70% of the galaxies
(process_data_for_binary_classifier.ipynb)

 - Perform a neural network binary classification to determine catastrophic outliers given a data set with photometry and estimated photo-zs
(catastrophic_outlier_binary_classification.ipynb)

Supporting files:
galaxy_utils.py
models.py

This strategy of binary classification to identify catastrophic outliers is presented in “Machine Learning Classification to Identify Catastrophic Outlier Photometric Redshift Estimates.”  J. Singal, G. Silverman, E. Jones, T. Do, B. Boscoe, and Y. Wan, 2022, submitted (arXiv:2112.07811)

NB: When run in evaluation mode, these programs are currently configured to perform evaluations on testing sets with known actual (spectroscopic) redshifts.  'Blind' evaluations where actual redshifts are not known can be accomplished if desired by substituting dummy values for all entries in evaluation set columns which call for actual redshifts.
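
As a rough illustration of the 'blind' substitution described above, the sketch below overwrites one column of a .csv file with a dummy value. It is not taken from the notebooks: the function name with_dummy_redshifts, the dummy value -1.0, the column names in the sample, and the presence of a single header row are all assumptions made for the example. The position of the actual-redshift column differs between the notebooks, so it is passed in as an argument.

```python
import csv
import io

def with_dummy_redshifts(text, column, dummy="-1.0"):
    """Return CSV text with every entry of one column replaced by a dummy value.

    Hypothetical helper: assumes the first row is a header and that `column`
    is the position (negative values count from the end) of the column that
    calls for actual redshifts.
    """
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(rows[0])  # header row is kept unchanged
    for row in rows[1:]:
        row[column] = dummy  # overwrite whatever was there, including blanks
        writer.writerow(row)
    return out.getvalue()

# Made-up evaluation set for photoz_regression_mlp.ipynb with b = 2:
# the actual redshifts (second-to-last column) are unknown and left blank.
sample = "id,u,g,z_spec,z_phot\n0,21.3,20.1,,0.40\n1,22.0,21.2,,0.91\n"
blind = with_dummy_redshifts(sample, column=-2)
```

Any evaluation metrics that compare against the dummy column are meaningless in this mode; only the estimated values carry information.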
---------------------------------------------
photoz_regression_mlp.ipynb:

For training: 
Input: A .csv file with b+2 columns and n+1 rows, where b is the number of photometric bands (+ other possible features) and n is the number of galaxies.  The first column contains a unique integer index number for each galaxy, from 0 through n-1, not necessarily in numerical order.  The last column contains the known redshifts.

Output: A model file.
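
The training input layout above can be checked before running the notebook with a short sketch like the following. It is only an illustration: the function name split_training_csv and the column names in the sample are hypothetical, and it assumes that the extra row in "n+1 rows" is a single header row.

```python
import csv
import io

def split_training_csv(text):
    """Split training CSV text into (index, features, redshift) lists.

    Assumed layout: one header row, then one row per galaxy with the integer
    index first, b feature columns, and the known redshift last.
    """
    rows = list(csv.reader(io.StringIO(text)))
    body = rows[1:]  # skip the assumed header row
    index = [int(r[0]) for r in body]
    features = [[float(v) for v in r[1:-1]] for r in body]
    redshift = [float(r[-1]) for r in body]
    # The index values should be exactly 0..n-1, in any order.
    if sorted(index) != list(range(len(body))):
        raise ValueError("index column must contain each of 0..n-1 exactly once")
    return index, features, redshift

# Made-up sample with b = 3 bands and n = 3 galaxies (b+2 columns, n+1 rows).
sample = (
    "id,u,g,r,z_spec\n"
    "2,21.3,20.1,19.5,0.42\n"
    "0,22.0,21.2,20.7,0.88\n"
    "1,20.4,19.9,19.6,0.15\n"
)
index, features, redshift = split_training_csv(sample)
```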

For evaluating: 
Inputs: 
1) A previously generated model file.
2) A .csv file with b+3 columns and n+1 rows, where b is the number of photometric bands (+ other possible features).  The first column contains the unique integer index number for each galaxy, from 0 through n-1.  The second-to-last column contains the known redshifts (or dummy values).  The last column contains the estimated redshifts.

---------------------------------------------
process_data_for_binary_classifier.ipynb:

Input: A .csv file with b+3 columns and n+1 rows, where b is the number of photometric bands (+ other possible features) and n is the number of galaxies.  The first column contains a unique integer index number for each galaxy, from 0 through n-1, not necessarily in numerical order.  The second-to-last column contains the estimated photo-zs, while the last column contains the known redshifts.

Outputs: Several .csv files with b+4 columns and n+1 rows, with b and n as above.  The first column contains the unique integer index number for each galaxy, from 0 through n-1.  The third-to-last column contains the estimated photo-zs, the second-to-last column contains a flag (0 or 1) indicating whether a galaxy is a catastrophic outlier or not, and the last column contains the known redshifts.
- a file containing 30% of the galaxies, with the original portion of catastrophic outliers.  This is the 'base evaluation set.'
- files with different portions of catastrophic outliers in which non-catastrophic-outliers are removed randomly from the 70% of galaxies that do not overlap the base evaluation set.
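
The split and resampling described above can be sketched as follows. This is not the notebook's code. The outlier criterion |z_phot - z_spec| > threshold with threshold = 1.0, the convention that a flag of 1 marks a catastrophic outlier, the random shuffle used for the 30% split, and the names flag_outliers and split_and_resample are all assumptions made for illustration.

```python
import random

def flag_outliers(z_phot, z_spec, threshold=1.0):
    """Flag a galaxy with 1 if |z_phot - z_spec| > threshold, else 0.

    The criterion and the default threshold are assumptions for this sketch.
    """
    return [1 if abs(p - s) > threshold else 0 for p, s in zip(z_phot, z_spec)]

def split_and_resample(flags, eval_fraction=0.3, target_portion=0.5, seed=0):
    """Return (eval_ids, train_ids) as sorted lists of row positions.

    eval_ids is a random eval_fraction of the rows, left at its original
    outlier portion. train_ids keeps every outlier from the remaining rows
    and a random subset of non-outliers, sized so that outliers make up
    roughly target_portion of the training set.
    """
    if not 0 < target_portion <= 1:
        raise ValueError("target_portion must be in (0, 1]")
    rng = random.Random(seed)
    ids = list(range(len(flags)))
    rng.shuffle(ids)
    n_eval = round(eval_fraction * len(ids))
    eval_ids, pool = ids[:n_eval], ids[n_eval:]
    outliers = [i for i in pool if flags[i] == 1]
    normals = [i for i in pool if flags[i] == 0]
    # Non-outliers to keep for the requested portion, capped by what is available.
    n_keep = min(len(normals),
                 round(len(outliers) * (1 - target_portion) / target_portion))
    train_ids = outliers + rng.sample(normals, n_keep)
    return sorted(eval_ids), sorted(train_ids)

# Made-up example: 100 galaxies, 10 of them flagged as catastrophic outliers.
flags = [1] * 10 + [0] * 90
eval_ids, train_ids = split_and_resample(flags, target_portion=0.25, seed=1)
```

Calling split_and_resample repeatedly with the same seed and different target_portion values gives training sets that all share one base evaluation set, which matches the intent of keeping the 30% and 70% portions disjoint.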

---------------------------------------------
catastrophic_outlier_binary_classification.ipynb:

For training:
Input: A .csv file with b+4 columns and n+1 rows, where b is the number of photometric bands (+ other possible features) and n is the number of galaxies.  The first column contains a unique integer index number for each galaxy, from 0 through n-1, not necessarily in numerical order.  The third-to-last column contains the estimated photo-zs, the second-to-last column contains a flag (0 or 1) indicating whether a galaxy is a catastrophic outlier or not, and the last column contains the known redshifts.  

Output: A model file.

For testing on the base evaluation set:
Inputs: 
1) A model file.
2) A .csv file with b+4 columns and n+1 rows, where b is the number of photometric bands (+ other possible features) and n is the number of galaxies.  The first column contains a unique integer index number for each galaxy, from 0 through n-1, not necessarily in numerical order.  The third-to-last column contains the estimated photo-zs, the second-to-last column contains a flag (0 or 1) indicating whether a galaxy is a catastrophic outlier or not, and the last column contains the known redshifts.  In most circumstances this should be the base evaluation set.

Output:
A .csv file with b+6 columns and n+1 rows, where b and n are as above.  The first column contains the unique integer index number for each galaxy, from 0 through n-1.  The fifth-to-last column contains the estimated photo-zs, the fourth-to-last column contains a flag (0 or 1) indicating whether a galaxy is a catastrophic outlier or not, the third-to-last column contains the known redshifts, the second-to-last column contains the value of the output neuron of the binary classifier, and the last column contains the prediction (0 or 1) of whether a galaxy is a catastrophic outlier based on a threshold of 0.5 in the output neuron.
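
The last two output columns described above can be illustrated with a minimal sketch. The function name append_predictions is hypothetical, rows are assumed to be plain lists already in the b+4 column order, and the use of a strict comparison at the 0.5 threshold (and of 1 to mean "predicted catastrophic outlier") is an assumption, since the text above does not specify either.

```python
def append_predictions(rows, scores, threshold=0.5):
    """Append the output-neuron value and a 0/1 prediction to each row.

    Assumption: a score strictly above the threshold is labelled 1.
    """
    return [row + [s, 1 if s > threshold else 0] for row, s in zip(rows, scores)]

# Made-up rows with b = 2: index, two bands, z_phot, flag, z_spec (b+4 columns).
rows = [
    [0, 21.3, 20.1, 0.40, 0, 0.42],
    [1, 22.0, 21.2, 2.10, 1, 0.88],
]
scores = [0.12, 0.93]  # hypothetical classifier outputs
out = append_predictions(rows, scores)  # each row now has b+6 columns
```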

  
Source: readme.txt, updated 2022-01-15