SPIDERz (SuPport vector classification for IDEntifying Redshifts) is a customized support vector machine (SVM) package for photometric redshift estimation (photo-z) written for the IDL environment.  The package allows users to easily apply powerful SVM optimization and statistical learning techniques to custom data sets for the purpose of obtaining accurate photo-z estimations and effective probability distributions. Users may apply SPIDERz to traditional data sets consisting of photometric band magnitudes, or alternatively to data sets with additional galaxy parameters (such as shape information) in order to investigate potential correlations between the extra galaxy parameters and redshift.

SPIDERz is straightforward to use, proven effective for photo-z estimation, and implements SVM algorithms in a way that treats all input parameters equally.  The goal is to enable users, even those inexperienced with SVMs, to utilize the predictive power of SVMs with a familiar IDL interface, and to potentially explore the effects of performing SVM photo-z analysis with additional galaxy parameters.  SPIDERz is based in part on algorithmic procedures implemented in LIBSVM, a library of SVM tools written in C++ and Java (Chang and Lin, 2011, ACM Trans. Int. Sys. Tech. 2, pp. 1-27). 

SPIDERz utilizes Support Vector Classification (SVC) with a radial basis function (RBF) kernel and an optional parameter grid search with v-fold cross validation. It has been verified to work with IDL versions 8.3-8.5 and should be backward compatible with reasonably recent earlier versions.  An article about SPIDERz with further discussion of the algorithm and an example of a particular investigation with it is available at http://arxiv.org/abs/1607.00044 (Jones & Singal, 2017, A&A, in press).

Instructions for SPIDERz usage are included below. Most user interaction is through main.pro (with one exception, for randomizing training and evaluation sets, as noted below).  Users are required to pass training and evaluation data sets as separate multi-dimensional arrays as arguments to the main program in a command line call, and may include optional arguments specifying the actions to perform -- training, evaluation, v-fold cross validation, a parameter grid search, or all of these, among other options.  Optional outputs include a vector with the most likely redshift for each galaxy (predicted_z) and an array with a probability distribution over redshift bin ranges for each galaxy. 


MAIN PROGRAM - main.pro
INPUT:
REQUIRED-
training_set_ - A p+1 by n_t array containing the training set galaxy parameters and known redshifts (p parameters plus known spectroscopic redshifts for n_t galaxies), OR, if executing with a previously saved model, use the load_model flag (see below) and pass an empty variable as training_set
evaluation_set_ - A p (or p+1) by n_e array containing the evaluation set galaxy parameters (p parameters, optionally plus known spectroscopic redshifts, for n_e galaxies), OR, if training only, use the save_model flag (see below) and pass an empty variable as evaluation_set
num_params - specifies number of galaxy parameters, excluding spectroscopic redshift 

Explanation: Training and evaluation sets should be passed as p+1 by n_t or n_e arrays, where p is the number of galaxy parameters (photometric band magnitudes and potentially additional information such as morphological parameters) and n is the number of galaxies in each data set (n_t for the training set and n_e for the evaluation set). 

The first columns of each data set array should consist of photometric band magnitudes, optionally followed by additional parameters. Training sets are additionally required to include known spectroscopic redshift values (these are optional for evaluation sets), which must occupy the last column of the multi-dimensional data set array. If spectroscopic redshifts are included in the evaluation set, the program calculates and prints the RMS error, the number of outliers (defined by |z_phot - z_spec|/(1+z_spec) > 0.15), and the class prediction accuracy. 

If predicting redshifts based on an already saved model see below.
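The array layout and the outlier criterion above can be sketched as follows. This is an illustrative NumPy example, not SPIDERz code (SPIDERz itself works on IDL arrays); the variable names and toy values are hypothetical.

```python
import numpy as np

# Each row is one galaxy: the first p columns hold photometric band
# magnitudes (plus any extra parameters), the last column holds spec-z.
p = 5                      # number of galaxy parameters
n_t = 4                    # toy training-set size
rng = np.random.default_rng(0)
mags = rng.uniform(18.0, 24.0, size=(n_t, p))
spec_z = rng.uniform(0.0, 2.0, size=(n_t, 1))
training_set = np.hstack([mags, spec_z])   # p+1 columns per galaxy

# The outlier criterion applied when spec-z is present in the evaluation set:
def is_outlier(z_phot, z_spec, threshold=0.15):
    return np.abs(z_phot - z_spec) / (1.0 + z_spec) > threshold
```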

OPTIONAL INPUT KEYWORDS:
predicted_z = predicted_z  : stores estimated photo-z values inside variable predicted_z after evaluation
pdf = 1 or 2 : calculate redshift probability distribution (PDF) for evaluation galaxies 
                           pdf = 1 uses the 'effective' PDF from the m(m-1)/2 1v1 binary classification problems
                           pdf = 2 uses the improved Platt probability method (see note below)
prob_distribution = prob_distribution : stores calculated PDF for each evaluation galaxy                                     
prob_z = prob_z : stores the most probable redshift estimates from the calculated PDF (if pdf = 1 is used, prob_z will be the same as predicted_z) 
store_model = store_model : preserves predictive model obtained from training instead of performing evaluation 
load_model = load_model : skips training and uses predictive model for evaluation 
do_grid_search = 1 : performs a parameter grid search before training and evaluation 
do_cv = 1 : perform only v-fold cross validation 
nr_fold = v : specify v-folds for cross validation (default v = 2)
do_training = 1 : performs training and evaluation (default 1)
num_params = n : specifies n number of galaxy parameters, excluding spectroscopic redshift (default 5)
test_z_present = 1 : specifies that spectroscopic redshift is included in the evaluation set (default 0)
c : cost regularization parameter in cost function (default 32768)
gamma :  scaling parameter in RBF kernel (default 1)
plot = 1 : display plot of estimated photo-z vs known spectro-z 
eps : stopping tolerance for optimal hyperplane solution (default 1e-3)
binsize = b : specifies the redshift bin size (default 0.1)

OUTPUT: 
predicted_z - vector of length n_e of estimated simple discrete photo-z predictions (most common predicted redshift of the m(m-1)/2 1v1 binary classification problems).
prob_distribution - probability distribution of results with either the improved Platt method or the m(m-1)/2 1v1 binary classification problems solved during pairwise coupling evaluation process with m-classes as set above. Saved as a multidimensional m x n_e array, where m is the number of redshift bins and n_e is the number of galaxies in the evaluation set.  Values are percentage probability in each bin ranging from the lowest to highest redshift values in the training set.
svm_model - for cases where store_model = 1, the predictive model obtained from training is stored as an object in this variable.
If a grid search is performed, optimized C and gamma values are printed.
If the evaluation_set contains known spectro-z, the RMS error, outlier count, and class prediction accuracy from the evaluation process are printed.
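To illustrate how a prob_distribution-style output can be read, here is a NumPy sketch (not SPIDERz internals; the array values, z_min, and the bin-center convention are assumptions for illustration). Rows index the m redshift bins, columns the n_e evaluation galaxies, and entries are percentage probabilities.

```python
import numpy as np

prob_distribution = np.array([
    [10.0, 70.0],
    [60.0, 20.0],
    [30.0, 10.0],
])                                   # m = 3 bins, n_e = 2 galaxies
binsize = 0.1
z_min = 0.0                          # lowest redshift in the training set

bin_index = np.argmax(prob_distribution, axis=0)   # most probable bin per galaxy
prob_z = z_min + (bin_index + 0.5) * binsize       # bin-center redshift estimate
```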


EXAMPLE USAGE:
The following command performs training and evaluation; predicted photo-z values (predicted_z) and probability distributions (prob_distribution) are preserved after evaluation:
main, training_set, evaluation_set, predicted_z=predicted_z, pdf=2, prob_distribution=prob_distribution, do_training=1, num_params=5, test_z_present=1, gamma=.0675
num_params=5 indicates that there are 5 galaxy parameters (excluding redshift) in the training_set and evaluation_set arrays.
test_z_present=1 indicates that the known spectro-z values for galaxies in the evaluation set are included.
gamma=.0675 specifies a value to use for the scaling parameter in the RBF kernel.
--------------
Cross Validation:
The cross validation process guards against over-fitting the training set, and approximates the performance of a predictive model on an unknown evaluation set, by effectively training and evaluating the training set on itself. Cross validation randomly separates the training set into v subsets of equal size, each of which sequentially serves as a pseudo-evaluation set for the predictive model created from the training galaxies in the remaining v-1 subsets.
It should be noted that we use 2-fold CV as the default in order to obtain a distribution of class values in the v subsets that is sufficiently representative of the class distribution of the training set. With a training set of 700 galaxies spanning 17 classes, we found that a higher number of folds performed poorly: separating the training set into more folds tended to distribute the galaxies from each class unevenly, and even omitted some class values entirely for extreme cases of 5+ folds. However, for larger training sets and smaller numbers of classes, we recommend testing CV with a higher number of folds.
Upon completion of CV, the RMS error, outlier count, and class prediction accuracy are calculated and printed. The C and gamma values used during the CV process are printed as well.
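The v-fold splitting described above can be sketched in a few lines. This is plain NumPy for illustration, not SPIDERz code; vfold_indices is a hypothetical helper name.

```python
import numpy as np

# Shuffle the training indices, cut them into v nearly equal folds, and
# let each fold serve once as the pseudo-evaluation set.
def vfold_indices(n_galaxies, v=2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_galaxies)
    return np.array_split(idx, v)

folds = vfold_indices(10, v=2)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # ...train on train_idx, evaluate on test_idx, accumulate the RMS error...
```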
--------------
Grid Search:
We provide the option of performing a parameter grid search to obtain optimal values of C and gamma for a given training set. The grid search process performs v-fold cross validation for every possible combination of C and gamma within a specified range and with a specified step size. The particular values of C and gamma that result in the lowest RMS error among all iterations are preserved and used to subsequently train the entire training set. To expedite the runtime of the parameter grid search process while still searching over a wide range of possible values for C and gamma, we perform an exponential search and allow the range of potential values, as well as the step size of each iteration, to be easily modified in gridsearch.pro. We recommend, especially for large training sets, that users first perform a broad search over potential C and gamma values with a large step size (1 or greater) in order to obtain a rough estimate of the optimal ranges for each parameter, and then perform subsequent grid searches over a narrower range of parameter values with a finer step size.
If a grid search is specified in the command line call to main.pro, the default action upon completion is to perform training and evaluation with the optimized C and gamma values. However if store_model = 1 is specified in the command line call, only the training process is conducted and the predictive model is saved.
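The exponential search described above can be sketched schematically as follows. This is not the gridsearch.pro code: cv_rms_error is a stand-in for a real cross-validation run, and the ranges are illustrative.

```python
import itertools

# Toy score standing in for a CV run; its minimum sits at C = 32768,
# gamma = 0.0625, purely for demonstration.
def cv_rms_error(C, gamma):
    return (C - 32768) ** 2 + (gamma - 0.0625) ** 2

# Step C and gamma in powers of two, score each pair by cross
# validation, and keep the pair with the lowest RMS error.
best = None
for log2_C, log2_gamma in itertools.product(range(5, 17), range(-15, 4)):
    C, gamma = 2.0 ** log2_C, 2.0 ** log2_gamma
    rms = cv_rms_error(C, gamma)
    if best is None or rms < best[0]:
        best = (rms, C, gamma)
```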
--------------
Advice on preparing the training and evaluation sets:

It is often helpful to scale parameters such as band magnitudes to lie between 0 and 1 (or other limits) in order to reduce the effect of anomalous values and to treat parameters on an equal footing regardless of their natural ranges.  It is very important to scale the training and evaluation sets in the same way: scaling them differently, especially for small data sets or those with a large disparity between the sizes of different galaxy parameter values, results in a disproportionate re-sizing of parameter magnitudes when normalized between 0 and 1. 
For scaling training and evaluation sets we recommend the publicly available routine cgscalevector.pro by David Fanning.
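A minimal sketch of consistent 0-1 scaling (in NumPy for illustration; cgscalevector.pro fills this role in IDL): the min/max come from the training set only and are reused unchanged on the evaluation set.

```python
import numpy as np

def fit_scale(column):
    # Record the limits of the TRAINING data only.
    return column.min(), column.max()

def apply_scale(column, lo, hi):
    # Map values onto [0, 1] using the recorded training limits.
    return (column - lo) / (hi - lo)

train_mag = np.array([18.0, 20.0, 22.0])
eval_mag = np.array([19.0, 21.0])
lo, hi = fit_scale(train_mag)
train_scaled = apply_scale(train_mag, lo, hi)
eval_scaled = apply_scale(eval_mag, lo, hi)   # same lo/hi as training
```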

Note that the determination of the Platt probability distribution function (option '2' for the pdf in the main.pro call) has limited accuracy unless the training set is balanced across classes (classes here being redshift bins), meaning that there are roughly similar numbers of training set galaxies in each redshift bin.  Galaxy data sets are typically distributed very unevenly in redshift.  For determination of the Platt probability distribution a training set should be constructed that has roughly similar numbers of galaxies in each redshift bin.
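One rough way to construct such a balanced training set is to downsample each redshift bin to the size of the smallest occupied bin. The sketch below is illustrative NumPy, not SPIDERz code, and balance_by_bin is a hypothetical helper name.

```python
import numpy as np

def balance_by_bin(spec_z, binsize=0.1, seed=0):
    # Assign each galaxy to a redshift bin, then keep an equal random
    # subsample from every occupied bin.
    rng = np.random.default_rng(seed)
    bins = (spec_z // binsize).astype(int)
    groups = {b: np.flatnonzero(bins == b) for b in np.unique(bins)}
    n_min = min(len(idx) for idx in groups.values())
    keep = [rng.choice(idx, n_min, replace=False) for idx in groups.values()]
    return np.sort(np.concatenate(keep))   # indices of the balanced subset

spec_z = np.array([0.05, 0.06, 0.07, 0.15, 0.16, 0.25])
kept = balance_by_bin(spec_z)   # equal numbers survive in each bin
```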

--------------
Advice on randomizing training and evaluation sets:
For the purposes of evaluating photo-z accuracy on data sets of known redshift, galaxies are often randomly placed into training and evaluation sets, for a number of reasons including to ensure that both are representative of the redshift distribution of the overall set.  We include a routine randomize.pro which randomly divides a galaxy dataset into training and evaluation sets, with specific instructions in the header of that file.
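The random division performed by randomize.pro amounts to something like the following (a NumPy stand-in for illustration; see the header of randomize.pro for the actual usage):

```python
import numpy as np

def random_split(n_galaxies, train_fraction=0.5, seed=0):
    # Shuffle the galaxy indices and cut the catalog in two, so both
    # pieces sample the same underlying redshift distribution.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_galaxies)
    n_train = int(train_fraction * n_galaxies)
    return idx[:n_train], idx[n_train:]

train_idx, eval_idx = random_split(100, train_fraction=0.7)
```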

Source: readme.txt, updated 2017-03-09