StabLe : An algorithm for learning stable graphical models
1. You will find the following 4 folders with this software:
data - this folder contains adjacency matrices for 5 networks:
Alarm1_graph.csv, Barley_graph.csv, Child_graph.csv, Insurance_graph.csv and Mildew_graph.csv. These were downloaded from the discovery lab (see the link cited in the manuscript).
This folder also contains the file EMVar100 with processed HapMap3 gene expression data for the top 100 ranked genes (see the manuscript for the ranking procedure). For the original data, please use the ArrayExpress link cited in the paper.
source - this folder contains the source code and the makefile for compiling the code.
exe - this folder will contain the executable for StabLe upon compilation
output - this folder will contain the output files
2. What you need to compile and run this software:
g++, the GNU C++ compiler. We have tested this code on version 4.2.1
Eigen - a C++ library for linear algebra. A simple way of using Eigen with StabLe is the following:
Download a copy from eigen.tuxfamily.org
Untar the archive and copy the subfolder "Eigen" into the "source" subfolder of StabLe. You can now compile using the makefile in the source folder.
3. Compiling the code:
cd to the source directory, type
> make all
and press enter.
5. You will be asked to provide the following options upon execution:
MANDATORY INPUT :
- I1. Job name (any string to identify your job.) :
provide a name <job> to identify the output files; the string <job> is appended to the name of each output file.
- I2. Use Minimum dispersion criterion (enter 'y') or Ordinary least squares (enter 'n') for structure learning
- I3. True network known (enter 'y') or unknown (enter 'n')
- I4. Input data matrix provided (enter 'y') or use simulated data (enter 'n')
- I5. Perform bootstrap analysis (enter 'y') or cross-validation (enter 'n')
- I6. Number of variables :
provide the number of variables for the graphical model
- I7. Number of samples
provide the number of samples (simulated or user input)
- I8. Num random restarts ?
(provide the number of random restarts for searching the DAG space; we used 10 in the manuscript)
- I9. Num Bootstrap replicates/ Crossvalidation folds ?
(provide the number of bootstrap replicates/cross-validation folds you want to include. We used 100 replicates and 10 cv folds in the manuscript)
If you answered 'n' to option I4, you will be asked to provide the following simulation parameters:
- ISim1. Signal to noise ratio (rho) that measures the relative strength of the regression coefficients (we used 1 in the manuscript)
- ISim2. Shape parameter (0<alpha <=2) for stable noise
- ISim3. Skewness (-1<=beta<=1)
- ISim4. Dispersion (gamma>0)
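To give a feel for what alpha, beta and gamma control, here is a minimal Python sketch that draws stable noise with the standard Chambers-Mallows-Stuck method. It is not taken from the StabLe source. The function name sample_stable is hypothetical, and the code assumes that the dispersion gamma multiplies |t|^alpha in the characteristic function, so that the scale factor is gamma^(1/alpha); the parameterization used by StabLe may differ, so please check the manuscript.

```python
import math
import random


def sample_stable(alpha, beta, gamma, n, seed=None):
    """Draw n zero-location alpha-stable samples (Chambers-Mallows-Stuck).

    Assumption (not confirmed by the StabLe source): the dispersion gamma
    multiplies |t|**alpha in the characteristic function, so the scale
    factor applied to a standard sample is gamma**(1/alpha).
    """
    if not (0 < alpha <= 2 and -1 <= beta <= 1 and gamma > 0):
        raise ValueError("need 0 < alpha <= 2, -1 <= beta <= 1, gamma > 0")
    rng = random.Random(seed)
    scale = gamma ** (1.0 / alpha)
    samples = []
    for _ in range(n):
        v = rng.uniform(-math.pi / 2, math.pi / 2)  # uniform angle
        w = rng.expovariate(1.0)                    # unit exponential
        if alpha != 1:
            t = beta * math.tan(math.pi * alpha / 2)
            b = math.atan(t) / alpha
            s = (1 + t * t) ** (1 / (2 * alpha))
            x = (s * math.sin(alpha * (v + b)) / math.cos(v) ** (1 / alpha)
                 * (math.cos(v - alpha * (v + b)) / w) ** ((1 - alpha) / alpha))
            samples.append(scale * x)
        else:
            h = math.pi / 2 + beta * v
            x = (2 / math.pi) * (
                h * math.tan(v)
                - beta * math.log((math.pi / 2) * w * math.cos(v) / h))
            # For alpha = 1, rescaling a skewed variable also shifts it.
            shift = (2 / math.pi) * beta * scale * math.log(scale)
            samples.append(scale * x + shift)
    return samples
```

With alpha = 2 the samples are Gaussian with variance 2*gamma and beta has no effect; with alpha = 1 and beta = 0 they are Cauchy. Smaller alpha gives heavier tails, and very small alpha can overflow in this simple version.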
If you answered 'y' to option I3, you will be asked to provide the path to the adjacency matrix file.
- INet. Adjacency Matrix file name :
(provide a relative path and file name, for example ../data/Alarm1_graph.csv
See below for formatting instructions).
If you answered 'y' to option I4, you will be asked to provide the path to the data matrix file.
- IData. Data Matrix file name :
(provide a relative path and file name, for example ../data/EMVar100
See below for formatting instructions).
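If you want to keep a record of the answers you gave for a run, the following Python sketch assembles them in the order listed above. The function build_answers and its argument names are hypothetical. It assumes that the conditional prompts (ISim, INet, IData) follow I1 to I9 in the order they are described here, and it only builds the text; whether StabLe accepts answers from a redirected standard input has not been verified.

```python
def build_answers(job, use_min_dispersion, n_vars, n_samples, n_restarts,
                  n_replicates, bootstrap=True, sim_params=None,
                  network_file=None, data_file=None):
    """Return the prompt answers as text, one answer per line.

    sim_params is (rho, alpha, beta, gamma) for simulated data; give
    data_file instead when an input data matrix is provided (option I4).
    """
    if (sim_params is None) == (data_file is None):
        raise ValueError("give exactly one of sim_params or data_file")

    def yn(flag):
        return "y" if flag else "n"

    answers = [
        job,                            # I1: job name
        yn(use_min_dispersion),         # I2: minimum dispersion or OLS
        yn(network_file is not None),   # I3: true network known?
        yn(data_file is not None),      # I4: input data matrix provided?
        yn(bootstrap),                  # I5: bootstrap or cross-validation
        str(n_vars),                    # I6
        str(n_samples),                 # I7
        str(n_restarts),                # I8
        str(n_replicates),              # I9
    ]
    if sim_params is not None:          # ISim1 to ISim4
        answers += [str(p) for p in sim_params]
    if network_file is not None:        # INet
        answers.append(network_file)
    if data_file is not None:           # IData
        answers.append(data_file)
    return "\n".join(answers) + "\n"
```

For example, build_answers("run1", True, 37, 1000, 10, 100, sim_params=(1, 1.5, 0, 1), network_file="../data/Alarm1_graph.csv") describes a bootstrap run on simulated data with a known network.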
6. Adjacency matrix file format:
This file is a tab separated matrix with binary 0/1 entries. Column j of row i is 1 if variable i is a parent of variable j, and 0 otherwise.
You may find it helpful to look at the following files in the data folder:
Alarm1_graph.csv, Barley_graph.csv, Child_graph.csv, Insurance_graph.csv and Mildew_graph.csv
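As an illustration of the adjacency matrix format, the Python sketch below writes and reads such a file and checks that the graph is acyclic. The helper names are hypothetical and not part of StabLe. The sketch assumes the file has no header row and no row labels, which the description above suggests but does not state.

```python
import os
import tempfile


def write_adjacency(path, n_vars, edges):
    """Write an n_vars x n_vars 0/1 matrix; edges are 0-based (parent, child) pairs."""
    matrix = [[0] * n_vars for _ in range(n_vars)]
    for parent, child in edges:
        matrix[parent][child] = 1  # row = parent, column = child
    with open(path, "w") as fh:
        for row in matrix:
            fh.write("\t".join(str(x) for x in row) + "\n")
    return matrix


def read_adjacency(path):
    """Read a tab separated 0/1 matrix, skipping blank lines."""
    with open(path) as fh:
        return [[int(x) for x in line.strip().split("\t")]
                for line in fh if line.strip()]


def is_acyclic(matrix):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    n = len(matrix)
    indegree = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    visited = 0
    while ready:
        i = ready.pop()
        visited += 1
        for j in range(n):
            if matrix[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return visited == n


if __name__ == "__main__":
    # Toy chain 0 -> 1 -> 2 written to a temporary file.
    path = os.path.join(tempfile.mkdtemp(), "toy_graph.csv")
    write_adjacency(path, 3, [(0, 1), (1, 2)])
    print(is_acyclic(read_adjacency(path)))
```

Checking for cycles before a run is useful because the true network is expected to be a DAG.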
7. Data Matrix file format:
Input files should be tab delimited with the following format:
Num of variables \tab Num of samples
Var1Name \tab Var1,Sample1_Value \tab Var1,Sample2_Value \tab … (a tab separated list of the real values of Variable 1 in each sample)
Var2Name \tab Var2,Sample1_Value \tab Var2,Sample2_Value \tab … (a tab separated list of the real values of Variable 2 in each sample)
.
.
.
(One row of data entries per variable. Var1Name, Var2Name are arbitrary identifiers.)
You may find it helpful to look at the file EMVar100 in the data folder for the expression of 100 microarray probes.
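The data matrix format can be produced and checked with a few lines of Python. This is only a sketch of the layout described above; write_data_matrix and read_data_matrix are hypothetical helpers, not part of StabLe.

```python
import os
import tempfile


def write_data_matrix(path, names, rows):
    """Write one header line, then one tab delimited row per variable."""
    if len(names) != len(rows):
        raise ValueError("need one name per row")
    n_samples = len(rows[0])
    if any(len(r) != n_samples for r in rows):
        raise ValueError("all variables need the same number of samples")
    with open(path, "w") as fh:
        fh.write(f"{len(names)}\t{n_samples}\n")  # num variables, num samples
        for name, values in zip(names, rows):
            fields = [name] + [repr(float(v)) for v in values]
            fh.write("\t".join(fields) + "\n")


def read_data_matrix(path):
    """Read the file back and check it against its own header line."""
    with open(path) as fh:
        n_vars, n_samples = (int(x) for x in fh.readline().split("\t"))
        names, rows = [], []
        for line in fh:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            names.append(fields[0])
            rows.append([float(x) for x in fields[1:]])
    if len(rows) != n_vars or any(len(r) != n_samples for r in rows):
        raise ValueError("file contents do not match the header line")
    return names, rows


if __name__ == "__main__":
    # Two variables, three samples, written to a temporary file.
    path = os.path.join(tempfile.mkdtemp(), "toy_data")
    write_data_matrix(path, ["GeneA", "GeneB"], [[0.1, -1.2, 3.0], [2.5, 0.0, -0.7]])
    print(read_data_matrix(path))
```

The numbers of variables and samples on the first line should match the answers given to options I6 and I7.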
8. Output files:
After execution you will find the following output files in the output directory,
BOOTSTRAP OUTPUT FILES:
If you performed bootstrap, the following output files will be created,
EdgeConfidence<job>
This is a tab delimited file giving the evolutionary probabilities for the top ordered Variables during OBS for each bootstrap replicate using the tree based search. Each row gives the probability for a given Variable being mutated and all other Variables being normal.
TruePositives<job>
This is a tab delimited file giving the number of bootstrap replicates where each true positive edge was found for structure learning with and without the tree.
FalsePositives<job>
This is a tab delimited file giving the number of bootstrap replicates where each false positive edge was found for structure learning with and without the tree.
MeanRegressionCoefficients<job>
This is a tab delimited file giving
simulations: the bias in the mean regression coefficients for the true positives
real data: the mean regression coefficient for each edge
VarRegressionCoefficients<job>
This is a tab delimited file giving
simulations: the variance about the true regression coefficient for the true positives
real data: the variance about the mean regression coefficient for each edge
Alpha<job>
Estimated alpha for each bootstrap replicate
Theta<job>
Estimated theta for the symmetrized data matrix obtained from each bootstrap replicate. Should always be close to zero.
Gamma<job>
Estimated log-gamma for each variable and for each bootstrap replicate
CROSS-VALIDATION OUTPUT FILES:
If you performed cross-validation, the following output files will be created,
CV_Scores<job>
Test set error estimate LFLOM(T|B,p) (Figure 5B in the manuscript), averaged over the cross-validation folds
CV_ALPHA<job>
Estimated alpha for each cross-validation fold
9. Cite as:
Navodit Misra and Ercan E. Kuruoglu.
Stable Graphical Models.
arXiv:1404.4351[cs.LG], (2014).
10. Copyright:
Copyright (c) 2014 Navodit Misra.
StabLe is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
StabLe is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with StabLe. If not, see <http://www.gnu.org/licenses/>.
11. Contact:
Navodit Misra
Max Planck Institute for Molecular Genetics,
Ihnestr. 63-73, D-14195 Berlin, Germany
misra@molgen.mpg.de