Welcome to MOSGWA

MOSGWA is a tool meant to be operated from the Linux command line.

In the top directory, you see:
README.txt	This overview information
INSTALL.txt	Build instructions
CHANGES.txt	Change log
COPYING.txt	GNU General Public License, which applies to this software
CMakeLists.txt	Top level configuration file for build with the cmake tool
src		Contains the C++ source and header files and a suitable makefile

Installation:
Follow the steps described in INSTALL.txt.

Running:
MOSGWA takes its configuration from files given on the command line.
You run MOSGWA with the command syntax:

MOSGWA config_file_name[s]

Config files look similar to Windows INI files.
They name the files to be used and set any parameters of the search strategy
for which the default values are not suitable.
The following is an example:

[input]
plink_files = "random"
[data]
trait_index = 0
[output]
files = "random_out"
[single_marker]
test = cochran_armitage
[model_preselection]
mBIC_expected_causal_SNPs = 25
[model_selection]
selection_criterium = mBIC2
regression_type = firth
fast_multi_forward = false

The file is divided into sections, each introduced by a section heading enclosed in square brackets [].
Within each section, the names of the parameters are unique.
Parameters are set with an equals sign.

MOSGWA currently uses four types of parameters.
boolean (true or false)
integer (e.g. 0, 1, 2)
floating point (e.g. -9.3e3)
string (e.g. "random_out")
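
The four value types above can be told apart mechanically. The following Python sketch is not part of MOSGWA; it assumes that a config file is simple enough to be read by Python's standard configparser module, and the helper names parse_value and load_config are made up for this illustration. Bare words such as mBIC2 are kept as plain strings.

```python
import configparser

# Shortened version of the example configuration shown earlier.
EXAMPLE = """\
[input]
plink_files = "random"
[data]
trait_index = 0
[model_selection]
selection_criterium = mBIC2
fast_multi_forward = false
"""


def parse_value(raw):
    """Convert a raw config value to bool, int, float or str (hypothetical helper)."""
    text = raw.strip()
    if text in ("true", "false"):
        return text == "true"
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]  # quoted string: drop the quotes
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text  # bare word such as a choice name, kept as a string


def load_config(text):
    """Read INI-like text into a dict of sections with typed values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(text)
    return {
        section: {key: parse_value(value) for key, value in parser.items(section)}
        for section in parser.sections()
    }


config = load_config(EXAMPLE)
print(config)
```

Setting optionxform matters here because configparser would otherwise convert parameter names such as mBIC_expected_causal_SNPs to lower case.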

The [input] section must specify where to read the data from.

plink_files = "random"

specifies that the input format is plink's binary format, and the files to read are:
random.bim			contains information about SNPs
random.fam			contains information about individuals including the phenotype for one trait
random.bed			contains the fact table of genotypes
random.cov	if existing	contains additional covariates if there are any
random.yvm	if existing	contains phenotypes for additional traits if there are any
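
Before starting a long run, it can help to check that the files derived from the prefix are in place. The following Python sketch is not part of MOSGWA; the function name plink_input_files is made up for this illustration, and the sketch only tests whether the files exist, not whether their content is valid.

```python
from pathlib import Path

REQUIRED_SUFFIXES = (".bim", ".fam", ".bed")
OPTIONAL_SUFFIXES = (".cov", ".yvm")


def plink_input_files(prefix, directory="."):
    """Return (required, optional_present, missing) paths for a plink prefix."""
    base = Path(directory)
    required = [base / (prefix + suffix) for suffix in REQUIRED_SUFFIXES]
    optional = [base / (prefix + suffix) for suffix in OPTIONAL_SUFFIXES]
    missing = [path for path in required if not path.is_file()]
    optional_present = [path for path in optional if path.is_file()]
    return required, optional_present, missing


if __name__ == "__main__":
    required, optional_present, missing = plink_input_files("random")
    for path in missing:
        print("missing required input file:", path)
    for path in optional_present:
        print("found optional input file:", path)
```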

Concerning the file formats see:
* http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed
* http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml

The [data] section

trait_index = 0

specifies that the phenotype for the first trait should be taken. In plink format, it is contained in the file with suffix .fam.

The [output] section specifies where the log and other output files will be written, using the option

files = "random_out"

This string will be used as prefix for the output filenames.

For fine-tuning model selection, the section [model_preselection] contains the option

mBIC_expected_causal_SNPs	lets the first step with the relaxed selection criterion consider models of up to about the given size.

Further useful options:

[log]
level			choice		DEBUG, INFO (default), WARNING, ERROR: how much to log

[input]
cache_limit		size_type	genotype data for how many SNPs may be cached in memory, default 1024

[output]
correlation_threshold	double		restricts the listing of closely correlated SNPs, default 0.999

[single_marker]
test			choice		chi_square, cochran_armitage ... which one to use

[memetic_algorithm]
modelsNo		size_type	memetic population size, how many models are held as memetic pool, default 10
fastForwardModelsNo	size_type	initial population using Artur's exclusived fast forward algorithm, must be less than or equal to modelsNo-1, default 5
tournamentSize		size_type	cardinality of the random subset chosen from which the best one is taken as winner, default 2
pCross			double		probability of memetic crossing-over, default 0.9
pMutation		double		probability of memetic mutation, default 0.05
pLocalExchangeTrial	double		probability for each neighbouring SNP in a local improvement step to be tested, default 1.0 meaning exhaustive
maxNoProgressIter	size_type	count limit for child generation events without noteworthy improvement, default 5000
B			size_type	if child is among B best (noteworthy improvement), reset MA counter, default 10
initialModelHaircut	size_type	by how many SNPs a model in the initial memetic population may exceed the model from the greedy algorithm, default 2
multi_forward_step_max	size_type	in memetic algorithm, bounds the number of SNPs in the initial fast forward generated memetic sub-population; default 1
maxPoolSize		size_type	maximum number of pairwise different models to be calculated; the search exits when this is reached (not to be confused with the memetic population size; variable name to be improved), default 200000
npProbabilityFunction	expression	the function of SNP count n and P value p to influence the probability of a SNP to end up in a random initial memetic population member, default 1/p
heritability_confidence_threshold	double	for calculating confidence interval, default 0.95
populationSnapshotInterval	size_type	log a snapshot of the current population periodically after n iterations unless n=0, in which case no snapshots are logged (default: 0)
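
The description of tournamentSize matches the usual tournament selection scheme: draw a random subset of the population and keep its best member. The following Python sketch shows that generic scheme only; it is not taken from the MOSGWA sources. The function name, the representation of models as tuples and the scores are made up for this illustration, and it assumes that a lower criterion value means a better model.

```python
import random


def tournament_select(population, score, tournament_size=2, rng=random):
    """Draw tournament_size distinct models at random and return the best one.

    Assumption: lower score is better, as is usual for BIC-type criteria.
    """
    contestants = rng.sample(population, tournament_size)
    return min(contestants, key=score)


# Hypothetical population: each model is a tuple of SNP indices,
# with a made-up criterion value per model.
scores = {(1, 5): 310.2, (2,): 305.7, (1, 2, 9): 299.4, (7,): 321.0}
population = list(scores)

winner = tournament_select(population, scores.get, tournament_size=2)
print("selected model:", winner)
```

With the default tournamentSize of 2, weaker models still have a fair chance of being chosen; larger values make the selection greedier.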

[model_preselection]
mBIC_expected_causal_SNPs	integer		parameter for first round of model search (with mBIC), which determines the starting point for the actual model search (default: 60)

[model_selection]
regression_type		choice		linear (default), firth: to use for calculating the model selection criterion
selection_criterium	choice		BIC, EBIC, mBIC, mBIC2 (default), AIC, mAIC, mAIC2: criterion to use in second round of model selection
mBIC_expected_causal_SNPs	integer		parameter for second round of model search, when mBIC is chosen, irrelevant for mBIC2 (default: 4)
EBIC_gamma		double		parameter in the EBIC criterion, defaults to 1 - log( #individuals ) / ( 2 * log( #SNPs ) )
mAICx_C			double		parameter in the AIC, mAIC and mAIC2 criteria, defaults to 2.0
mBIC2_E			double		parameter in the mBIC2 criterion in second round of model selection, defaults to 4.0
search_strategy		choice		greedy (default), memetic_algorithm
maximalModelSize	size_type	limits the search to models of up to the given size; saves time (default 35)
PValueBorder		size_type	only this many SNPs, ranked by p-value, are considered in multi-forward steps
forward_step_max	size_type	bounds the number of SNPs in the forward step from the empty model; default 1
fast_multi_forward	boolean		whether to use fast multiple forward instead of normal multiple forward
					(that means: within multi-forward step use linear regression regardless of regression_type)
nSNPKriterium		integer		useful when running with a subset of top-ranking SNPs: the number of SNPs in the original data set, to be used as #SNPs by the selection criteria; not used if 0, which is the default
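
As a worked example of the default stated for EBIC_gamma, the following Python sketch evaluates 1 - log( #individuals ) / ( 2 * log( #SNPs ) ). It is not part of MOSGWA; the function name and the sample sizes are made up for this illustration.

```python
import math


def default_ebic_gamma(n_individuals, n_snps):
    """Evaluate 1 - log(#individuals) / (2 * log(#SNPs)); requires n_snps > 1."""
    return 1.0 - math.log(n_individuals) / (2.0 * math.log(n_snps))


# Hypothetical data set sizes, chosen only for illustration.
for n_individuals, n_snps in [(100, 10_000), (1_000, 500_000)]:
    gamma = default_ebic_gamma(n_individuals, n_snps)
    print(n_individuals, n_snps, round(gamma, 4))
```

Since the formula is a ratio of logarithms, the base of the logarithm does not affect the result.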

After a successful run, you will find the following files (assuming the output filename prefix "random_out"):

random_out_IT.txt			results from individual SNP tests
random_outYvecout			states the phenotype vector used
random_out.mod				describes the chosen model
random_out.log				log file from the search
random_out_0the_result_Corr.txt		information about SNPs which are highly correlated to those in the model
random_out0the_resultCorr.h5		similar, but in HDF5 format
Source: README.txt, updated 2018-07-13