Download Latest Version trinculo_v0.96.tar.gz (438.7 kB)
Email in envelope

Get an email when there's a new version of Trinculo

Home
Name Modified Size InfoDownloads / Week
SpeedTests 2016-01-02
README.txt 2016-02-02 24.5 kB
trinculo_v0.96.tar.gz 2016-02-02 438.7 kB
trinculo_v0.95.tar.gz 2016-01-02 435.1 kB
trinculo_v0.9.tar.gz 2015-10-14 421.2 kB
trinculo_v0.8.tar.gz 2015-06-11 293.9 kB
trinculo_v0.7.tar.gz 2015-05-28 285.1 kB
trinculo_v0.5.tar.gz 2014-06-25 238.9 kB
Totals: 8 Items   2.1 MB 0
###############################
##### trinculo v0.96, January 2016 ######
###############################

Welcome to trinculo v0.96. This program contains tools to carry out multicatagory regression analyses, including multinomial and proportional odds ordinal logistic scans, conditional multinomial analysis, Bayesian multinomial logistic regression and multinomial fine-mapping. 

Table of contents of this README.txt:

1. Installation [information on how to install or compile the code]
2. USAGE AND OPTIONS
3. EXAMPLES [Worked examples for the main trinculo binary]
4. doMS.py script [Information and examples about more detailed model selection analyses]
5. Simulating multinomial data [Information and examples on the simulation capabilities of the software]
6. OUTPUT FORMATS [Documentation for the different file formats the software produces]
7. Contact, thanks, etc

This program was written by Luke Jostins. For help, bugs, comments or suggestions, please email me at lj4@well.ox.ac.uk

## Changes since v0.95

- Started keeping track of changes between versions
- Made output precision a user-settable parameters
- Made model fitting parameters user-settable
- Added absolute tolerance halting to ordinal model fitting (to fix bug where small parameter values were mistaken for non-covergence)
- Fixed a bug in the fine-mapping script that gave nonsense results in the presence of missing data
######################
##### 1. Installation #######
######################

There are precompiled 64-bit Mac and linux binaries in the "bin" directory, ending with the suffix "_osx" and "_linux" respectively. Rename the relevant one to remove the suffix and add them to your path to install. 

If they do not run, or if you want to recompile the binaries the source code is in the "src" directory. On Mac, compile with the Accelerate framework:

>g++ -framework Accelerate -Isrc -o bin/trinculo src/trinculo.cpp

On Linux, compile with lapack and the LINUX compiler flag:

>g++ -DLINUX -Isrc -o bin/trinculo src/trinculo.cpp -llapack -lpthread

After compiling, you should run the test script from the root trinculo directory and check that everything checks out (all tests should say PASSED):

> python bin/testing.py

##############################
##### 2. USAGE AND OPTIONS #######
##############################


Usage:
 trinculo MODE  [--bfile GENOTYPES_PREFIX | --dosage GENOTYPES.DOSE ] [--out OUTPUT_PREFIX] --pheno PHENOTYPES.TXT --phenoname PHENOTYPE [--basepheno REF --missingpheno MISSING ] [--covar COVARIATES.TXT --normalize] [--condition SNPS.TXT] [--priors PRIORS.TXT | --defaultprior --priorsigma PRIORVAR --priorcor PRIORCOR --covarsigma PRIORVAR | --empiricalprior  ] [ --select --phenops --errormat --fullmodel --waldstats ] [ --noassoc ] [--threads n] [ --precision DIGITS ] [ --fitIter NITER --fitAbsTol ABSTOL --fitRelTol RELTOL ] [--help]


Options:

MODE	- Either multinom or ordinal for multinomial or ordinal logistic regression

--bfile FILE_PREFIX	- Genotype input in binary plink format (FILE_PREFIX.bed, FILE_PREFIX.bim and FILE_PREFIX.fam)
--dosage FILE	- Genotype input in dosage format
--out PREFIX	- The prefix used for output files [default is "trinculo"]

--pheno FILE	- Phenotype input (same format as plink)--phenoname STRING	-The phenotype in the phenotype file to analyse
--basepheno STRING	- The reference catagory to use for the multinomial analysis [defaults to first catagory encountered]
--missingpheno STRING	- The symbol to be use to indicate a missing phenotype in the phenotype file

--condition FILE	- A file containing a list of SNPs (one per line) to conditon on (i.e. include as covariates)

--covar FILE	- A file containing covariates to condition on (same format as plink)
--normalize	- Normalizes covariates to have a variance of 1 [off by default]

--priors FILE	- A prior covariance matrix on effect sizes
--defaultpriors	- Construct a prior matrix based on default values
--priorsigma VALUE	- Set the prior variance on effect sizes within a disease[default is 0.04]
--priorcor VALUE	- Set the prior correlation on effect sizes across diseases [default is 0]
--covarsigma VALUE	- Prior on the covariate effect size [Default is 1]
--empiricalprior	 - Calculate an empirical prior directly from the data
--noassoc	-do not do any association testing, just infer the prior

--select	-Do Bayesian model selection (i.e. calculate marginal likelihoods for each sharing model)
--phenops	-Do a likelihood ratio test on each phenotype individually
--waldstats	-Calculate standard errors, Z scores and Wald p-values
--errormat	-Output the variance-covariance matrix for the log odds ratio
--fullmodel	-Output all parameter estimates and the full error matrix

--threads n	Use up to n cores

--precision DIGITS	Print output to DIGITS significant figures

--fitIter NITER	Try up to NITER iterations when fitting models
--fitAbsTol ABSTOL	Terminate model fitting if results change by less than ABSTOL
--fitRelTol RELTOL	Terminate model fitting if results change by less than (RELTOL x value)

--help	- Prints this message


Note on the prior covariance matrix: The ordering of the phenotypes (i.e. which phenotype each row/column in the prior covariance matix corresponds to) is the order that they are first encountered in the phenotype file (excluding the basepheno, if set). If in doubt, run a frequentist association test first, see what order the odds ratios are given in in the output file and use that ordering in the prior covariance matrix. I will come up with a better way of dealing with this soon.


#######################
##### 3. EXAMPLES #########
#######################

All example files can be found in the "examples" directory of the source code.

## Example1: run a normal multinomial scan on plink files##

To run a straight-forward multinomial scan, use the command:

> trinculo multinom --bfile genotypes --pheno phenos.txt --phenoname Pheno --basepheno Control --out example1

This reads in phenotype catagories from the column headered "Pheno" in phenos.txt, as well as genotypes in plink format from genotypes.bed, genotypes.bim and genotypes.fam. Output example1.assoc.multinom contains one line for each SNP, with estimated odds ratios for each phenotype, and overall likelihood and LRT p-values. Note that it also outputs a log file, "example1.log". 


## Example 2: run a multinomial scan on dosage files ##

We can also read in dosages, rather than plink files:

> trinculo multinom --dosage genotypes.dose --pheno phenos.txt --phenoname Pheno --basepheno Control --out example2



## Example 3: run a multinomial scan with covariates##

> trinculo multinom --bfile genotypes --pheno phenos.txt --phenoname Pheno --basepheno Control --covar pcs.txt --out example3

Reads in plink and phenotype files as above, but this time covariates are read in from pcs.txt, and included in the regression model to correct for population stratification. Notice that SNP4 in example3.assoc.multinom no longer shows evidence of association, suggesting that the p-value in example1 was driven by population stratification.


## Example 4: test conditional association ##

The output from the previous example showed a number of SNPs with small p-values close by. We can test to see whether these signals are independent by conditioning on the most significant SNP by creating a condition.txt file and reading it in along with the rest of the data: 

> echo "SNP1" > condition.txt
> trinculo multinom --bfile genotypes --pheno phenos.txt --phenoname Pheno --basepheno Control --covar pcs.txt --condition condition.txt --out example4

Notice that the p-values for SNP2 and SNP3 in example4.assoc.multinom are now non-significant, suggesting that signals in this data were driven by a combination of SNP1 and population stratification. 


## Example 5: do a Bayesian multinomial scan ##
   
Instead of doing a frequentist likelihood ratio test, we can instead place priors on the effect sizes are use these to calculate Bayes factors:

> trinculo multinom --bfile genotypes --pheno phenos.txt --phenoname Pheno --basepheno Control --covar pcs.txt --normalize --defaultprior --out example5

Note that example5.assoc.multinom has log Bayes factors, rather than p-values, for each SNP.

The default priors assume effect sizes are a-priori independent with variances of 0.04 (equivilent to the SNPtest Bayesian prior of sigma=0.2). You can change the priors using --priorsigma and --priorcor (for the variance and correlation in effect sizes), or by setting a more complex prior using a text file:

> trinculo multinom --bfile genotypes --pheno phenos.txt --phenoname Pheno --basepheno Control --covar pcs.txt --normalize --priors priors.txt --out example5 

Here we also use --normalize to ensure that the covariates have a variance of 1: there is also a prior on the covariates, and this is dependent on the variance of the covariates, so normalizing ensures that the prior behaves as it should. Use the option --covarsigma if you want to set the covariate prior yourself.


## Example 6: do Bayesian multinomial fine-mapping ##

The output of the Bayesian multinomial regression in example5 can be used to carry out fine-mapping (looking at only the first three variants, which are in the same locus):

> head -4 example5.assoc.multinom  > example5_chr1.assoc.multinom 
> doFM.py example5_chr1.assoc.multinom > example6.finemap

The fine-mapping file is exactly the same as the multinomial regression output, except that it has a final column, "P_CAUSAL", that gives the probability that each SNP is the causal variant (on the assumption that exactly one causal variant exists in the file). In our case, example6.finemap shows that SNP1 is almost certainly the causal variant, as it has a P_CAUSAL of nearly 1.

NOTE: The fine-mapping assumes that there is equal data available on each SNP (i.e. there is very little missing data). It will still give results if there is missing data, but will tend to give SNPs with missing data less weight.  

## Example 7: do Bayesian model selection ##

Trinculo can search through different genetic models of a trait to pick out the best one:

> trinculo multinom --bfile genotypes --pheno phenos.txt --phenoname Pheno --basepheno Control --covar pcs.txt --normalize --defaultprior --select --out example7

As well as the usual association file, this also generates a file called example7.select.multinom. This has the marginal likelihood for each model as a seperate columns - smaller numbers are better. These results can be proccessed into a number of more easily intepretable forms using the "doMS.py" script, which is described in the next section.

Notice that these results can be dependent on the prior that you choose - try running the same test with --priorcor 0.9 (i.e. assuming that the effect sizes are highly correlated between phenotypes).

If you have a larger number of variants (above 25 or so) trinculo can learn an empirical prior directly from the data. The result will be somewhat inaccurate in thise case (as we only have 5 SNPs), but you can try anyway:

> trinculo multinom --bfile genotypes --pheno phenos.txt --phenoname Pheno --basepheno Control --covar pcs.txt --normalize --empiricalprior --select --out example7.2

This will write "example7.2.prior", which will contain the estimated empiricial prior, as well as "example7.2.select.multinom", with the marginal likelihoods for each model.

## Example 8: run an orderinal logit scan ##

As well as a multinomial regression, we can also run a more restricted form of regression that assumes that phenotypes are ordered.

> trinculo ordinal --bfile genotypes --pheno phenos.txt --phenoname Order --covar pcs.txt --out example8

Reads in ordered phenotypes with column headering "Order" in phenos.txt, and runs a proportional odds ordinal regression. Note that the Order column only contains integers 0 to 2 - ordinal regressions should only be carried out on integers 0 to N-1, where N is the number of catagories. Output example7.assoc.ordinal contains the estaimted proportional odds, as well as likelihoods and LRT p-values.


#######################
##### 4. doMS.py script #########
#######################

In the previous section, we showed how Trinculo can generate marginal log-likelihoods for each sharing model

> trinculo multinom --bfile genotypes --pheno phenos.txt --phenoname Pheno --basepheno Control --covar pcs.txt --normalize --defaultprior --select --out example7

The file this produces, example7.select.multinom, contains a column for each possible sharing model: "M_Disease1" indicates the model where the SNP is association with Disease1 but not Disease2, "M_Disease2_Disease1" indicates a model where the SNP is associated to both phenotypes. This generalizes to arbitrarily numbers of phenotypes (e.g. "M_A_B_D" indiates a model where the SNP is associated with phenotypes A, B and D but not any of the other phenotypes analysed). "NULL" is the null model, where the SNP does not effect any of the phenotypes.

The marginal likilihood alone is hard to interpret. We provide a processing script, "doMS.py", which carries out more detailed model selection analyses based on the marginal likelihoods. The 

## Mode 1: Bayes factor normalization ##

The most simple mode (mode 1, the default mode) merely normalizes the marginal likihoods by the marginal likelihood of the NULL model, so each column becomes the log Bayes factor relative to the NULL model:

> doMS.py --mode 1 example7.select.multinom > example7.ms.BFs

Looking at this example7.ms.BFs file, you can see that the Bayes factor for the null model is 0 for every SNP. For SNP1, the Bayes factors are large and positive for M_Disease1 and M_Disease2_Disease1 (indicating that both of these models are better than the null model), but negative for M_Disease2 (sharing that this model is worse than the null). The best model is a sharing model  

## Mode 2: Posterior normalization ##

The second mode calculates posteriors on each model (the probability that each model is the correct one, given the data and the prior). To do this, we must specify a prior on models (i.e. a prior probability that each sharing model is true). The simplest prior (which is also the default) is a uniform prior on models, where each sharing model is given the same probabilty, specified with the flag --flatmodelprior:

> doMS.py --mode 2 --flatmodelprior example7.select.multinom > example7.ms.posteriors.1

The posterior probabilities can be read off from the columns in example7.ms.posteriors.1 - for SNP1, the best model (M_Disease2_Disease1) has a posterior of 85.2%. 

One thing to be aware of is that there is no "right" choice of prior. The uniform prior on models may seem sensible, but it has some strange properties. Notably, it predicts that the number of phenotypes that each SNP is associated with is binomially distributed under the prior, with a mean half the number of phenotypes. We might instead rather that the prior was uniform on the number of associated phenotypes (so the sum of the priors on models that are associated with 3 phenotypes is the same as the number associated with 5, for example). We can use this prior by selecting --flatphenoprior

> doMS.py --mode 2 --flatphenoprior example7.select.multinom > example7.ms.posteriors.2

This prior favours sharing more strongly, and as a result gives the posterior for the shared model on the same SNP as 98.9%, rather than 85.2% for the flat model prior. 

Finally, if you have some reason to favour certain models over others, you can give a user-specified file with a prior probability  for each model(column 1 is the model name, and column 2 is the value of the prior). Specify this with the --priorfile command:

> doMS.py --mode 2 --priorfile model_priors.txt example7.select.multinom > example7.ms.posteriors.3

This prior gives the sharing model a posterior of 98.5% for SNP1. Note that each prior gives a slightly different posterior, and there is no "right answer" in this case. You should try out different priors and see how many of your results change - results that are similar between . For instance, in this case each prior agrees that SNP1 is most likely to be shared across the two diseases (though they disagree on the strength of the evidence). However, for SNP5, two of the priors give the highest posterior to the sharing model, whereas the flat model prior gives a (slightly) higher posterior to the disease1-specific model.

## Mode 3: Phenotype specificity report ##

The third mode simplified the models to pick the best disease-specific model, the best sharing model, and gives the Bayes factor between the two. 

> doMS.py --mode 3 example7.select.multinom > example7.ms.specificity

For SNP1, you can see strong evidence in favour of sharing (i.e. a negative log-Bayes factor of specificity). For SNP5, you can see weak evidence (logBF of 0.1) in favour of the SNP being specific to Disease2. 

## Mode 4: Credible set report ##

The fourth and final mode takes in a model prior and generates credible sets i.e. sets of models that include 95% of the posterior, or, to put in another way, the smallest set of models that has at least a 95% chance of including the true model. 

> doMS.py --mode 4 --flatphenoprior example7.select.multinom > example7.ms.credible

For SNP1, the output file shows that the most model is the sharing model, that this model has a posterior of 98.9%, and the credible set includes just this model. For SNP2, SNP3 and SNP4 the credible sets are larger, and include the NULL model (i.e. the plausible set of models includes those that have no associations at all). SNP5 shows another well-resolved model, where 97.9% of the posterior is on one model (but again, try changing the prior and seeing how it impacts the report...).


#######################
##### 5. Simulating multinomial data #########
#######################

You can compile the simulation binary from the trinculo root directory using the command:

For OSX: > g++ -framework Accelerate -Isrc/ -o bin/sims src/sims.cpp 
For Linux: > g++ -DLINUX -std=c++0x -Isrc -o bin/sims src/sims.cpp -llapack

The format to use the command is:

> sims configfile.txt outprefix

The simulations read in a config file that specifies the parameters to use for the simulations, and outputs the simulated data in binary plink format to outprefix.bed/outprefix.bim/outprefix.fam. 

Two example config files, with comments explaining the settings, are included in the example directory. 

Example 1 (simconfig1.txt): generates data on 100 variants from four phenotypes (2000 samples each), plus controls (2000 samples), all with an allele frequency of 0.5 and odds ratios of 1.2 for two of the phenotypes (with no association for the other two). It also generates 5 confounders that are associated with both genotype and phenotype. Run this example, and analyse the results with a frequentist assocation test, using:

> sims simconfig1.txt sims1
> trinculo multinom --bfile sims1 --pheno sims1.pheno --phenoname PHENO --covar sims1.covar --out sims1

You could then ask, for example, what power you have to detect this association at genome-wide significance (p < 5*10^-8). This should be about 50%.

Example 2 (simconfig2.txt): generates data on 100 variants from thre phenotypes (1000 samples each), plus controls (4000 samples), all with an allele frequency of 0.5 and odds ratios drawn from a covariance matrix read in from the file "CovFile.txt". Run this example, and analyse the results with a Bayesian assocation test using the same covariance prior, using:

> sims simconfig2.txt sims2
> trinculo multinom --bfile sims2 --pheno sims2.pheno --phenoname PHENO --priors CovFile.txt --out sims2

You can then ask, for example, what power you have to detect an association drawn from this covariance prior with a log Bayes factor greater than 14 (it should be about 50%).



##############################
##### 6. OUTPUT FORMATS #######
##############################

Trinculo outputs various files that include information on each SNP. Note that the precise columns present in these output files depend on the analysis performed and the input data. This section describes the meaning of each column in the various files output.

##Common columns across all formats
CHROM - chromosome of variant [Only present if run on binary plink input]
POS - position of variant [Only present if run on binary plink input]
RSID - The identifier for this variant
A1 - The first allele [Only present if run on binary plink input]
A2 - The second allele (the risk allele, for odds ratios) [Only present if run on binary plink input]

## Assocation output: PREFIX.assoc.multinom 
This file format describes the results of omnibus statistical tests (frequentist or Bayesian), reporting the evidence of association at this locus across all phenotypes.

OR_{pheno} - An odds ratio for the phenotype {pheno} for this variant (there is one for each phenotype)
L0 - The log null maximum likelihood [For Frequentist analyses only]
L1 - The log alternative maximum likelihood [For Frequentist analyses only]
PVAL - The likelihood ratio test p-value for this variant [For Frequentist analyses only]
L+P0 - The log null marginal likelihood [For Bayesian analyses only]
L+P1 - The log alternative marginal likelihood [For Bayesian analyses only]
logBF - The log Bayes factor between the null and alterntaive models for this variant [For Bayesian analyses only]
P_CAUSAL - The probability that this variant is the causal variant at this locus [For fine-mapping analyses only]

## Model selection output: PREFIX.select.multinom 
The file format describes the results of a model selection analysis, reporting the marginal likelihood for each sharing model.

NULL - The marginal likelihood of the null model (i.e. the model with no associations to any phenotype categories)
M_{model} - The marginal likelihood for a sharing {model}. The model name consists of all of the phenotypes included in this sharing model, seperated by an underscore ("_"). 

## Wald stats/ per-phenotype LRT outputs: PREFIX.ses.multinom or PREFIX.phenop.multinom 
This file format reports the frequentist evidence of association to each phenotype individually (either by rapidly-calculate Wald statistics from the full model fit, or a slower likelihood ratio test for each individual phenotype).

SE_{pheno} - The standard error on the log-odds ratio for phenotype {pheno} [Only present for Wald tests]
Z_{pheno} - The Wald statistic (Z-score) for phenotype {pheno} [Only present for Wald tests]
P_{pheno} - The p-value of association for phenotype {pheno}

##Error covariance matrix output - PREFIX.err.multinom
The variance-covariance error matrix on the log-odds ratios across all phenotypes

{pheno1}_{pheno2} - the entry of the error matrix for phenotypes {pheno1} and {pheno2}

## Full model output - PREFIX.fullmd.errs and PREFIX>fullmd.pars
This stores the full model fit (parameters and error covariance matrix) for each variant.

{pheno}_intercept - the intercept value for phenotype {pheno} [in pars file]
{pheno}_geno - the per-genotype-dose increase in odds ratio phenotype {pheno} [in pars file]
{pheno}_{interept/geno}_{pheno}_{interept/geno} - the entries of the error variance-covariance matrix for two parameters. [in errs file]


#############################
##### 7. Contact, thanks, etc #######
#############################

Trinculo was written by Luke Jostins, with help and input from Isabelle Cleynen, Jeff Barrett, Gil McVean, Hailiang Huang, Mark Daly, Yang Luo and Jessie Li.

Contact: For help, bugs, suggestions and the like, email Luke Jostins <lj4@well.ox.ac.uk>

Source code is available from http://sourceforge.net/projects/trinculo/

Licensed under GPL3:

 trinculo, Flexible tools for multinomial association
  Copyright (C) 2015  Luke Jostins

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.*/
Source: README.txt, updated 2016-02-02