README v0.4
(K.Frousios, MAY2013)

Changes in CoVEC v0.4:
  - Update VarLib::MassessSubmit to reflect server update
  - Increase the flexibility for file format recognition in VarLib::SiftResults
  - Adjustments to the collate.pl script
  - Improved library detection for collate.pl, snpsngo_submit.pl and massess_submit.pl
  - Add the new "*DAMAGING" prediction label of SIFT to the list recognized by wv.pl
  - Various minor tweaks and fixes in the documentation to reflect the above changes
  - Enable all four VarLib::*Results modules to power through some of the errors encountered 
    when parsing the predictions of variants, instead of immediately aborting execution.
  - Fixed bugs in collate.pl that caused the wrong method results to be queried.
  - Fixed bugs in collate.pl that made calls on undefined objects when data was not present from all methods.
  - Fixed collate.pl so it does not output value-less fields.
  - Fixed VarLib::MassessSubmit so that the file header is not hard-coded. That way it can keep up with
    the changes in the output format of Mutation Assessor.
  - Fixed bugs in collate.pl that caused scores of 0 to be treated as missing predictions.
  
Changes in version 0.3:
  - Updated SnpsnGoSubmit to match the migrated server.
  - Fixed some other bugs in VarLib::SnpsnGoSubmit concerning input auto-detection.

---------------------------------------------------------------------------------------------------
 
README v0.2 
(K.Frousios, JUL2012)



INTRODUCTION

CoVEC is a consensus tool for predicting the effect of coding SNPs.

Two consensus methods are supplied:
[A] wv.pl : This Perl script implements the Weighted Majority Vote method.
[B] linear-model.svm , radial-model.svm : These contain the Support Vector Machines models, to be used with SVMlight (Joachims, 1999)*.

The result is a binary classification of SNPs, based on the scores of up to 4 third-party methods: SIFT (Henikoff & Ng, 2003)**, 
PolyPhen2 (Adzhubei et al, 2010)**, SNPs&GO (Calabrese et al, 2009)** and Mutation Assessor (Reva et al, 2011)**. The consensus can
work with any subset of these 4 methods.

* SVMlight is not re-distributed with our method. Binaries and source can be easily obtained from its proper website:
http://svmlight.joachims.org/

** These four classifiers are not re-distributed with our method. Results for them must be obtained independently from their respective resources.
-> The SIFT and PolyPhen2 websites have satisfactory batch capabilities. Data can be submitted manually and results downloaded in a single text file.
   Local installation options also exist.
-> Mutation Assessor also has batch capability, but I provide a script (massess_submit.pl) to access its web API, as it may prove more convenient.
   It requires the NCBI peptide RefSeq code and the amino acid substitution.
-> SNPs&GO does not offer batch processing, except when all substitutions are on the same protein. I provide a script (snpsngo_submit.pl) to assist
   in processing batches of variants from different proteins. It requires the amino acid substitution and the protein sequence. However, SNPs&GO runs
   faster and more accurately when the UniProt code is given instead of the sequence. I have not yet been able to make that work via the script.

   
   
INSTALLATION

All supplied Perl scripts and modules are immediately usable.

SVM: 
Download SVMlight from http://svmlight.joachims.org/ . Self-contained executables should be available there, as well as source code. 



HOW TO USE

The following guide assumes no prerequisites. However, you are free to develop your own solutions for the preparation steps
 leading to the data and format required by SVMlight.
The scripts provided are flexible with the files they can parse. They each require tab-delimited text files as input, and
 will auto-detect the relevant data columns based on the labels in the header line of each file. The actual order of the
 columns in the files is not important, as long as the required labels are present.
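
The header-based column detection described above can be illustrated with a minimal Python sketch. It is not part of the
 distribution and does not reproduce the Perl implementation. The labels "protein" and "substitution" and the sample data are
 made up for illustration; the help message of each script lists the labels it really expects.

```python
import csv
import io


def locate_columns(header, wanted):
    """Map each wanted label to its column index in the header row."""
    positions = {label: i for i, label in enumerate(header)}
    missing = [label for label in wanted if label not in positions]
    if missing:
        raise ValueError("missing column label(s): " + ", ".join(missing))
    return {label: positions[label] for label in wanted}


def read_columns(handle, wanted):
    """Read a tab-delimited file and keep only the wanted columns.

    Columns are found by label, so their order in the file is irrelevant.
    Empty lines are skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cols = locate_columns(header, wanted)
    return [{label: row[i] for label, i in cols.items()} for row in reader if row]


# Hypothetical labels and data; the wanted columns are neither first nor in order.
sample = "note\tsubstitution\tprotein\nexample\tR123C\tNP_000001\n"
rows = read_columns(io.StringIO(sample), ["protein", "substitution"])
print(rows)
```

Looking columns up by label rather than by position is what allows custom files with extra or reordered columns to be parsed.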


+ STEP 0: Data preparation
--------------------------
It is practical to prepare your variants in a tab-separated file, similar to index.example in the examples folder.
 You can then use this file to submit variants to SNPs&GO and Mutation Assessor, as well as to guide the collation
 of the different prediction outputs later on.


+ STEP 1: Data collection
-------------------------
Results for SIFT, PolyPhen2, SNPs&GO and Mutation Assessor must first be obtained. SIFT and PolyPhen2 have batch submission forms
 on their websites. For each of Mutation Assessor and SNPs&GO, you'll find a submission script in the scripts folder.
 
  Help:
  [perl] ./scripts/massess_submit.pl
  [perl] ./scripts/snpsNgo_submit.pl
  
  Run:
  [perl] ./scripts/massess_submit.pl [-v <variants.file>] [-o <output.file>]
  [perl] ./scripts/snpsNgo_submit.pl [-v <variants.file>] [-o <output.file>]

  Pipe both input and output:
  [perl] ./scripts/massess_submit.pl -p 1
  [perl] ./scripts/snpsNgo_submit.pl -p 1
  
These two scripts take as input format a tab-separated file (see examples/index.example), and output another tab-separated file.
 Both of them will auto-detect certain column labels in the header row of the file. To see which labels are required, trigger the
 help message of each script.

Either or both of the -v and -o arguments may be omitted, in which case the scripts will read from STDIN and/or write to STDOUT accordingly.
 However, because a complete lack of arguments triggers the help message, it is necessary to supply the token argument -p 1 in order
 to pipe both streams.
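
The argument rules above can be summarised in a short Python sketch, for anyone driving the scripts from another program.
 The helper names build_command and run_piped are hypothetical; only the -v, -o and -p 1 arguments come from this README.

```python
import subprocess


def build_command(script, variants=None, output=None):
    """Build the argument list for one of the submission scripts.

    -v and -o are added only when given. If both are omitted, the token
    argument -p 1 is added so that the script pipes STDIN to STDOUT
    instead of printing its help message.
    """
    cmd = ["perl", script]
    if variants is not None:
        cmd += ["-v", variants]
    if output is not None:
        cmd += ["-o", output]
    if variants is None and output is None:
        cmd += ["-p", "1"]
    return cmd


def run_piped(script, text):
    """Send text to the script on STDIN and return what it writes to STDOUT.

    Requires Perl and the script itself; not called in this sketch.
    """
    result = subprocess.run(build_command(script), input=text,
                            capture_output=True, text=True, check=True)
    return result.stdout


print(build_command("./scripts/massess_submit.pl"))
print(build_command("./scripts/massess_submit.pl", variants="variants.file"))
```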


+ STEP 2: Output collation
--------------------------
Before applying the consensus, the predictions need to be collated into a single file, with a specific format, for SVMlight.

The script collate.pl is provided for this task. It assumes the predictions are in the following formats:
- SIFT and PolyPhen2 - from online batch submissions
- Mutation Assessor and SNPs&GO - from the two scripts supplied here.
The script will auto-detect the relevant data columns based on their labels, therefore allowing custom files to be used as well.
 See the perldoc documentation of the following supplied modules, which are used by collate.pl, in order to find out which labels
 are required:
 VarLib::MassessResults, VarLib::PolyphenResults, VarLib::SiftResults and VarLib::SnpsngoResults.

There are two ways to collate the outputs. If you have created the recommended index file (STEP 0), it can be used to pull together
 the matching predictions from the different files, even if the order of the predictions is shuffled. 
 If no such index file is specified, collation will be done line by line, by assuming the predictions are in the same order. A rudimentary
 check is carried out by comparing the amino acid substitutions across the files. If they differ, a warning will be recorded in the log,
 but the script will continue, as the difference may be due to database inconsistencies. 
 (!) It is the user's responsibility to ensure the predictions are all in the right order and to inspect any inconsistencies reported.
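
The difference between the two collation modes can be sketched in Python as follows. This is an illustration under stated
 assumptions, not the logic of collate.pl: the (protein, substitution) key, the function names and all values are made up.

```python
def collate_by_index(index_keys, predictions):
    """Index-guided collation.

    index_keys:  variant keys in the desired output order.
    predictions: {method: {key: value}}, each method in any order.
    A method with no prediction for a variant contributes None.
    """
    methods = sorted(predictions)
    return [[key] + [predictions[m].get(key) for m in methods]
            for key in index_keys]


def collate_by_line(predictions):
    """Line-by-line collation, trusting that all files share one order.

    predictions: {method: [(substitution, value), ...]}.
    Substitutions that differ on a line produce a warning, but the row is
    still emitted. zip() stops at the shortest list.
    """
    methods = sorted(predictions)
    rows, warnings = [], []
    per_method = (predictions[m] for m in methods)
    for line_no, entries in enumerate(zip(*per_method), start=1):
        substitutions = {sub for sub, _ in entries}
        if len(substitutions) > 1:
            warnings.append(
                f"line {line_no}: substitutions differ: {sorted(substitutions)}")
        rows.append([value for _, value in entries])
    return rows, warnings


# Made-up keys and values, for illustration only.
index = [("P1", "A10V"), ("P2", "R5C")]
shuffled = {
    "sift": {("P2", "R5C"): 0.01, ("P1", "A10V"): 0.40},
    "polyphen": {("P1", "A10V"): 0.12},
}
print(collate_by_index(index, shuffled))

ordered = {
    "sift": [("A10V", 0.40), ("R5C", 0.01)],
    "polyphen": [("A10V", 0.12), ("R5H", 0.90)],
}
print(collate_by_line(ordered))
```

With an index, shuffled or incomplete prediction files still line up correctly. Without one, a mismatch can only be flagged,
 not repaired, which is why the warnings in the log deserve inspection.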

  [perl] ./scripts/collate.pl  -o <output.file> [-f {score/label}] [-i <index.file>] [-s <sift.file>] [-p <polyphen.file>] [-g <snpsngo.file>] [-m <massess.file>] 

The script does not require output files from all four tools; it can work with any subset of them.
The -f argument takes the value "score" or "label". This determines whether collate.pl outputs the scores of the methods or
 the class labels assigned by the tools. The scores are used by the SVM consensus, while the text labels are used by the Weighted Vote
 consensus.

Sample output is provided: examples/data_svm.example , examples/data_wv.example


+ STEP 3: Consensus
-------------------

[A]-Weighted Majority Vote

  [perl] ./wv.pl <file.data> <output.file>

Here's a list of the recognized class labels:
 Not scored
 NA
 N/A
 TOLERATED
 benign
 Neutral
 neutral
 low
 DAMAGING *Warning! Low confidence.
 *DAMAGING
 possibly damaging
 medium
 DAMAGING
 probably damaging
 Disease
 high
 deleterious


[B]-SVM

  ./svm_classify <file.data> <model.svm> <output.file>

- file.data is the file with the collated predictions
- model.svm is either models/linear-model.svm or models/radial-model.svm
- output.file is where you want your classification results to be stored


OUTPUT:
.......

Both [A] and [B] create a file with a list of classification score values.
 value >0 : damaging
 value <0 : neutral
 value =0 : can't classify
The values are in the same order as the input data. Every input query receives a score.
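
The sign rule above can be applied to a result file with a few lines of Python. This sketch assumes one numeric value per
 line; the function names and the sample values are made up for illustration.

```python
def classify(value):
    """Translate a consensus score into a class by its sign."""
    if value > 0:
        return "damaging"
    if value < 0:
        return "neutral"
    return "can't classify"


def read_classifications(lines):
    """Classify every non-empty line, keeping the input order."""
    return [classify(float(line)) for line in lines if line.strip()]


# Made-up values, in the same order as the (hypothetical) input variants.
sample_output = ["0.73\n", "-1.20\n", "0\n"]
for line, label in zip(sample_output, read_classifications(sample_output)):
    print(line.strip(), label)
```

Because the values keep the order of the input data, the resulting labels can be matched back to the variants line by line.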

---------------------------------------------------------------------------------------------------

COPYRIGHT NOTICE
Kimon Frousios, JUN2012

The contents of this distribution are all written by me. They are free for academic use.
Modifications are permitted for your own use only.
Any third-party software employed by this method is used via public interfaces, and is not re-distributed by myself
or King's College London.


Third-Party Software:

This method relies on a number of external resources:
SIFT[1], PolyPhen2[2], SNPs&GO[3], Mutation Assessor[4] and SVMlight[5] are third-party software.
Neither I nor King's College London are in any way affiliated with their authors, nor are we responsible in any way for their maintenance.


[1] P. Kumar, S. Henikoff and P.C. Ng. Predicting the effect of coding non-synonymous
variants on protein function using the SIFT algorithm. Nature Protocols 4(7):1073, 2009.
[2] I.A. Adzhubei, S. Schmidt, L. Peshkin, et al. A method and server for predicting damaging
missense mutations. Nature Methods 7(4):248, 2010.
[3] R. Calabrese, E. Capriotti, P. Fariselli, et al. Functional annotations improve the predictive
score of human disease-related mutations in proteins. Human Mutation 30:1237, 2009.
[4] B. Reva, Y. Antipin and C. Sander. Predicting the functional impact of protein mutations:
application to cancer genomics. Nucleic Acids Research 39(17):e118, 2011.
[5] T. Joachims. Making large-scale SVM learning practical. Advances in Kernel Methods –
Support Vector Learning, ch.11, MIT Press, 1999.

Source: README.txt, updated 2013-06-01