README v0.4 (K.Frousios, MAY2013)

Changes in CoVEC v0.4:
- Updated VarLib::MassessSubmit to reflect the server update.
- Increased the flexibility of file format recognition in VarLib::SiftResults.
- Adjusted the collate.pl script.
- Improved library detection for collate.pl, snpsngo_submit.pl and massess_submit.pl.
- Added the new "*DAMAGING" prediction label of SIFT to the list recognized by wv.pl.
- Various minor tweaks and fixes in the documentation to reflect the above changes.
- Enabled all four VarLib::*Results modules to power through some of the errors encountered when parsing the predictions of variants, instead of immediately aborting execution.
- Fixed bugs in collate.pl that caused the wrong method results to be queried.
- Fixed bugs in collate.pl that called on undefined objects when data was not present from all methods.
- Fixed collate.pl so it does not output value-less fields.
- Fixed VarLib::MassessSubmit so that the file header is not hard-coded. That way it can keep up with changes in the output format of Mutation Assessor.
- Fixed bugs in collate.pl that caused scores of 0 to be treated as missing predictions.

Changes in version 0.3:
- Updated SnpsnGoSubmit to match the migrated server.
- Fixed some other bugs in VarLib::SnpsnGoSubmit concerning input auto-detection.

---------------------------------------------------------------------------------------------------

README v0.2 (K.Frousios, JUL2012)

INTRODUCTION

CoVEC is a consensus tool for prediction of coding SNP effect. Two consensus methods are supplied:
[A] wv.pl : This PERL script implements the Weighted Majority Vote method.
[B] linear-model.svm , radial-model.svm : These contain the Support Vector Machines models, to be used with SVMlight (Joachims, 1999)*.

The result is a binary classification of SNPs, based on the scores of up to 4 third-party methods: SIFT (Henikoff & Ng, 2003)**, PolyPhen2 (Adzhubei et al, 2010)**, SNPs&GO (Calabrese et al, 2009)** and Mutation Assessor (Reva et al, 2011)**. The consensus can work with any subset of these 4 methods.

*  SVMlight is not re-distributed with our method. Binaries and source can be easily obtained from its own website: http://svmlight.joachims.org/
** These four classifiers are not re-distributed with our method. Results for them must be obtained independently from their respective resources.
   -> The SIFT and PolyPhen2 websites have satisfactory batch capabilities. Data can be submitted manually and results downloaded in a single text file. Local installation options also exist.
   -> Mutation Assessor also has batch capability, but I do provide a script (massess_submit.pl) to access its web-API, as it may prove more convenient. It requires the NCBI peptide RefSeq code and the amino acid substitution.
   -> SNPs&GO does not offer batch processing, except when all substitutions are on the same protein. I provide a script (snpsngo_submit.pl) to assist in processing batches of variants from different proteins. It requires the amino acid substitution and the protein sequence. However, SNPs&GO runs faster and more accurately when the UniProt code is given instead of the sequence. Currently I have not been able to make that work via script.

INSTALLATION

All supplied perl scripts and modules are immediately usable.
SVM: Download SVMlight from http://svmlight.joachims.org/ . Self-contained executables should be available there, as well as source code.

HOW TO USE

The following guide assumes no pre-requisites. However, you are free to develop your own solutions for the preparation steps leading to the data and format required by SVMlight.

The scripts provided are flexible with the files they can parse.
They each require tab-delimited text files as input, and will auto-detect the relevant data columns based on the labels in the header line of each file. The actual order of the columns in the files is not important, as long as the required labels are present.

+ STEP 0: Data preparation
--------------------------
It is practical to prepare your variants in a tab-separated file, similar to index.example in the examples folder. You can then use this file to submit variants to SNPs&GO and Mutation Assessor, as well as to guide the collation of the different prediction outputs later on.

+ STEP 1: Data collection
-------------------------
Results for SIFT, PolyPhen2, SNPs&GO and Mutation Assessor must first be obtained. SIFT and PolyPhen2 have batch submission forms on their websites. For each of Mutation Assessor and SNPs&GO, you'll find a submission script in the scripts folder.

Help:
    [perl] ./scripts/massess_submit.pl
    [perl] ./scripts/snpsNgo_submit.pl
Run:
    [perl] ./scripts/massess_submit.pl [-v <variants.file>] [-o <output.file>]
    [perl] ./scripts/snpsNgo_submit.pl [-v <variants.file>] [-o <output.file>]
Pipe both input and output:
    [perl] ./scripts/massess_submit.pl -p 1
    [perl] ./scripts/snpsNgo_submit.pl -p 1

These two scripts take a tab-separated file as input (see examples/index.example), and output another tab-separated file. Both of them will auto-detect certain column labels in the header row of the file. To see which labels are required, trigger the help message of each script.
Either or both of the -v and -o arguments may be omitted, in which case the scripts will read from STDIN or write to STDOUT accordingly. However, because a complete lack of arguments triggers the help message, it is necessary to supply the token argument -p 1 in order to pipe both streams.

+ STEP 2: Output collation
--------------------------
Before applying the consensus, the predictions need to be collated into a single file, with a specific format, for SVMlight.
The script collate.pl is provided for this task. It assumes the predictions are in the following format:
- SIFT and PolyPhen2 - from online batch submissions
- Mutation Assessor and SNPs&GO - from the two scripts supplied here.
The script will auto-detect the relevant data columns based on their labels, therefore allowing custom files to be used as well. See the perldoc documentation of the following supplied modules, which are used by collate.pl, in order to find out which labels are required: VarLib::MassessResults, VarLib::PolyphenResults, VarLib::SiftResults and VarLib::SnpsngoResults.

There are two ways to collate the outputs. If you have created the recommended index file (STEP 0), it can be used to pull together the matching predictions from the different files, even if the order of the predictions is shuffled. If no such index file is specified, collation will be done line by line, assuming the predictions are in the same order. A rudimentary check is carried out by comparing the amino acid substitutions across the files. If they differ, a warning will be recorded in the log, but the script will continue, as the difference may be due to database inconsistencies.
(!) It is the user's responsibility to ensure the predictions are all in the right order and to inspect any inconsistencies reported.

    [perl] ./scripts/collate.pl -o <output.file> [-f {score/label}] [-i <index.file>] [-s <sift.file>] [-p <polyphen.file>] [-g <snpsngo.file>] [-m <massess.file>]

The script does not require output files from all four tools; it can work with any subset of them.
The -f argument takes the value "score" or "label". This defines whether the output of collate.pl will be the scores of the methods, or the class labels assigned by the tools. The scores are used by the SVM consensus, while the text labels are used by the Weighted Vote consensus.
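
The line-by-line consistency check described above can be illustrated with a small stand-alone sketch. It is written in Python purely for illustration and is not the code of collate.pl. The column labels "substitution" and "score", the sample variants and the function names are all hypothetical; the labels actually required are listed in the perldoc of the VarLib::*Results modules.

```python
import csv
import io


def read_substitutions(text, label):
    """Return the values of the column named `label` from tab-delimited text.

    The column is located by its header label, so column order does not matter.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames is None or label not in reader.fieldnames:
        raise ValueError(f"required column label {label!r} not found")
    return [row[label] for row in reader]


def check_same_order(columns):
    """Compare substitutions row by row across several prediction files.

    `columns` maps a method name to its list of substitutions.
    Returns a list of warning strings; an empty list means no mismatch was seen.
    """
    warnings = []
    n_rows = {name: len(values) for name, values in columns.items()}
    if len(set(n_rows.values())) > 1:
        warnings.append(f"row counts differ: {n_rows}")
    for i, row in enumerate(zip(*columns.values()), start=1):
        if len(set(row)) > 1:
            seen = dict(zip(columns.keys(), row))
            warnings.append(f"row {i}: substitutions differ: {seen}")
    return warnings


# Hypothetical inputs: same variants, different column order, one mismatch.
sift_text = "substitution\tscore\nA123T\t0.02\nG45R\t0.40\n"
polyphen_text = "score\tsubstitution\n0.98\tA123T\n0.10\tG45D\n"

columns = {
    "sift": read_substitutions(sift_text, "substitution"),
    "polyphen": read_substitutions(polyphen_text, "substitution"),
}
for message in check_same_order(columns):
    print("WARNING:", message)
```

In this sketch a mismatch only produces a warning and processing continues, mirroring the behaviour described for collation without an index file.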
Sample output is provided: examples/data_svm.example , examples/data_wv.example

+ STEP 3: Consensus
--------------------
[A]-Weighted Majority Vote
    [perl] ./wv.pl <file.data> <output.file>

Here's a list of the recognized class labels:
    Not scored
    NA
    N/A
    TOLERATED
    benign
    Neutral
    neutral
    low
    DAMAGING *Warning! Low confidence.
    *DAMAGING
    possibly damaging
    medium
    DAMAGING
    probably damaging
    Disease
    high
    deleterious

[B]-SVM
    ./svm_classify <file.data> <model.svm> <output.file>
    - file.data is the file with the collated predictions
    - model.svm is either models/linear-model.svm or models/radial-model.svm
    - output.file is where you want your classification results to be stored

OUTPUT:
.......
Both [A] and [B] create a file with a list of classification score values.
    value >0 : damaging
    value <0 : neutral
    value =0 : can't classify
The values are in the same order as the input data. Every input query receives a score.

---------------------------------------------------------------------------------------------------

COPYRIGHT NOTICE

Kimon Frousios, JUN2012

The contents of this distribution are all written by me. They are free for academic use. Modifications are permitted for own use only. Any third-party software employed by this method is used via public interfaces, and is not re-distributed by myself or King's College London.

Third-Party Software:
This method relies on a number of external resources: SIFT[1], PolyPhen2[2], SNPs&GO[3], Mutation Assessor[4] and SVMlight[5] are third-party software. Neither I nor King's College London are in any way affiliated with their authors, nor are we responsible in any way for their maintenance.

[1] P. Kumar, S. Henikoff and P.C. Ng. Predicting the effect of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protocols 4(7):1073, 2009.
[2] I.A. Adzhubei, S. Schmidt, L. Peshkin, et al. A method and server for predicting damaging missense mutations. Nature Methods 7(4):248, 2010.
[3] R. Calabrese, E. Capriotti, P. Fariselli, et al. Functional annotations improve the predictive score of human disease-related mutations in proteins. Human Mutation 30:1237, 2009.
[4] B. Reva, Y. Antipin and C. Sander. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Research 39(17):e118, 2011.
[5] T. Joachims. Making large-scale SVM learning practical. Advances in Kernel Methods – Support Vector Learning, ch.11, MIT-Press 1999.
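
As a follow-up to the OUTPUT rules in STEP 3, the sign-based interpretation of the consensus values can be sketched as below. This is an illustrative Python sketch, not part of the distribution. It assumes the output holds one numeric value per line; the function names and sample values are hypothetical.

```python
def interpret(value):
    """Map a consensus score to the class given by the OUTPUT rules."""
    if value > 0:
        return "damaging"
    if value < 0:
        return "neutral"
    return "can't classify"


def interpret_output(text):
    """Interpret consensus output given as text with one numeric value per line.

    Blank lines are skipped; the result keeps the order of the input values.
    """
    return [interpret(float(line)) for line in text.splitlines() if line.strip()]


# Hypothetical output values, in the same order as the input variants.
example_output = "1.25\n-0.73\n0\n"
for number, label in enumerate(interpret_output(example_output), start=1):
    print(number, label)
```

Because the values keep the order of the input data, the position of each label can be matched back to the corresponding row of the collated file.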
Source: README.txt, updated 2013-06-01