Recent changes to Project_overview

WikiPage Project_overview modified by Mingkun li

Mingkun li — Wed, 30 May 2012 09:25:16 -0000

--- v48
+++ v49
@@ -1,123 +1,2 @@
-#####Name: 
----
-***D***etecting low-level mutations by utilizing the ***re***-sequencing ***e***rror ***p***rofile of the data (***Dreep***)
-#####Features:
------
-#####Usage:
----
+I no longer update the information here; please find updates at 
 
-### I no longer update the information here; please find updates at 
-
-1.Generating pileup file from sorted SAM file
-To build a pileup file, you need the CRISP package (if its webpage is unavailable, you can contact its author or me)
->python sam_to_pileup.py indi1.sorted.sam refsequence.fasta  > indi1.pileup 
-
-2.Generating ssp file from pileup file
-Refine the pileup file by mismatch number, quality score, mapping quality score
->perl filter_and_summary.pl
->***-i*** pileup file 
->***-d*** number of mismatches allowed 
->***-q*** minimum quality score
->***-m*** minimum mapping quality score
->***-r*** length of the reads
->***-s*** length of the read bins<10bp>
-> to save the output, use "> output.file" 
-
-3.Generating error profile by using reference panel (for POISSON method)
-All the ssp files should be saved in one folder, and with a specific suffix
->perl error_profile_pois.pl
->***-d*** folder including all population data(in ssp format)
->***-s*** suffix of the ssp files
->***-i*** length of the read bins<10bp>
->***-r*** length of the reads
-> after running this script, you should have error_pois.index and error_pois.position in the same folder
-
-
-3.1 Generating error profile by using reference panel (for EMP method)
-All the ssp files should be saved in one folder, and with a specific suffix
->perl error_profile_emp.pl
->***-d*** folder including all population data(in ssp format)
->***-s*** suffix of the ssp files
->***-i*** length of the read bins<10bp>
->***-m*** minimum number of samples; otherwise, nucleotides are combined
->***-t*** minimum coverage in each bin
-> after running this script, you should have error_emp.index in the same folder
-> -t: bins whose coverage is lower than the value of <-t> are not included in the reference error database
-> -m: if the number of entries in the reference error database is lower than the value of <-m>, different nucleotides are merged if their T-test p-value>0.005
-> If the number of entries in the reference error database is lower than 10, nothing is applied, as the result could be highly biased
-> If the number of entries in the reference error database is >50 and lower than <-m>, simulated error rates are generated assuming a normal distribution
-
-4.Quality score calculation
-
-With Poisson distribution
->perl Dreep_poisson.pl
->***-i*** ssp file
->***-r*** error profile file (error_pois.index)
->***-n*** length of the reads bins(10bp)
->***-t*** remove the C-stretch and STR (for mitochondrial data)
->***-e*** use it when the questioned individual is included in the reference panel
->***-d*** minimum coverage to apply consensus-specific error profile(1000)
-
-With empirical distribution
-
->perl Dreep_emp.pl
->***-i*** ssp file
->***-r*** error profile file (error_emp.index)
->***-s*** suffix of the ssp files
->***-l*** length of the read bins(10bp)
->***-m*** minimum number of error rate entries in error_emp.index; otherwise, different nucleotides at the same position are merged
->***-s*** lowest p_value for each bin (the empirical p_value depends on the sample size; <-s> defines the lower bound, and 0.01~0.001 would be fine).
-
-An output file with the suffix pois.log or emp.log is generated for each sample
-
-5. Specify your own heteroplasmy(LLM) criteria
-
->perl Dreep_poi_filter.pl
->***-n*** directory containing all the log files (output of Dreep_emp.pl Dreep_pois.pl)
->***-s*** suffix of all the result files (log)
->***-a*** minor allele frequency
->***-b*** minor allele frequency per strand
->***-c*** minor allele count
->***-d*** minor allele count per strand
->***-e*** minor allele count (distinct reads)
->***-f*** minor allele count per strand (distinct reads)
->***-g*** minimum coverage
->***-h*** minimum QS
->***-i*** minimum QS per strand
->***-j*** minimum percentage of supporting reads (for pois only)
->***-k*** minimum percentage of supporting reads per strand (for pois only)
->***-l*** maximum percentage of the 3rd/3rd+4th allele (out of all non-major alleles)
->***-t*** ignore the C-stretch region
-To run it on a single file, give a more specific suffix
-
-
-#####Perl module required:
----
-Text::NSP::Measures::2D::Fisher::twotailed
-Math::CDF;
-Statistics::Descriptive 
-Statistics::PointEstimation (EMP method)
-Statistics::TTest (EMP method)
-Statistics::Distrib::Normal (EMP method)
-#####Test data:
----
-A test dataset (134 whole mitochondrial sequencing data sets) is publicly available from the European Nucleotide Archive’s Sequence Read Archive (http://www.ebi.ac.uk/ena/) through accession number ERP000879.
-
-More data (double indices, paired-end) will be released soon (with the original project)
-#####Notice:
----
-This version cannot start the analysis from a specified position; it always starts from the 1st position, even if that position is not in your pileup file. To make it faster, use only your target sequence as the reference.
-
-The length of the read bin should be determined by how the error rate develops along the read. With the test data (mtDNA mock mixture, sequencing coverage=2000x, 76bp), bin sizes of 5bp, 10bp, and 20bp give the same result (no false positives or false negatives). Theoretically, a larger sample size (reads) in each bin gives a better result, but the number of read bins should be at least 4 to counteract the effect of duplicate reads.
-
-#####Version
-v0.2 The EMP method now has a pipeline similar to the POISSON method; the output file includes more information; thresholds for standard output (on screen in Dreep_poisson.pl) were updated according to the results of the NUMTs project (MAF>0.02, DQS>10, other allele frequency <0.2)
-v0.1 first version
-
-#####Citation:
----
-manuscript submitted
-
-#####Contact the authors:
----
-fengzys[at]users.sourceforge.net
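
For readers following the removed usage notes above, the Poisson branch of the pipeline (steps 1 to 4) can be summarized as a list of command lines. This is only a sketch: it uses the script names and flags listed in the notes, but the helper name `pipeline_commands`, the sample and folder names, the way the ssp file name is built from the suffix, and the location of `error_pois.index` are all assumptions made for illustration. It builds the argument lists and does not run anything.

```python
def pipeline_commands(sample, reference_fasta, ssp_dir, suffix,
                      read_length, bin_size=10):
    """Return the argument lists for steps 1-4 (POISSON method).

    Hypothetical helper: file naming and the index location are assumptions.
    Steps 1 and 2 print to stdout, so their output must be redirected
    to a file (e.g. "> indi1.pileup") when they are actually run.
    """
    pileup = f"{sample}.pileup"
    ssp = f"{ssp_dir}/{sample}.{suffix}"          # assumed naming scheme
    index = f"{ssp_dir}/error_pois.index"         # assumed location
    return [
        # 1. pileup file from the sorted SAM file
        ["python", "sam_to_pileup.py", f"{sample}.sorted.sam", reference_fasta],
        # 2. ssp file from the pileup file
        ["perl", "filter_and_summary.pl", "-i", pileup,
         "-r", str(read_length), "-s", str(bin_size)],
        # 3. error profile from the reference panel
        ["perl", "error_profile_pois.pl", "-d", ssp_dir, "-s", suffix,
         "-i", str(bin_size), "-r", str(read_length)],
        # 4. quality score calculation
        ["perl", "Dreep_poisson.pl", "-i", ssp, "-r", index,
         "-n", str(bin_size)],
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands("indi1", "refsequence.fasta", "panel", "ssp", 76):
        print(" ".join(cmd))
```

The optional filters of step 2 (`-d`, `-q`, `-m`) are left out here because the notes give no default values for them.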

WikiPage Project_overview modified by Mingkun li

Mingkun li — Thu, 05 Apr 2012 08:09:08 -0000

--- v47 
+++ v48 
@@ -6,7 +6,7 @@
 #####Usage:
 ---
 
-I no longer update the information here; please find updates at dmcrop.sourceforge.net
+### I no longer update the information here; please find updates at 
 
 1.Generating pileup file from sorted SAM file
 To build a pileup file, you need the CRISP package (if its webpage is unavailable, you can contact its author or me)

WikiPage Project_overview modified by Mingkun li

Mingkun li — Thu, 05 Apr 2012 08:04:30 -0000

--- v46 
+++ v47 
@@ -5,6 +5,9 @@
 -----
 #####Usage:
 ---
+
+I no longer update the information here; please find updates at dmcrop.sourceforge.net
+
 1.Generating pileup file from sorted SAM file
 To build a pileup file, you need the CRISP package (if its webpage is unavailable, you can contact its author or me)
 >python sam_to_pileup.py indi1.sorted.sam refsequence.fasta  > indi1.pileup

WikiPage Project_overview modified by Mingkun li

Mingkun li — Thu, 01 Mar 2012 10:30:02 -0000

--- v45 
+++ v46 
@@ -22,7 +22,7 @@
 
 3.Generating error profile by using reference panel (for POISSON method)
 All the ssp files should be saved in one folder, and with a specific suffix
->perl error_profile_poisson.pl
+>perl error_profile_pois.pl
 >***-d*** folder including all population data(in ssp format)
 >***-s*** suffix of the ssp files
 >***-i*** length of the read bins<10bp>

WikiPage Project_overview modified by Mingkun li

Mingkun li — Thu, 01 Mar 2012 10:28:16 -0000

--- v44 
+++ v45 
@@ -22,17 +22,17 @@
 
 3.Generating error profile by using reference panel (for POISSON method)
 All the ssp files should be saved in one folder, and with a specific suffix
->perl error_profile_emp.pl
+>perl error_profile_poisson.pl
 >***-d*** folder including all population data(in ssp format)
 >***-s*** suffix of the ssp files
 >***-i*** length of the read bins<10bp>
 >***-r*** length of the reads
 > after running this script, you should have error_pois.index and error_pois.position in the same folder
 
 
 3.1 Generating error profile by using reference panel (for EMP method)
 All the ssp files should be saved in one folder, and with a specific suffix
->perl error_profile_pois.pl
+>perl error_profile_emp.pl
 >***-d*** folder including all population data(in ssp format)
 >***-s*** suffix of the ssp files
 >***-i*** length of the read bins<10bp>

WikiPage Project_overview modified by Mingkun li

Mingkun li — Fri, 10 Feb 2012 16:21:02 -0000

--- v43 
+++ v44 
@@ -108,7 +108,7 @@
 The length of the read bin should be determined by how the error rate develops along the read. With the test data (mtDNA mock mixture, sequencing coverage=2000x, 76bp), bin sizes of 5bp, 10bp, and 20bp give the same result (no false positives or false negatives). Theoretically, a larger sample size (reads) in each bin gives a better result, but the number of read bins should be at least 4 to counteract the effect of duplicate reads.
 
 #####Version
-v0.2 The EMP method now has a pipeline similar to the POISSON method; the output file includes more information
+v0.2 The EMP method now has a pipeline similar to the POISSON method; the output file includes more information; thresholds for standard output (on screen in Dreep_poisson.pl) were updated according to the results of the NUMTs project (MAF>0.02, DQS>10, other allele frequency <0.2)
 v0.1 first version
 
 #####Citation:
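
The v0.2 thresholds introduced in the change above (MAF>0.02, DQS>10, other allele frequency <0.2) amount to a simple per-site filter. A minimal sketch, assuming each site is available as a dictionary: the function name and the keys `maf`, `dqs`, and `other_af` are hypothetical and do not reflect the actual log file columns.

```python
def passes_v02_thresholds(site, min_maf=0.02, min_dqs=10, max_other_af=0.2):
    """Return True if a site meets the three v0.2 standard-output thresholds.

    `site` is a dict with hypothetical keys:
      maf      - minor allele frequency
      dqs      - Dreep quality score
      other_af - frequency of the remaining (3rd/4th) alleles
    All three comparisons are strict, as written in the version note.
    """
    return (site["maf"] > min_maf
            and site["dqs"] > min_dqs
            and site["other_af"] < max_other_af)


if __name__ == "__main__":
    example = {"maf": 0.05, "dqs": 25, "other_af": 0.01}
    print(passes_v02_thresholds(example))
```

Keeping the cut-offs as keyword arguments makes it easy to try stricter or looser criteria without touching the comparison logic.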

WikiPage Project_overview modified by Mingkun li

Mingkun li — Mon, 23 Jan 2012 14:25:25 -0000

--- v42 
+++ v43 
@@ -44,26 +44,28 @@
 > If the number of entries in the reference error database is lower than 10, nothing is applied, as the result could be highly biased
 > If the number of entries in the reference error database is >50 and lower than <-m>, simulated error rates are generated assuming a normal distribution
 
-4.Detecting low-level mutations by utilizing the error profile
-
+4.Quality score calculation
+
 With Poisson distribution
 >perl Dreep_poisson.pl
 >***-i*** ssp file
 >***-r*** error profile file (error_pois.index)
 >***-n*** length of the reads bins(10bp)
 >***-t*** remove the C-stretch and STR (for mitochondrial data)
 >***-e*** use it when the questioned individual is included in the reference panel
 >***-d*** minimum coverage to apply consensus-specific error profile(1000)
 
 With empirical distribution
 
 >perl Dreep_emp.pl
 >***-i*** ssp file
 >***-r*** error profile file (error_emp.index)
 >***-s*** suffix of the ssp files
 >***-l*** length of the read bins(10bp)
 >***-m*** minimum number of error rate entries in error_emp.index; otherwise, different nucleotides at the same position are merged
 >***-s*** lowest p_value for each bin (the empirical p_value depends on the sample size; <-s> defines the lower bound, and 0.01~0.001 would be fine).
+
+An output file with the suffix pois.log or emp.log is generated for each sample
 
 5. Specify your own heteroplasmy(LLM) criteria
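
The change above renames step 4 to "Quality score calculation" but does not state the formula. Purely as an illustration, one way to turn a Poisson error model into a Phred-style score is to take the upper-tail probability of seeing at least the observed number of minor-allele reads, given the expected number of errors. This is an assumption for the sake of example, not the confirmed computation in Dreep_poisson.pl; the function names and the example numbers are hypothetical.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly to avoid cancellation."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    total = 0.0
    i = k
    while True:
        # probability mass at i, computed in log space for stability
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # past the mode the terms shrink monotonically; stop once negligible
        if i > lam and term <= total * 1e-17:
            break
        i += 1
    return min(total, 1.0)


def phred_like(p):
    """Phred-style score -10*log10(p); infinite when p underflows to zero."""
    return float("inf") if p <= 0 else -10.0 * math.log10(p)


if __name__ == "__main__":
    coverage, error_rate, minor_reads = 2000, 0.001, 40   # made-up numbers
    p = poisson_upper_tail(minor_reads, coverage * error_rate)
    print(phred_like(p))
```

Summing the tail directly, rather than computing one minus the lower tail, keeps very small probabilities from rounding to zero.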

WikiPage Project_overview modified by Mingkun li

Mingkun li — Mon, 23 Jan 2012 14:17:26 -0000

--- v41 
+++ v42 
@@ -106,7 +106,7 @@
 The length of the read bin should be determined by how the error rate develops along the read. With the test data (mtDNA mock mixture, sequencing coverage=2000x, 76bp), bin sizes of 5bp, 10bp, and 20bp give the same result (no false positives or false negatives). Theoretically, a larger sample size (reads) in each bin gives a better result, but the number of read bins should be at least 4 to counteract the effect of duplicate reads.
 
 #####Version
-v0.2 The EMP method now has a pipeline similar to the POISSON method
+v0.2 The EMP method now has a pipeline similar to the POISSON method; the output file includes more information
 v0.1 first version
 
 #####Citation:

WikiPage Project_overview modified by Mingkun li

Mingkun li — Mon, 23 Jan 2012 14:13:23 -0000

--- v40 
+++ v41 
@@ -65,6 +65,25 @@
 >***-m*** minimum number of error rate entries in error_emp.index; otherwise, different nucleotides at the same position are merged
 >***-s*** lowest p_value for each bin (the empirical p_value depends on the sample size; <-s> defines the lower bound, and 0.01~0.001 would be fine).
 
+5. Specify your own heteroplasmy(LLM) criteria
+
+>perl Dreep_poi_filter.pl
+>***-n*** directory containing all the log files (output of Dreep_emp.pl Dreep_pois.pl)
+>***-s*** suffix of all the result files (log)
+>***-a*** minor allele frequency
+>***-b*** minor allele frequency per strand
+>***-c*** minor allele count
+>***-d*** minor allele count per strand
+>***-e*** minor allele count (distinct reads)
+>***-f*** minor allele count per strand (distinct reads)
+>***-g*** minimum coverage
+>***-h*** minimum QS
+>***-i*** minimum QS per strand
+>***-j*** minimum percentage of supporting reads (for pois only)
+>***-k*** minimum percentage of supporting reads per strand (for pois only)
+>***-l*** maximum percentage of the 3rd/3rd+4th allele (out of all non-major alleles)
+>***-t*** ignore the C-stretch region
+To run it on a single file, give a more specific suffix
 
 
 #####Perl module required:

WikiPage Project_overview modified by Mingkun li

Mingkun li — Mon, 23 Jan 2012 14:08:51 -0000

--- v39 
+++ v40 
@@ -20,63 +20,74 @@
 >***-s*** length of the read bins<10bp>
 > to save the output, use "> output.file" 
 
-3.Generating error profile by using reference panel
-All the ssp files should be saved in one folder, and with a specific suffix
->perl error_profile.pl
->***-d*** folder including all population data(in ssp format)
->***-s*** suffix of the ssp files
->***-i*** length of the read bins<10bp>
+3.Generating error profile by using reference panel (for POISSON method)
+All the ssp files should be saved in one folder, and with a specific suffix
+>perl error_profile_emp.pl
+>***-d*** folder including all population data(in ssp format)
+>***-s*** suffix of the ssp files
+>***-i*** length of the read bins<10bp>
 >***-r*** length of the reads
-> after running this script, you should have error.index and error.position in the same folder
+> after running this script, you should have error_pois.index and error_pois.position in the same folder
+
+
+3.1 Generating error profile by using reference panel (for EMP method)
+All the ssp files should be saved in one folder, and with a specific suffix
+>perl error_profile_pois.pl
+>***-d*** folder including all population data(in ssp format)
+>***-s*** suffix of the ssp files
+>***-i*** length of the read bins<10bp>
+>***-m*** minimum number of samples; otherwise, nucleotides are combined
+>***-t*** minimum coverage in each bin
+> after running this script, you should have error_emp.index in the same folder
+> -t: bins whose coverage is lower than the value of <-t> are not included in the reference error database
+> -m: if the number of entries in the reference error database is lower than the value of <-m>, different nucleotides are merged if their T-test p-value>0.005
+> If the number of entries in the reference error database is lower than 10, nothing is applied, as the result could be highly biased
+> If the number of entries in the reference error database is >50 and lower than <-m>, simulated error rates are generated assuming a normal distribution
 
 4.Detecting low-level mutations by utilizing the error profile
 
 With Poisson distribution
 >perl Dreep_poisson.pl
 >***-i*** ssp file
->***-r*** error profile file (error.index)
+>***-r*** error profile file (error_pois.index)
 >***-n*** length of the reads bins(10bp)
 >***-t*** remove the C-stretch and STR (for mitochondrial data)
 >***-e*** use it when the questioned individual is included in the reference panel
 >***-d*** minimum coverage to apply consensus-specific error profile(1000)
 
-With Fisher Exact test
-
->perl Dreep_fisher.pl
->***-i*** ssp file
->***-r*** error profile file (error.index)
->***-n*** length of the read bins(10bp)
->***-t*** remove the C-stretch and STR (for mitochondrial data)
->***-e*** use it when the questioned individual is included in the reference panel
->***-d*** minimum coverage to apply consensus-specific error profile(1000)
-
 With empirical distribution
 
 >perl Dreep_emp.pl
->***-d*** folder including all the ssp files
->***-s*** suffix of the ssp files
->***-i*** length of the read bins(10bp)
-
-Dreep_poisson.pl and Dreep_fisher.pl process one individual at a time
-Dreep_emp.pl processes all the individuals at once
+>***-i*** ssp file
+>***-r*** error profile file (error_emp.index)
+>***-s*** suffix of the ssp files
+>***-l*** length of the read bins(10bp)
+>***-m*** minimum number of error rate entries in error_emp.index; otherwise, different nucleotides at the same position are merged
+>***-s*** lowest p_value for each bin (the empirical p_value depends on the sample size; <-s> defines the lower bound, and 0.01~0.001 would be fine).
+
+
 
 #####Perl module required:
 ---
 Text::NSP::Measures::2D::Fisher::twotailed
 Math::CDF;
-Statistics::Descriptive
+Statistics::Descriptive 
+Statistics::PointEstimation (EMP method)
+Statistics::TTest (EMP method)
+Statistics::Distrib::Normal (EMP method)
 #####Test data:
 ---
 A test dataset (134 whole mitochondrial sequencing data sets) is publicly available from the European Nucleotide Archive’s Sequence Read Archive (http://www.ebi.ac.uk/ena/) through accession number ERP000879.
 
 More data (double indices, paired-end) will be released soon (with the original project)
 #####Notice:
 ---
 This version cannot start the analysis from a specified position; it always starts from the 1st position, even if that position is not in your pileup file. To make it faster, use only your target sequence as the reference.
 
 The length of the read bin should be determined by how the error rate develops along the read. With the test data (mtDNA mock mixture, sequencing coverage=2000x, 76bp), bin sizes of 5bp, 10bp, and 20bp give the same result (no false positives or false negatives). Theoretically, a larger sample size (reads) in each bin gives a better result, but the number of read bins should be at least 4 to counteract the effect of duplicate reads.
 
 #####Version
+v0.2 The EMP method now has a pipeline similar to the POISSON method
 v0.1 first version
 
 #####Citation:
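
The notice in the change above says that 76bp reads were tested with bin sizes of 5bp, 10bp, and 20bp, and that at least 4 read bins are needed. The arithmetic behind that constraint can be sketched as follows. How Dreep actually assigns read positions to bins is not documented here, so the fixed-width binning and the function names are assumptions for illustration.

```python
import math


def number_of_bins(read_length, bin_size):
    """Number of fixed-width bins needed to cover a read (last bin may be short)."""
    return math.ceil(read_length / bin_size)


def bin_index(position_in_read, bin_size):
    """0-based bin index for a 1-based position within the read."""
    return (position_in_read - 1) // bin_size


def bin_size_is_acceptable(read_length, bin_size, min_bins=4):
    """Check the 'at least 4 read bins' rule from the notice."""
    return number_of_bins(read_length, bin_size) >= min_bins


if __name__ == "__main__":
    for size in (5, 10, 20, 38):
        print(size, number_of_bins(76, size), bin_size_is_acceptable(76, size))
```

Under this assumption, the three tested sizes give 16, 8, and 4 bins for 76bp reads, so 20bp is the largest of them that still satisfies the rule.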