Recent changes to Validate_reads

Validate_reads modified by Robert Kofler

Robert Kofler — Tue, 25 Jul 2017 13:40:28 -0000

--- v41
+++ v42
@@ -88,7 +88,7 @@

 #PacBio
-PacBio reads frequently show a bimodal length distribution. SimulaTE allows the user to specify a read length distribution, with the simulated reads following this distribution. 
+PacBio reads frequently show a bimodal length distribution. SimulaTE allows the user to specify a read length distribution, with simulated reads following this distribution. 
 In this validation we use the following read length distribution: 

@@ -97,13 +97,16 @@
 the file can be obtained here https://sourceforge.net/projects/simulates/files/validation_reads/rld.txt/download

-We simulate PacBio reads like in the following:
+We simulate PacBio reads:
+
 ~~~~~
 python ~/dev/simulate/read_pool-seq_pacbio.py --pg chasis.fa --rld-file rld.txt --error-rate 0.1 --reads 10000 --fasta reads.fa
 ~~~~~
-We simulated 10k reads with an error rate of 0.1 (50% insertions and 50% deletions; this may be adjusted by the user).
+
+We simulated 10k reads with an error rate of 0.1 (50% insertions and 50% deletions; this value may be adjusted by the user).

 ##Read length distribution
+
 We computed the read length distribution:
 ~~~~~~
 samtools faidx reads.fa
@@ -116,7 +119,7 @@
 We used the following R-script to test whether the expected and observed distribution are identical https://sourceforge.net/projects/simulates/files/validation_reads/rld-ks.R/download

 
-The simulated and the observed read length distribution are not significantly different (Two-sample Kolmogorov-Smirnov test; D=0.0074, p=0.63),  demonstrating that SimulaTE accurately reproduces a given read length distribution (e.g. PacBio)
+The expected and the observed read length distributions are not significantly different (two-sample Kolmogorov-Smirnov test; D=0.0074, p=0.63), indicating that SimulaTE accurately reproduces a given read length distribution.

 ##Error rate
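
The Kolmogorov-Smirnov comparison reported above was done with the linked R script. As an independent illustration only, the following minimal Python sketch shows what the D statistic measures: the largest absolute gap between the empirical cumulative distributions of two samples. The function name `ks_statistic` is hypothetical and not part of SimulaTE; it computes D only, not the p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Return D, the largest absolute difference between two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions, so the maximum gap
    # is reached at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A small D, such as the 0.0074 reported above, means the two cumulative distributions nearly coincide over the whole range of read lengths.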
@@ -128,7 +131,7 @@

 Unfortunately, Picard failed to process these data (it crashed with ArrayIndex out of boundaries). Possibly it is not yet adapted to BAM files generated from very long reads.

-Therefore we used a different approach using a custom perl script: https://sourceforge.net/projects/simulates/files/validation_reads/count-error.pl/download
+Therefore we used a different approach, i.e. a custom perl script: https://sourceforge.net/projects/simulates/files/validation_reads/count-error.pl/download

 ~~~~~~
 samtools mpileup -x -AB -Q 0 -f chasis.fa map-pacbio.sort.bam.bam > pacbio.mpileup
@@ -143,10 +146,12 @@
 * **deletions** simulated=0.05; obtained= 0.0406
 * **insertions** simulated=0.05; obtained=0.0406

-The small difference between the simulated and obtained indel rate may be due to difficulties of aligning reads with indels; this is for example demonstrated by the vast number of observed base substitutions (1019677). Since no base substitutions were simulated, the obtained ones must be an artefact of incorrect alignments. Accordingly the total error rate (indels + base substitutions) is, at 9.6%, almost identical to the simulated 10%.
+The small difference between the simulated and obtained indel rate may be due to difficulties of aligning reads with indels; this is for example demonstrated by the vast number of observed base substitutions (1019677). Since no base substitutions were simulated, the obtained ones must be an artefact of incorrect alignments. In agreement with this, the total error rate (indels + base substitutions) is, at 9.6%, almost identical to the simulated 10%.

 ##Read distribution and coverage
+
+Finally we investigated the coverage.

 ~~~~~~
 cat pacbio.mpileup|awk '{print $2,$4}' > pacbio.coverage
@@ -158,5 +163,5 @@

 [[img src=pacbio.coverage.png ]]

-This shows that, except for the ends of the sequence, the coverage distribution of PacBio reads is uniform. The lower coverage at the ends is expected as reads do not extend beyond the sequence ends. We thus recommend not to consider  sequence ends when testing the performance of tools using PacBio reads
+This shows that, except for the ends of the chassis, the coverage distribution of PacBio reads is uniform. The lower coverage at the ends of the chassis is expected, as reads do not extend beyond the sequence ends. We therefore recommend excluding the sequence ends when testing the performance of tools with PacBio reads.
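
The recommendation to ignore the sequence ends can be sketched in Python as follows. This is a minimal illustration, not part of the validation protocol: it assumes input lines in the `position depth` layout written by the `awk '{print $2,$4}'` step, assumes every position of the chassis is present in the file, and uses a hypothetical `trim` parameter for the number of positions to drop at each end.

```python
def mean_coverage(lines, trim=0):
    """Mean depth from 'position depth' lines, skipping `trim` positions per end."""
    depths = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        depths.append(int(fields[1]))
    if trim:
        # Assumption: lines are sorted by position and no position is missing.
        depths = depths[trim:-trim]
    if not depths:
        raise ValueError("no positions left after trimming")
    return sum(depths) / len(depths)
```

For example, opening `pacbio.coverage` and calling `mean_coverage(handle, trim=10000)` would report the mean depth of the interior only; the value 10000 is an arbitrary placeholder and should be chosen relative to the longest simulated reads.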

Validate_reads modified by Robert Kofler

Robert Kofler — Tue, 25 Jul 2017 13:33:36 -0000

--- v40
+++ v41
@@ -1,11 +1,11 @@
 [TOC]

 #Introduction
-We validate the scripts generating Illumina and PacBio reads. We show the entire protocol used for validating the scripts.  Results of this validation are highlighted in yellow
+Here, we validate the scripts generating Illumina and PacBio reads. We show the entire protocol used for validating the scripts. Results of the validation are highlighted in yellow.
 #Illumina
 Download https://sourceforge.net/projects/simulates/files/SanMiguel/chasis.fa/download

-simulate the illumina paired end reads; read length = 100; inner distance = 100 (outer distance  = 300); standard deviation of the inner distance = 20; error rate = 0.01; number of reads = 50000 (the resulting coverage should be 100); 
+Simulate Illumina paired-end reads with read length = 100; inner distance = 100 (outer distance = 300); standard deviation of the inner distance = 20; error rate = 0.01; number of reads = 50000 (the resulting coverage should be 100):
 ~~~~~
 python ~/dev/simulate/read_pool-seq_illumina-PE.py --pg chasis.fa --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 50000 --fastq1 reads1.fq --fastq2 reads2.fq  
 # bwa 0.7.15-r1140
@@ -15,12 +15,12 @@

 ##Error rate

-We compute the error rate using Picard (v2.9.4)
+We estimate the error rate using Picard (v2.9.4)
 ~~~~~
 java -jar ~/programs/picard-2.9.4/picard.jar CollectAlignmentSummaryMetrics I= mapped.sort.bam R= chasis.fa O= sumary.metrics.txt
 ~~~~~

-Where we obtain:
+and obtain:

 ~~~~~
 ## METRICS CLASS   picard.analysis.AlignmentSummaryMetrics
@@ -38,6 +38,8 @@

 ## Coverage and read distribution

+Next we investigated the coverage. In particular, we asked whether we obtain the correct mean coverage and whether all regions of the chassis are used as templates for reads, i.e. whether the coverage is even within the bounds expected from a Poisson distribution.
+
 ~~~~~~
 samtools mpileup mapped.sort.bam > cha.mpileup
 cat cha.mpileup|awk '{print $2,$4}'> coveragedistribution.forR 
@@ -54,7 +56,9 @@
 [[img src=coverage_sizeadjust.png ]]

 ## inner distance
-We computed the inner distance using Picard (v2.9.4)
+
+Next we tested whether the inner distance of Illumina paired-ends is simulated correctly.
+We estimated the inner distance using Picard (v2.9.4)

 ~~~~~
 java -jar ~/programs/picard-2.9.4/picard.jar CollectInsertSizeMetrics I= mapped.sort.bam O= is.txt H= histo.pdf

Validate_reads modified by Robert Kofler

Robert Kofler — Mon, 17 Jul 2017 13:31:26 -0000

--- v39
+++ v40
@@ -72,11 +72,11 @@
 * **mean fragment size** simulated = 300; obtained = 299.526
 * **standard deviation of fragment size** simulated = 20; obtained = 20.084

-To test if the observed and expected inner distance distributions are identical, we used the following Chi-square test: 
+To test if the observed and expected fragment size distributions are identical, we extracted the fragment size (*samtools view mapped.sort.bam|awk '{print $9}'> obs-rld.txt*) and used the following Chi-square test: 
 https://sourceforge.net/projects/simulates/files/validation_reads/chisquare.R/download

 Pearson's Chi-squared test; X-squared = 91.399, df = 164, p-value = 1
-This demonstrates that the observed and expected distribution of inner distances are identical
+This indicates that the observed and expected fragment size distributions do not differ significantly

 The following graph (generated by Picard) shows the fragment size distribution of the simulated paired-end reads:
 [[img src=histo_size_adj.png ]]
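
The fragment size extraction above can also be sketched in Python. This is a minimal illustration rather than the script used for the validation: it reads SAM text lines (for example the output of `samtools view mapped.sort.bam`), takes the template length from column 9, and keeps only positive values so that each pair is counted once. The names `fragment_sizes` and `summarize` are hypothetical. The expected fragment size follows from the simulation settings: two reads of 100 bp plus an inner distance of 100 bp.

```python
import statistics

READ_LENGTH = 100     # simulation setting (--read-length)
INNER_DISTANCE = 100  # simulation setting (--inner-distance)
EXPECTED_FRAGMENT = 2 * READ_LENGTH + INNER_DISTANCE  # outer distance


def fragment_sizes(sam_lines):
    """Yield one fragment size per read pair from SAM text lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete alignment record
        tlen = int(fields[8])  # column 9 of the SAM format (TLEN)
        if tlen > 0:  # the leftmost mate carries the positive value
            yield tlen


def summarize(sizes):
    """Return (mean, sample standard deviation) of the fragment sizes."""
    sizes = list(sizes)
    return statistics.mean(sizes), statistics.stdev(sizes)
```

The mean and standard deviation returned by `summarize` can then be compared with the simulated values of 300 and 20.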

Validate_reads modified by Robert Kofler

Robert Kofler — Mon, 17 Jul 2017 13:29:07 -0000

--- v38
+++ v39
@@ -72,8 +72,15 @@
 * **mean fragment size** simulated = 300; obtained = 299.526
 * **standard deviation of fragment size** simulated = 20; obtained = 20.084

+To test if the observed and expected inner distance distributions are identical, we used the following Chi-square test: 
+https://sourceforge.net/projects/simulates/files/validation_reads/chisquare.R/download
+
+Pearson's Chi-squared test; X-squared = 91.399, df = 164, p-value = 1
+This demonstrates that the observed and expected distribution of inner distances are identical
+
 The following graph (generated by Picard) shows the fragment size distribution of the simulated paired-end reads:
 [[img src=histo_size_adj.png ]]
+

 #PacBio

Validate_reads modified by Robert Kofler

Robert Kofler — Fri, 14 Jul 2017 15:01:32 -0000

--- v37
+++ v38
@@ -105,7 +105,7 @@
 We used the following R-script to test whether the expected and observed distribution are identical https://sourceforge.net/projects/simulates/files/validation_reads/rld-ks.R/download

 
-The simulated and the observed read length distribution are not significantly different (Two-sample Kolmogorov-Smirnov test; *D=0.0074*, *p=0.63*),  demonstrating that SimulaTE accurately reproduces a given read length distribution (e.g. PacBio)
+The simulated and the observed read length distribution are not significantly different (Two-sample Kolmogorov-Smirnov test; D=0.0074, p=0.63),  demonstrating that SimulaTE accurately reproduces a given read length distribution (e.g. PacBio)

 ##Error rate

Validate_reads modified by Robert Kofler

Robert Kofler — Fri, 14 Jul 2017 15:00:41 -0000

--- v36
+++ v37
@@ -102,7 +102,10 @@
 And compared the expected (line) and observed (histogram) read length distribution in R (ggplot2)
 [[img src=rld-histo.png ]]

-This demonstrates that SimulaTE accurately reproduces a given read length distribution (e.g. PacBio)
+We used the following R-script to test whether the expected and observed distribution are identical https://sourceforge.net/projects/simulates/files/validation_reads/rld-ks.R/download
+
+
+The simulated and the observed read length distribution are not significantly different (Two-sample Kolmogorov-Smirnov test; *D=0.0074*, *p=0.63*),  demonstrating that SimulaTE accurately reproduces a given read length distribution (e.g. PacBio)

 ##Error rate

Validate_reads modified by Robert Kofler

Robert Kofler — Wed, 05 Jul 2017 08:01:46 -0000

--- v35
+++ v36
@@ -144,5 +144,5 @@

 [[img src=pacbio.coverage.png ]]

-This shows that except for the ends of the sequence that the coverage distribution of PacBio reads is uniform. When testing tools for TE identification with PacBio reads we recommend to avoid the ends of the sequence
+This shows that, except for the ends of the sequence, the coverage distribution of PacBio reads is uniform. The lower coverage at the ends is expected as reads do not extend beyond the sequence ends. We thus recommend not to consider  sequence ends when testing the performance of tools using PacBio reads

Validate_reads modified by Robert Kofler

Robert Kofler — Tue, 04 Jul 2017 16:46:39 -0000

--- v34
+++ v35
@@ -138,9 +138,11 @@
 cat pacbio.mpileup|awk '{print $2,$4}' > pacbio.coverage
 ~~~~~~

-* **mean coverage** expected= 1118.18; observed= 1118.358
+* **mean coverage** expected= 1118.18; observed= 1118.358

 [[img src=pacbio.coverage.png ]]

+This shows that except for the ends of the sequence that the coverage distribution of PacBio reads is uniform. When testing tools for TE identification with PacBio reads we recommend to avoid the ends of the sequence
+

Validate_reads modified by Robert Kofler

Robert Kofler — Tue, 04 Jul 2017 16:43:46 -0000

--- v33
+++ v34
@@ -132,5 +132,15 @@
 The small difference between the simulated and obtained indel rate may be due to difficulties of aligning reads with indels; this is for example demonstrated by the vast number of observed base substitutions (1019677). Since no base substitutions were simulated, the obtained ones must be an artefact of incorrect alignments. Accordingly the total error rate (indels + base substitutions) is, at 9.6%, almost identical to the simulated 10%.

-##Read distribution
+##Read distribution and coverage

+~~~~~~
+cat pacbio.mpileup|awk '{print $2,$4}' > pacbio.coverage
+~~~~~~
+
+* **mean coverage** expected= 1118.18; observed= 1118.358
+
+
+
+[[img src=pacbio.coverage.png ]]
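
The page does not state how the expected mean coverage was derived. One plausible way, sketched here under that assumption, is to divide the total number of simulated bases by the length of the chassis. The sketch reads the lines of two `samtools faidx` index files (`reads.fa.fai` is produced by the command shown earlier; a `chasis.fa.fai` index is assumed to exist) and uses their second column, the sequence length. The function names are hypothetical.

```python
def total_length(fai_lines):
    """Sum the sequence lengths (second column) of a samtools .fai index."""
    total = 0
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            total += int(fields[1])
    return total


def expected_coverage(read_fai_lines, reference_fai_lines):
    """Total read bases divided by total reference bases."""
    reference_bases = total_length(reference_fai_lines)
    if reference_bases == 0:
        raise ValueError("reference index contains no sequence")
    return total_length(read_fai_lines) / reference_bases
```

The result can be compared directly with the observed mean of the depth column in `pacbio.coverage`.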
+

Validate_reads modified by Robert Kofler

Robert Kofler — Tue, 04 Jul 2017 16:32:08 -0000

--- v32
+++ v33
@@ -117,18 +117,19 @@
 Therefore we used a different approach using a custom perl script: https://sourceforge.net/projects/simulates/files/validation_reads/count-error.pl/download

 ~~~~~~
-samtools mpileup -f chasis.fa map-pacbio.sort.bam.bam > pacbio.mpileup
+samtools mpileup -x -AB -Q 0 -f chasis.fa map-pacbio.sort.bam.bam > pacbio.mpileup
 perl count-error.pl pacbio.mpileup
 # Which yields:
-# Deletions 3407812
-# Insertions 3458991
-# Basesubs 1019677
-# total 82630880
+# Deletions 4370823
+# Insertions 4369478
+# Basesubs 1655095
+# total 107448200
 ~~~~~~

-* **deletions** simulated=0.05; obtained= 0.0412
-* **insertions** simulated=0.05; obtained=0.0418
-The small difference between the simulated and obtained error rate may be due to difficulties of aligning reads with indels; this is for example demonstrated by the vast number of observed base substitutions (1019677). Since no base substitutions were simulated, the obtained ones must be an artefact of incorrect alignments.
+* **deletions** simulated=0.05; obtained= 0.0406
+* **insertions** simulated=0.05; obtained=0.0406
+
+The small difference between the simulated and obtained indel rate may be due to difficulties of aligning reads with indels; this is for example demonstrated by the vast number of observed base substitutions (1019677). Since no base substitutions were simulated, the obtained ones must be an artefact of incorrect alignments. Accordingly the total error rate (indels + base substitutions) is, at 9.6%, almost identical to the simulated 10%.

 ##Read distribution
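
The per-class error rates in this revision follow from the counts printed by `count-error.pl`. The short Python sketch below redoes that arithmetic with the updated counts from the hunk above; it assumes that `total` is the number of aligned bases inspected by the script, which the page does not state explicitly.

```python
# Counts copied from the count-error.pl output of this revision.
counts = {
    "deletions": 4370823,
    "insertions": 4369478,
    "substitutions": 1655095,
}
total = 107448200  # assumed to be the number of aligned bases inspected


def error_rates(counts, total):
    """Return the per-class error rate as count / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: value / total for name, value in counts.items()}


rates = error_rates(counts, total)
overall = sum(rates.values())  # indels plus base substitutions

for name, rate in rates.items():
    print(f"{name}: {rate:.4f}")
print(f"overall: {overall:.4f}")
```

With these counts both indel rates come out at roughly 0.041 and the overall rate at roughly 0.097, close to the simulated total error rate of 0.1.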