Recent changes to Validation_Pop2

Validation_Pop2 modified by Robert Kofler

Robert Kofler — Mon, 09 Oct 2017 08:42:35 -0000

--- v28
+++ v29
@@ -13,7 +13,7 @@

  * chassis: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
  * TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/teseq-clean-ml100noS4.fasta/download
- * TE hierarchy; required for PoPoolationTE2 (assigns every fasta entry of a TE to a family) https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.fasta/download
+ * TE hierarchy; required for PoPoolationTE2 (assigns every fasta entry of a TE to a family) https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.txt/download

 # Validation of SimulaTE 
@@ -73,12 +73,12 @@
 *Identifying TEs with PoPoolationTE2*

 ~~~~~~
-java -jar popte2.jar ppileup --bam reads.sort.bam --map-qual 15 --hier tehier-ml100noS4.fasta --output pp.gz
+java -jar popte2.jar ppileup --bam reads.sort.bam --map-qual 15 --hier tehier-ml100noS4.txt --output pp.gz
 java -jar popte2.jar identifySignatures --ppileup pp.gz --mode separate --min-count 2 --output te.signatures
-java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar updatestrand --bam reads.sort.bam --signature te.signatures --output testrand.signatures --hier tehier-ml100noS4.fasta --map-qual 15 --max-disagreement 0.4
+java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar updatestrand --bam reads.sort.bam --signature te.signatures --output testrand.signatures --hier tehier-ml100noS4.txt --map-qual 15 --max-disagreement 0.4
 java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar frequency --ppileup pp.gz --signature testrand.signatures --output te.freqsignatures
 java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar filterSignatures --input te.freqsignatures --output tefiltered.freqsignatures --max-otherte-count 2 --max-structvar-count 2
-java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar pairupsignatures --signature tefiltered.freqsignatures --ref-genome chasis_te.fasta --hier tehier-ml100noS4.fasta --min-distance -200 --max-distance 300 --output tes.filtered.txt
+java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar pairupsignatures --signature tefiltered.freqsignatures --ref-genome chasis_te.fasta --hier tehier-ml100noS4.txt --min-distance -200 --max-distance 300 --output tes.filtered.txt
 ~~~~~~

 ## Expected TE landscape
@@ -92,7 +92,7 @@
 compute the expected TE landscape

 ~~~~~
-python statistics_TEfrequencies-pooled.py --pgd landscape01-09.pgd --hier tehier-ml100noS4.fasta > expected.txt
+python statistics_TEfrequencies-pooled.py --pgd landscape01-09.pgd --hier tehier-ml100noS4.txt > expected.txt
 ~~~~~

 the first ten lines of *expected.txt*

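A minimal sketch of how *expected.txt* could be read for downstream checks, assuming the four whitespace-separated fields described on the page (position, family, strand, population frequency). The names `Insertion` and `parse_expected` are hypothetical helpers and not part of SimulaTE; the two sample rows are taken from the output shown on the page.

```python
from typing import NamedTuple


class Insertion(NamedTuple):
    """One row of expected.txt (hypothetical container, not part of SimulaTE)."""
    position: int
    family: str
    strand: str
    frequency: float


def parse_expected(lines):
    """Parse whitespace-separated rows: position, family, strand, frequency."""
    insertions = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            # Skip blank or malformed rows instead of guessing their meaning.
            continue
        position, family, strand, frequency = fields
        insertions.append(
            Insertion(int(position), family, strand, float(frequency))
        )
    return insertions


# Two rows copied from the expected.txt excerpt on the page.
sample = [
    "856    HMS-Beagle  +   0.46",
    "8213   mdg3    -   0.85",
]
insertions = parse_expected(sample)
for ins in insertions:
    print(ins.position, ins.family, ins.strand, ins.frequency)
```

To use it on the real file, pass an open file handle, for example `parse_expected(open("expected.txt"))`.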
Validation_Pop2 modified by Robert Kofler

Robert Kofler — Tue, 25 Jul 2017 13:22:39 -0000

--- v27
+++ v28
@@ -4,22 +4,21 @@

 Here, we validate SimulaTE. We generate a complex TE landscape with SimulaTE and test whether the observed TE landscape (using PoPoolationTE2) agrees with the simulated one.

+We simulate a TE landscape for a population of N=100 (haploid individuals) and a genome of size 1 Mb. The simulated TE insertions have random i) position, ii) strand, iii) family and iv) population frequency (within a range of 0.1 to 0.9).

-We simulate a TE landscape for a population of N=100 (haploids) and a genome of size 1MB. The simulated TE insertions have random i) position ii) strand iii) family and iv)  population frequency (within a range of 0.1 and 0.9).
-
-Subsequently, we simulate paired end reads with a read length of 100 and an inner distance of about 100.  Next we identify TE insertions using PoPoolation TE2 and the simulated reads. Finally we ask whether the simulated TE insertions (expected) agree with the TEs identified by PoPoolationTE2 (observed).
+Subsequently, we simulate Illumina paired-end reads with a read length of 100 and an inner distance of about 100. Next, we identify TE insertions using PoPoolationTE2 and the simulated reads. Finally, we ask whether the simulated TE insertions (expected) agree with the TEs identified by PoPoolationTE2 (observed).

 ## Prerequisites
 Download the following three files:

- * chasis: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
+ * chassis: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
  * TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/teseq-clean-ml100noS4.fasta/download
- * TE hierarchy; required for PoPoolationTE2 (assigns every fasta entry of a TE to a  family) https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.fasta/download
+ * TE hierarchy; required for PoPoolationTE2 (assigns every fasta entry of a TE to a family) https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.fasta/download

 # Validation of SimulaTE 
 ## Generate a pgd-file 
-First we generate a pgd-file (population genome description; see manual). The resulting pgd-file specifies a TE landscape with 1000 TE insertions having random frequency (between 0.1 and 0.9), position and strand. 
+First we generate a pgd-file (population genome description; see manual). The resulting pgd-file specifies a TE landscape with 1000 TE insertions having random family, population frequency (between 0.1 and 0.9), position and strand. 

 ~~~~~~
 python define-landscape_random-insertions-freq-range.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --N 100 --insert-count 1000 --min-distance 700 --min-freq 0.1 --max-freq 0.9 --output landscape01-09.pgd
@@ -29,7 +28,7 @@

 ## Create the population genome

-Based on the pgd file we can now create the population genome (see manual):
+Based on the pgd file we create the population genome:

 ~~~~~
 python build-population-genome.py --pgd landscape01-09.pgd --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --output landscape01-09.pg   
@@ -58,6 +57,8 @@

 ## Observed TE landscape (with PoPoolationTE2)
 PoPoolationTE2 can be obtained here https://sourceforge.net/projects/popoolation-te2/
+
+For details on TE identification with PoPoolationTE2 see: https://sourceforge.net/p/popoolation-te2/wiki/Home/

 *Mapping*
 ~~~~~~
@@ -109,7 +110,7 @@
 8213   mdg3    -   0.85
 ~~~~~

-The four fields from left to right are: position,  family, strand  and population frequency of the TE insertion
+The four fields from left to right are: position, family, strand and population frequency of the TE insertion

 ## compare the expected and the observed TE landscape
 Download
@@ -127,7 +128,7 @@
 # Results
 Either use your *results-exp-vs-obs.txt* or download our results https://sourceforge.net/projects/simulates/files/validation_pop2/results-exp-vs-obs.txt/download

-A head of *results-exp-vs-obs.txt* gives
+*head results-exp-vs-obs.txt* gives:

 ~~~~~~
 FOUND  HMS-Beagle  856 844 +   +   0.46    0.53    12.0    0.07    FR
@@ -142,10 +143,12 @@
 FOUND  mdg3    8213    8207    -   -   0.85    0.874   6.0 0.024   FR
 ~~~~~~

-The fields are from left to right: found/missed/wrong, family, expected position, observed position, expected strand, observed strand, expected population frequency, observed population frequency, delta position (position difference), delta pop. freq., FR/R/F
+The fields from left to right are: found/missed/wrong, family, expected position, observed position, expected strand, observed strand, expected population frequency, observed population frequency, delta position (position difference), delta pop. freq., FR/R/F

 ## Summary
+
+Here we compute a summary of the results.

 Download https://sourceforge.net/projects/simulates/files/validation_pop2/summary-expobs.py/download

@@ -162,16 +165,17 @@
 ~~~~~

 This tells us that:
- * All 1000 simulated TE insertions where found (correct position +-200bp and family)
- * one additional TE insertion was found (*NOF*; at low frequency)
- * for 999 TE insertions the strand was correctly estimated
- * the position was estimated with an average accuracy of 8bp and a standard deviation of 7bp
- * the population frequency was estimated with an accuracy of 3% and a standard deviation of 2.3%

-**Note** that the slight deviations from the expected value are most likely the outcome of i) paired-end sampling variation and ii) PoPoolationTE2 (identifiying TE insertions from Pool-Seq data is a challenge).
+* All 1000 simulated TE insertions were identified (correct family and position within ±200 bp)
+* one additional TE insertion was found (*NOF*; at low frequency)
+* for 999 TE insertions the strand was correctly estimated
+* the estimated TE position deviates from the expected position by 8 bp on average, with a standard deviation of 7 bp
+* the estimated population frequency deviates from the expected frequency by 3% on average, with a standard deviation of 2.3%
+
+**Note** that the slight deviations from the expected values are most likely the outcome of i) paired-end sampling variation and ii) inaccuracies of the approach used for TE identification (identifying TE insertions from Pool-Seq data is a challenge).

 ## Frequency distribution
-In the following graph we show the correlation between the expected (simulated with SimulaTE) and the observed  population frequency (estimated with PoPooaltionTE2) of the TE insertions.
+In the following graph we show the correlation between the expected (simulated) and the observed population frequency (estimated with PoPoolationTE2) of the TE insertions.

 [[img src=cor.png]]
 #Conclusion:

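The summary figures listed in this revision can be approximated with a short script. The following is only a sketch and not the author's *summary-expobs.py*: it assumes the eleven whitespace-separated fields described for *results-exp-vs-obs.txt* (status, family, expected and observed position, expected and observed strand, expected and observed frequency, delta position, delta frequency, FR/R/F) and only evaluates rows whose status is `FOUND`. The function name `summarize` is hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize(lines):
    """Summarize rows of results-exp-vs-obs.txt (assumed 11-field layout)."""
    status_counts = Counter()
    delta_pos, delta_freq = [], []
    strand_ok = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        status_counts[fields[0]] += 1
        # Only FOUND rows are assumed to carry usable values in every field.
        if fields[0] != "FOUND" or len(fields) < 11:
            continue
        # Fields 4 and 5 hold the expected and the observed strand.
        strand_ok += fields[4] == fields[5]
        # Fields 8 and 9 hold the position and frequency differences.
        delta_pos.append(abs(float(fields[8])))
        delta_freq.append(abs(float(fields[9])))
    return {
        "status_counts": dict(status_counts),
        "strand_correct": strand_ok,
        "mean_delta_pos": mean(delta_pos) if delta_pos else None,
        "mean_delta_freq": mean(delta_freq) if delta_freq else None,
    }


# Two rows copied from the results excerpt on the page.
sample = [
    "FOUND  HMS-Beagle  856 844 +   +   0.46    0.53    12.0    0.07    FR",
    "FOUND  mdg3    8213    8207    -   -   0.85    0.874   6.0 0.024   FR",
]
summary = summarize(sample)
print(summary)
```

For the full file, `summarize(open("results-exp-vs-obs.txt"))` would return the same dictionary computed over all rows; rows with other status values are counted but excluded from the averages.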
Validation_Pop2 modified by Robert Kofler

Robert Kofler — Thu, 06 Jul 2017 11:21:45 -0000

--- v26
+++ v27
@@ -2,13 +2,12 @@

 #Introduction#

-Here, we validate SimulaTE by first generating a complex TE landscape with SimulaTE and than testing wheter the TE landscape identified with a different tool agrees with the simulated one.
- Basically we are comparing the expected with the observed TE landscape.
+Here, we validate SimulaTE. We generate a complex TE landscape with SimulaTE and test whether the observed TE landscape (using PoPoolationTE2) agrees with the simulated one.

-We simulated a TE landscape for a population of N=100 (haploids) and a genome of size 1MB. We simulated TE insertions having random i) position ii) strand iii) family and iv)  population frequency (within a range of 0.1 and 0.9).
-For this TE landscape 
-Next, we simulated paired end reads with a read length of 100 and an inner distance of about 100.  We identify TE insertions using PoPoolation TE2 and the simulated reads. Finally we ask whether the simulated TEs (expected) agree with the ones identified by PoPoolationTE2 (observed).
+We simulate a TE landscape for a population of N=100 (haploids) and a genome of size 1MB. The simulated TE insertions have random i) position ii) strand iii) family and iv)  population frequency (within a range of 0.1 and 0.9).
+
+Subsequently, we simulate paired end reads with a read length of 100 and an inner distance of about 100.  Next we identify TE insertions using PoPoolation TE2 and the simulated reads. Finally we ask whether the simulated TE insertions (expected) agree with the TEs identified by PoPoolationTE2 (observed).

 ## Prerequisites
 Download the following two files:

Validation_Pop2 modified by Robert Kofler

Robert Kofler — Wed, 28 Jun 2017 07:55:49 -0000

--- v25
+++ v26
@@ -2,9 +2,13 @@

 #Introduction#

-We validate SimulaTE by testing wheter we can identify the simulated TEs. 
-We simulate a TE landscape for a population of N=100 (haploids) and a genome of size 1MB. We simulate TE insertions having i) random family and ii) random population frequency, between 0.1 and 0.9.
-Next we simulate paired end reads with a read length of 100 and an inner distance of about 100.  We identify TE insertions using PoPoolation TE2 and the simulated reads. Finally we ask whether the simulated TEs (expected) agree with the ones identified by PoPoolationTE2 (observed).
+Here, we validate SimulaTE by first generating a complex TE landscape with SimulaTE and than testing wheter the TE landscape identified with a different tool agrees with the simulated one.
+ Basically we are comparing the expected with the observed TE landscape.
+
+
+We simulated a TE landscape for a population of N=100 (haploids) and a genome of size 1MB. We simulated TE insertions having random i) position ii) strand iii) family and iv)  population frequency (within a range of 0.1 and 0.9).
+For this TE landscape 
+Next, we simulated paired end reads with a read length of 100 and an inner distance of about 100.  We identify TE insertions using PoPoolation TE2 and the simulated reads. Finally we ask whether the simulated TEs (expected) agree with the ones identified by PoPoolationTE2 (observed).

 ## Prerequisites
 Download the following two files:

Validation_Pop2 modified by Robert Kofler

Robert Kofler — Tue, 27 Jun 2017 13:21:11 -0000

--- v24
+++ v25
@@ -106,7 +106,7 @@
 8213   mdg3    -   0.85
 ~~~~~

-The four fields from left to right are position  family, strand  and population frequency of the TE insertion
+The four fields from left to right are: position,  family, strand  and population frequency of the TE insertion

 ## compare the expected and the observed TE landscape
 Download

Validation_Pop2 modified by Robert Kofler

Robert Kofler — Tue, 27 Jun 2017 13:19:44 -0000

--- v23
+++ v24
@@ -78,7 +78,7 @@
 ~~~~~~

 ## Expected TE landscape
-In order to compare the expected and the observed TE landscape we first summarize the pgd-file. We are only interested in the position, family, strand and population frequency of TEs. 
+In order to compare the expected and the observed TE landscape we require a summary of the pgd-file.

 Download 

@@ -91,7 +91,7 @@
 python statistics_TEfrequencies-pooled.py --pgd landscape01-09.pgd --hier tehier-ml100noS4.fasta > expected.txt
 ~~~~~

-the first ten lines of*expected.txt*
+the first ten lines of *expected.txt*

 ~~~~~
 856    HMS-Beagle  +   0.46
@@ -106,7 +106,7 @@
 8213   mdg3    -   0.85
 ~~~~~

-The four columns are position  family, strand  and population frequency of the TE insertion
+The four fields from left to right are position  family, strand  and population frequency of the TE insertion

 ## compare the expected and the observed TE landscape
 Download

Validation_Pop2 modified by Robert Kofler

Robert Kofler — Tue, 27 Jun 2017 13:16:56 -0000

--- v22
+++ v23
@@ -78,6 +78,7 @@
 ~~~~~~

 ## Expected TE landscape
+In order to compare the expected and the observed TE landscape we first summarize the pgd-file. We are only interested in the position, family, strand and population frequency of TEs. 

 Download

Validation_Pop2 modified by Robert Kofler

Robert Kofler — Tue, 27 Jun 2017 13:06:57 -0000

Validation_Pop2 modified by Robert Kofler

Robert Kofler — Tue, 27 Jun 2017 13:04:47 -0000

--- v20
+++ v21
@@ -167,13 +167,13 @@
 **Note** that the slight deviations from the expected value are most likely the outcome of i) paired-end sampling variation and ii) PoPoolationTE2 (identifiying TE insertions from Pool-Seq data is a challenge).

 ## Frequency distribution
-In the following graph we show the correlation between the expected (simulated with SimulaTE) and the observed (estimated with PoPooaltionTE2) population frequency of the TE insertions.
+In the following graph we show the correlation between the expected (simulated with SimulaTE) and the observed  population frequency (estimated with PoPooaltionTE2) of the TE insertions.

 [[img src=cor.png]]
 #Conclusion:
-We demonstrated that SimulaTE accurately simulates a complex TE landscape with 1000 random TE insertions.
+We demonstrated that SimulaTE accurately simulates complex TE landscapes having, for example, 1000 random TE insertions.

-Specificiall we showed that the following key properties are accurately simulated
+Specifically, we showed that the following key properties are accurately simulated

 * positions of TE insertions
 * strand of TE insertions

Validation_Pop2 modified by Robert Kofler

Robert Kofler — Tue, 27 Jun 2017 13:02:41 -0000