Recent changes to validation_fixed

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 13:48:49 -0000

--- v22
+++ v23
@@ -91,7 +91,7 @@
 ...
 ~~~~~

-### A problem with DOC2 and DOC4
+### A small problem with DOC2 and DOC4

 We simulated a DOC4 insertion with 0% divergence (col 2 in the RepeatMasker output). We realized that DOC2 has a small region with high similarity to DOC4 (23.1% divergence). Hence RepeatMasker annotates overlapping DOC2 and DOC4 insertions:
 ~~~~~~

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 13:48:17 -0000

--- v21
+++ v22
@@ -67,7 +67,6 @@
 ...
 ~~~~~~
 **Note** that we can infer the query sequence from column 5 of the RepeatMasker output. In our case different query sequences refer to different individuals of a population, e.g. hg1 would be the first fly, hg2 the second fly, etc.
-### A problem for two TE families

 ## Manna alignment

@@ -91,6 +90,30 @@
 DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0
 ...
 ~~~~~
+
+### A problem with DOC2 and DOC4
+
+We simulated a DOC4 insertion with 0% divergence (col 2 in the RepeatMasker output). We realized that DOC2 has a small region with high similarity to DOC4 (23.1% divergence). Hence RepeatMasker annotates overlapping DOC2 and DOC4 insertions:
+~~~~~~
+387   23.1  6.5  2.7  hg1        622046  622152 (1066910) + DOC2         Unspecified    269    379 (4410)   93 *
+25999    0.0  0.0  0.0  hg1        622061  624844 (1064218) + DOC4         Unspecified      1   2784    (0)   94  
+~~~~~~
+
+This does not lead to a problem with the Manna alignment. Here is the example for the 5 strains:
+~~~~~~
+DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1
+DOC4    2784.0  0.0     DOC4    2784.0  0.0     DOC4    2784.0  0.0     DOC4    2784.0  0.0     DOC4    2784.0  0.0
+~~~~~~
+
+It would, however, result in overestimating the abundance of DOC2 in the piRNA cluster (or region of interest). In our simple example we would estimate that the simulated DNA sequence contains three DOC2 insertions instead of the two that were simulated. For example:
+~~~~~~
+cat val1.manna|grep "DOC2"
+DOC2   4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0
+DOC2   107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1
+DOC2   4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0
+~~~~~~
+To avoid this problem we may exclude sequences with a high divergence or very short fragments of TEs. In these validations we use a maximum divergence of 5% (0% was simulated) and a minimum length of 100 bp.
+

 ## Results - expected vs observed TE landscape
 To validate our approach we will now compare the expected (the pgd-file) with the observed (manna) TE composition in the population.
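
The filtering introduced in this revision (maximum divergence 5%, minimum length 100 bp) can be illustrated with a minimal sketch. It assumes that a Manna row consists of whitespace-separated (family, length, divergence) triples, one per strain, as in the `grep "DOC2"` example above. The helper names are hypothetical, and requiring every strain to pass both thresholds is an assumption, not necessarily what manna-vs-pgd-mhp.py does. Rows with alignment gaps are not handled.

```python
# Rows copied from the `grep "DOC2"` example; the triple layout is inferred
# from that example and is an assumption about the Manna output format.
MANNA_ROWS = [
    "DOC2   4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0",
    "DOC2   107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1",
    "DOC2   4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0",
]


def parse_manna_row(line):
    """Split one row into (family, length, divergence) triples, one per strain."""
    fields = line.split()
    if len(fields) % 3 != 0:
        raise ValueError(f"expected triples, got {len(fields)} fields")
    return [
        (fields[i], float(fields[i + 1]), float(fields[i + 2]))
        for i in range(0, len(fields), 3)
    ]


def passes_filter(row, min_len=100.0, max_div=5.0):
    """True if every strain's entry is long enough and similar enough (assumed rule)."""
    return all(length >= min_len and div <= max_div for _, length, div in row)


rows = [parse_manna_row(line) for line in MANNA_ROWS]
doc2_rows = [row for row in rows if row[0][0] == "DOC2"]
unfiltered = len(doc2_rows)
filtered = sum(1 for row in doc2_rows if passes_filter(row))
print(unfiltered, filtered)
```

With the example rows, the 107 bp fragment passes the length threshold but fails the divergence threshold (23.1% > 5%), so only the two simulated DOC2 insertions remain.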

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 13:35:41 -0000

--- v20
+++ v21
@@ -49,7 +49,7 @@

 We can use RepeatMasker to identify TE sequences in all 5 sequences at the same time
 ~~~~~~
-RepeatMasker -pa 5 -no_is -s -nolow -dir . -lib teseq.fasta fix.fasta
+RepeatMasker --frag 2000000 -pa 5 -no_is -s -nolow -dir . -lib teseq.fasta fix.fasta
 ~~~~~~

 Only the .out file is of interest to us <https://sourceforge.net/projects/manna/files/validation/val1-fix/val1-fix.fasta.out>
@@ -67,6 +67,7 @@
 ...
 ~~~~~~
 **Note** that we can infer the query sequence from column 5 of the RepeatMasker output. In our case different query sequences refer to different individuals of a population, e.g. hg1 would be the first fly, hg2 the second fly, etc.
+### A problem for two TE families

 ## Manna alignment

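The note on column 5 of the RepeatMasker output can be made concrete with a small sketch that groups hits by query sequence. It uses the two `hg1` records quoted in the v22 change on this page. The column positions follow the whitespace-separated layout of those records, the function name `parse_hit` is hypothetical, and the header lines of a real `.out` file are not handled here.

```python
from collections import defaultdict

# The two records quoted in the v22 change (DOC2 fragment overlapping DOC4).
RM_LINES = [
    "387   23.1  6.5  2.7  hg1        622046  622152 (1066910) + DOC2         Unspecified    269    379 (4410)   93 *",
    "25999    0.0  0.0  0.0  hg1        622061  624844 (1064218) + DOC4         Unspecified      1   2784    (0)   94  ",
]


def parse_hit(line):
    """Extract the columns used in this validation from one RepeatMasker record."""
    f = line.split()
    return {
        "divergence": float(f[1]),  # col 2: percent divergence
        "query": f[4],              # col 5: query sequence, i.e. the individual
        "start": int(f[5]),         # col 6: start position in the query
        "end": int(f[6]),           # col 7: end position in the query
        "family": f[9],             # col 10: matching repeat (TE family)
    }


hits_by_query = defaultdict(list)
for line in RM_LINES:
    hit = parse_hit(line)
    hits_by_query[hit["query"]].append(hit)

for query, hits in hits_by_query.items():
    for h in hits:
        length = h["end"] - h["start"] + 1
        print(query, h["family"], length, h["divergence"])
```

The lengths computed this way (107 and 2784) agree with the lengths shown in the Manna alignment for DOC2 and DOC4.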

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 09:55:29 -0000

--- v19
+++ v20
@@ -1,6 +1,6 @@
 [TOC]
 # Introduction
-Here **we validate the entire pipeline for comparing sequences of piRNA clusters** (or other regions), that is we first simulate sequences (fasta), then annotate repeats (TEs) in these sequences with RepeatMasker, align the repeat-annotation with Manna, and finally compare the observed vs the expected TE landscape.
+Here **we validate the entire pipeline for comparing sequences of piRNA clusters** (or other regions of interest), i.e. we first simulate DNA sequences with known TE insertions, then annotate repeats (TEs) in these sequences with RepeatMasker, align the resulting repeat-annotation with Manna, and finally compare the observed vs the expected TE landscape.

 In more detail:
  1) We first simulate for a particular region (e.g. piRNA cluster 42AB) a population sample of 5 individuals. That is, we will generate 5 fasta sequences with TE insertions. Here we simulate **2 fixed** insertions **for each of 123 TE families** found in Drosophila. These sequences are simulated with our previous tool SimulaTE 
@@ -91,7 +91,7 @@
 ...
 ~~~~~

-## Results expected vs observed TE landscape
+## Results - expected vs observed TE landscape
 To validate our approach we will now compare the expected (the pgd-file) with the observed (manna) TE composition in the population.

 ~~~~~~
@@ -121,7 +121,8 @@
 [[img src=val1.mhp.png width=900]]

 ## Conclusion
-This output file (above) shows that the population frequency of all TE insertions (2 of each family) was correctly estimated (i.e. 5 means fixed with 5 samples) by our approach (RepeatMasking + Manna).
+We simulated 246 fixed TE insertions in a population sample of 5 (expected), i.e. 2 fixed insertions for each of 123 TE families found in Drosophila.
+The output file (above) shows that the population frequency of all TE insertions was correctly estimated by our approach (i.e. RepeatMasking + Manna).
 Furthermore, the order of the TE insertions was also correctly estimated (the Python script manna-vs-pgd-mhp.py would not find the observed values if the order were incorrect).

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 09:51:59 -0000

--- v18
+++ v19
@@ -1,6 +1,7 @@
 [TOC]
 # Introduction
-Here **we validate the entire pipeline for comparing sequences of piRNA clusters** (or other regions), that is we first simulate sequences (fasta), then annotate repeats (TEs) in these sequences with RepeatMasker and then compare the repeat-annotation with Manna.
+Here **we validate the entire pipeline for comparing sequences of piRNA clusters** (or other regions), that is we first simulate sequences (fasta), then annotate repeats (TEs) in these sequences with RepeatMasker, align the repeat-annotation with Manna, and finally compare the observed vs the expected TE landscape.
+
 In more detail:
  1) We first simulate for a particular region (e.g. piRNA cluster 42AB) a population sample of 5 individuals. That is, we will generate 5 fasta sequences with TE insertions. Here we simulate **2 fixed** insertions **for each of 123 TE families** found in Drosophila. These sequences are simulated with our previous tool SimulaTE 
 This validation has four steps <https://sourceforge.net/p/simulates/wiki/Home/>
@@ -9,12 +10,13 @@
  4)  We compare the observed with the expected results. In particular we check if all 246 TE insertions are fixed and if the order of the TE sequences is correct.

-
-# Validation
+# Validation 
 ## Data
 - the chassis, i.e. a raw sequence into which TEs will be inserted by SimulaTE. The chassis has a length of 540 kb. <https://sourceforge.net/projects/manna/files/validation/chasis.txt>
 - the TE sequences <https://sourceforge.net/projects/manna/files/validation/teseq.fasta>
 -  the pgd (population genome definition file) for SimulaTE  <https://sourceforge.net/projects/manna/files/validation/val1-fix/fix.pgd>
+
+In addition to Manna it is also necessary to download SimulaTE.

 ## The population genome definition
  To simulate sequences with TE insertions with SimulaTE we need a pgd (population genome definition file)
@@ -89,10 +91,38 @@
 ...
 ~~~~~

-## expected vs observed
-To validate Manna we can now compare the expected (the pgd file) with the observed (manna) TE composition in the population.
+## Results expected vs observed TE landscape
+To validate our approach we will now compare the expected (the pgd-file) with the observed (manna) TE composition in the population.
+
+~~~~~~
+python ~/dev/manna/validation/manna-vs-pgd-mhp.py --min-len 100 --max-div 5 --manna val1.manna --pgd fix.pgd > val1.mhp
+~~~~~~
+
+An example from the resulting file:
+
+~~~~~~
+...
+50086 DMZAM 5 expected
+50086 DMZAM 5 observed
+50420 DIVER2 5 expected
+50420 DIVER2 5 observed
+...
+~~~~~~
+The file lists the position of the TE (col 1), the family (col 2), the population count of the TE (col 3), and a flag (col 4) indicating whether the count refers to the observed (Manna) or expected (pgd) value.
+
+Finally, this file can be visualized with R:
+
+~~~~~~
+R --vanilla --args val1.mhp < ~/dev/manna/validation/manhatten.R 
+~~~~~~
+
+Here is the output file:

 [[img src=val1.mhp.png width=900]]
+
+## Conclusion
+This output file (above) shows that the population frequency of all TE insertions (2 of each family) was correctly estimated (i.e. 5 means fixed with 5 samples) by our approach (RepeatMasking + Manna).
+Furthermore, the order of the TE insertions was also correctly estimated (the Python script manna-vs-pgd-mhp.py would not find the observed values if the order were incorrect).



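
The expected-vs-observed check described in this revision can also be done programmatically. The following is a minimal sketch, assuming the four-column layout of the val1.mhp excerpt quoted in the diff (position, family, population count, flag). The function names are hypothetical and are not part of Manna.

```python
from collections import defaultdict

# Excerpt of val1.mhp as quoted in the diff, including the elided "..." lines.
MHP_LINES = """\
...
50086 DMZAM 5 expected
50086 DMZAM 5 observed
50420 DIVER2 5 expected
50420 DIVER2 5 observed
...
""".splitlines()


def load_counts(lines):
    """Map (position, family) to {'expected': count, 'observed': count}."""
    counts = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or elided ("...") lines
        pos, family, count, flag = fields
        counts[(int(pos), family)][flag] = int(count)
    return counts


def mismatches(counts):
    """Return insertions whose observed count is missing or differs from the expected one."""
    return sorted(
        key for key, c in counts.items()
        if c.get("expected") != c.get("observed")
    )


counts = load_counts(MHP_LINES)
print(mismatches(counts))
```

An empty list means that every insertion in the file has the same expected and observed population count. An insertion that is only expected, or only observed, is reported as a mismatch.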

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 09:41:27 -0000

--- v17
+++ v18
@@ -1,5 +1,7 @@
 [TOC]
 # Introduction
+Here **we validate the entire pipeline for comparing sequences of piRNA clusters** (or other regions), that is we first simulate sequences (fasta), then annotate repeats (TEs) in these sequences with RepeatMasker and then compare the repeat-annotation with Manna.
+In more detail:
  1) We first simulate for a particular region (e.g. piRNA cluster 42AB) a population sample of 5 individuals. That is, we will generate 5 fasta sequences with TE insertions. Here we simulate **2 fixed** insertions **for each of 123 TE families** found in Drosophila. These sequences are simulated with our previous tool SimulaTE 
 This validation has four steps <https://sourceforge.net/p/simulates/wiki/Home/>
  2)  We run RepeatMasker to annotate the TEs in these 5 sequences.

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 09:38:51 -0000

--- v16
+++ v17
@@ -90,7 +90,7 @@
 ## expected vs observed
 To validate Manna we can now compare the expected (the pgd file) with the observed (manna) TE composition in the population.

-[[img src=val1.mhp.png width=700]]
+[[img src=val1.mhp.png width=900]]

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 09:38:25 -0000

--- v15
+++ v16
@@ -90,7 +90,7 @@
 ## expected vs observed
 To validate Manna we can now compare the expected (the pgd file) with the observed (manna) TE composition in the population.

-[[img src=val1.mhp.png width=100]]
+[[img src=val1.mhp.png width=700]]

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 09:37:56 -0000

--- v14
+++ v15
@@ -90,7 +90,7 @@
 ## expected vs observed
 To validate Manna we can now compare the expected (the pgd file) with the observed (manna) TE composition in the population.

-[[img src=val1.mhp.png]]
+[[img src=val1.mhp.png width=100]]

validation_fixed modified by Robert Kofler

Robert Kofler — Tue, 19 Jan 2021 09:37:01 -0000

--- v13
+++ v14
@@ -90,6 +90,9 @@
 ## expected vs observed
 To validate Manna we can now compare the expected (the pgd file) with the observed (manna) TE composition in the population.

+[[img src=val1.mhp.png]]



+
+