Recent changes to Simulation tests

Simulation tests modified by Florian Schwarz

Florian Schwarz — Tue, 14 Dec 2021 09:27:01 -0000

--- v5
+++ v6
@@ -85,5 +85,6 @@

 Actual HGDP Datasets
 ---
-Third step of simulation studies: Using a few actual datasets
-ADD TO SOURCEFORGE REPOSITORY
+To further test the reliability of the set of SCGs, we scanned the results of three HGDP datasets before starting the actual pipeline.
+These files were produced in the same way as files for the full HGDP pipeline (see section 'Analysis Pipeline').
+The three files used for this analysis are stored in the folder 'realdatatest'.

Simulation tests modified by Florian Schwarz

Florian Schwarz — Thu, 09 Dec 2021 10:59:09 -0000

--- v4
+++ v5
@@ -30,7 +30,7 @@
 For this purpose we use SimulaTE.
 See the SimulaTE walkthrough (https://sourceforge.net/p/simulates/wiki/Home/) for an in-detail explanation of the software.
 In this application, a 4 MB chassis sequence was used (FILENAME, WHERE PROVIDED?).
-In this chassis, we randomly insert 1 copy of each SCG and KRAB-ZNF and 5 copies of each TE from the reference sequence library using a modified version of the ‘define-landscape’ script  (‘define-landscape-flo’).
+In this chassis, we randomly insert 1 copy of each SCG and KRAB-ZNF and 5 copies of each TE from the reference sequence library using a modified version of the ‘define-landscape’ script (‘define-landscape-flo’), which is available in the SimulaTE SourceForge repository (https://sourceforge.net/p/simulates/code/HEAD/tree/tmp/).

 Commands:
 Define landscape:
@@ -45,7 +45,7 @@
 ~~~
 python2.7 create-reads-for-human.py --fasta genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1.pg --coverage 35 --read-length 150 --method uniform --error-rate 0 --output genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1_cov35_rl150_uniform_error0.fastq
 ~~~
-The script 'create-reads-for-human.py' is modified from its original simulaTE version and is available in the svn repository of this project in the folder 'scripts'.
+The script 'create-reads-for-human.py' is modified from its original simulaTE version and is available in the svn repository of the Human TE project (i.e. this project) in the folder 'scripts'.

 Map artificial reads to reference library:
 ~~~
@@ -68,6 +68,20 @@
 In these simulations, we no longer use an artificial genome with a known number of TE, SCG and Krab insertions to create the artificial reads. Instead, we use the actual human genome with its respective number of insertions. The main focus here was to identify SCGs that are not consistently estimated with a copy number of ~1.
 All artificial reads were created solely with a coverage of 5x as the previous simulations have revealed no influences of coverage fluctuations on copy number estimates.

+Create artificial reads from human reference genome (for this example, we use the default parameters for this analysis, i.e. 5x coverage, 150bp read length, uniform sampling distribution and error rate of 0):
+~~~
+python2.7 human-te-dynamics-svn/scripts/create-reads-for-human.py --fasta human-data/simulation-tests/humangenome-artificialreads/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --coverage 5 --read-length 50 --method uniform --output hg38_cov5x_rl50_noerror_uniform.fastq&
+~~~
+Map to reference library:
+~~~
+bwa mem -t 20 human-te-dynamics-svn/refg/reflibrary_humans_v6.2.fasta hg38_cov5x_rl50_noerror_uniform.fastq > map_hg38_cov5x_rl50_noerror_uniform_to_reflib_v6.2.sam&
+~~~
+Calculate average copy number, i.e. create the mapstat file:
+~~~
+python3 human-te-dynamics-svn/scripts/humante-mapstat-weight.py --sam map_hg38_cov5x_rl50_noerror_uniform_to_reflib_v6.2.sam --min-mq 10 --fai human-te-dynamics-svn/refg/reflibrary_humans_v6.2.fasta.fai > mapstat_hg38_cov5x_rl50_noerror_uniform_to_reflib_v6.2_weighted.txt&
+~~~
+
+To see how we analyzed the data from the simulation, see R markdown documentation available in the 'simulation-tests' directory.

 Actual HGDP Datasets
 ---

Simulation tests modified by Florian Schwarz

Florian Schwarz — Tue, 07 Dec 2021 13:01:07 -0000

--- v3
+++ v4
@@ -10,20 +10,21 @@

 The idea of these simulations was to test which parameters of the sequencing input data could influence abundance estimates when using the established reference library.
 The simulations were also used iteratively to identify SCG sequences whose coverage estimates are vulnerable to fluctuations caused by changes in sequencing data parameters.
+Relevant files are found in the folder 'simulation-tests' in the data repository.

 Artificial Genome
 ----
 The first step of the simulations aims to test the accuracy of copy number estimation for all sequences by using a simulated genome into which SCG, TE and KRAB sequences are inserted with a fixed copy number.
 Thus, we have a ‘true’ value which can be compared to the copy number estimate to infer the quality of estimation for sequences and identify sequences vulnerable to fluctuations due to differences in the input data.

-We fluctuate the following parameter (default parameters are shown in bold):
+For creating the artificial reads, we vary the following parameters (default values are shown in bold):
 --coverage: 5, 35,75
 --read-length: 50, 100, 150, 200
 --method: uniform, random
 --error-rate 0, 0.01,0.05

-THESE PARAMETERS MEAN THE FOLLOWING:
-
+A uniform sampling method means that each position in the genome has equal coverage, while random introduces a random effect and thus different positions in the genome can have a different coverage.
+An error rate of 0.01 implies that 1 out of every 100 bases is randomly mutated.

 First, we need to create the simulated genome.
 For this purpose we use SimulaTE.
@@ -44,20 +45,22 @@
 ~~~
 python2.7 create-reads-for-human.py --fasta genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1.pg --coverage 35 --read-length 150 --method uniform --error-rate 0 --output genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1_cov35_rl150_uniform_error0.fastq
 ~~~
+The script 'create-reads-for-human.py' is modified from its original simulaTE version and is available in the svn repository of this project in the folder 'scripts'.
+
 Map artificial reads to reference library:
 ~~~
 bwa mem -t 20 reflibrary_humans_v6.2.fasta genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1_cov35_rl150_uniform_error0.fastq >map_cov35_rl150_uniform_error0_to_reflib_v6.2.sam
 ~~~
 The majority of the artificial reads and mapping files are omitted from the sourceforge repository due to their size. However, one of each is included for demonstration and troubleshooting purposes.

-
 Make mapstat file:
 ~~~
-python3 human-te-dynamics-svn/humante-mapstat-weight.py --min-mq 10 --fai reflibrary_humans_v6.2.fasta.fai --sam map_cov35_rl150_uniform_error0_to_reflib_v6.2.sam >mapstat_cov35_rl150_uniform_error0_to_reflib_v6.2_weighted.txt
+python3 humante-mapstat-weight.py --min-mq 10 --fai reflibrary_humans_v6.2.fasta.fai --sam map_cov35_rl150_uniform_error0_to_reflib_v6.2.sam >mapstat_cov35_rl150_uniform_error0_to_reflib_v6.2_weighted.txt
 ~~~

+This script is incorporated in the full analysis pipeline and is explained in more detail in the 'Analysis pipeline' section of this Wiki.

-To see how we analyzed the data from the simulation, see R markdown documentation (PATH).
+To see how we analyzed the data from the simulation, see R markdown documentation available in the 'simulation-tests' directory.


 Human reference genome

Simulation tests modified by Florian Schwarz

Florian Schwarz — Tue, 07 Dec 2021 10:51:12 -0000

--- v2
+++ v3
@@ -16,11 +16,11 @@
 The first step of the simulations aims to test the accuracy of copy number estimation for all sequences by using a simulated genome into which SCG, TE and KRAB sequences are inserted with a fixed copy number.
 Thus, we have a ‘true’ value which can be compared to the copy number estimate to infer the quality of estimation for sequences and identify sequences vulnerable to fluctuations due to differences in the input data.

-We fluctuate the following parameter (default parameters are shown in *bold*):
---coverage: 5, *35*,75
---read-length: 50, 100, *150*, 200
---method: *uniform*, random
---error-rate *0*, 0.01,0.05
+We fluctuate the following parameter (default parameters are shown in bold):
+--coverage: 5, 35,75
+--read-length: 50, 100, 150, 200
+--method: uniform, random
+--error-rate 0, 0.01,0.05

 THESE PARAMETERS MEAN THE FOLLOWING:

Simulation tests modified by Florian Schwarz

Florian Schwarz — Tue, 07 Dec 2021 10:50:11 -0000

--- v1
+++ v2
@@ -16,11 +16,11 @@
 The first step of the simulations aims to test the accuracy of copy number estimation for all sequences by using a simulated genome into which SCG, TE and KRAB sequences are inserted with a fixed copy number.
 Thus, we have a ‘true’ value which can be compared to the copy number estimate to infer the quality of estimation for sequences and identify sequences vulnerable to fluctuations due to differences in the input data.

-We fluctuate the following parameter (default parameters are marked with (d)):
---coverage: 5, 35(d),75
---read-length: 50, 100, 150(d), 200
---method: uniform(d), random
---error-rate 0(d), 0.01,0.05
+We fluctuate the following parameter (default parameters are shown in *bold*):
+--coverage: 5, *35*,75
+--read-length: 50, 100, *150*, 200
+--method: *uniform*, random
+--error-rate *0*, 0.01,0.05

 THESE PARAMETERS MEAN THE FOLLOWING:

Simulation tests modified by Florian Schwarz

Florian Schwarz — Tue, 07 Dec 2021 10:48:19 -0000

Simulation tests

Requirements:
SimulaTE
bwa
samtools
Python2
Python3

The idea of these simulations was to test which parameters of the sequencing input data could influence abundance estimates when using the established reference library.
The simulations were also used iteratively to identify SCG sequences whose coverage estimates are vulnerable to fluctuations caused by changes in sequencing data parameters.

Artificial Genome

The first step of the simulations aims to test the accuracy of copy number estimation for all sequences by using a simulated genome into which SCG, TE and KRAB sequences are inserted with a fixed copy number.
Thus, we have a ‘true’ value which can be compared to the copy number estimate to infer the quality of estimation for sequences and identify sequences vulnerable to fluctuations due to differences in the input data.
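
The comparison between the 'true' value and the estimate can be sketched as a relative error per sequence. This is only a minimal illustration: the expected copy numbers (5 per TE, 1 per SCG and KRAB-ZNF) come from the text, while the sequence names, the class labels and the example estimates are hypothetical.

```python
# Expected ("true") copy numbers in the simulated genome, by sequence class.
EXPECTED = {"te": 5.0, "scg": 1.0, "krab": 1.0}


def relative_error(estimate, expected):
    """Signed deviation of an estimate from the expected copy number (as a fraction)."""
    return (estimate - expected) / expected


def summarize(estimates):
    """Map each sequence name to its relative error.

    estimates: list of (name, sequence_class, estimated_copy_number) tuples.
    """
    return {name: relative_error(est, EXPECTED[cls]) for name, cls, est in estimates}


# Made-up estimates, for illustration only.
example = [
    ("te_hypothetical", "te", 5.5),
    ("scg_hypothetical", "scg", 0.9),
    ("krab_hypothetical", "krab", 1.0),
]
errors = summarize(example)
```

A value close to 0 indicates a well-estimated sequence; large positive or negative values point to sequences that are sensitive to the input data.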

We vary the following parameters (default values are marked with (d)):
--coverage: 5, 35(d), 75
--read-length: 50, 100, 150(d), 200
--method: uniform(d), random
--error-rate: 0(d), 0.01, 0.05
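
The parameter settings listed above can be enumerated programmatically. The sketch below assumes that one parameter is changed at a time while the others stay at their defaults; the text does not state whether this or a full factorial design was used. The output file name pattern and the genome file name are hypothetical.

```python
DEFAULTS = {"coverage": 35, "read_length": 150, "method": "uniform", "error_rate": 0}
VALUES = {
    "coverage": [5, 35, 75],
    "read_length": [50, 100, 150, 200],
    "method": ["uniform", "random"],
    "error_rate": [0, 0.01, 0.05],
}


def one_at_a_time(defaults, values):
    """Yield parameter sets that differ from the defaults in at most one parameter."""
    seen = set()
    for name, options in values.items():
        for option in options:
            setting = dict(defaults, **{name: option})
            key = tuple(sorted(setting.items()))
            if key not in seen:  # the all-defaults set would otherwise repeat
                seen.add(key)
                yield setting


def read_command(setting, genome="genome.pg"):
    """Build the create-reads-for-human.py call for one setting (hypothetical file names)."""
    out = "reads_cov{coverage}_rl{read_length}_{method}_error{error_rate}.fastq".format(**setting)
    return [
        "python2.7", "create-reads-for-human.py",
        "--fasta", genome,
        "--coverage", str(setting["coverage"]),
        "--read-length", str(setting["read_length"]),
        "--method", setting["method"],
        "--error-rate", str(setting["error_rate"]),
        "--output", out,
    ]


settings = list(one_at_a_time(DEFAULTS, VALUES))
commands = [read_command(s) for s in settings]
```

Under this assumption there are 9 distinct settings; a full factorial design would instead give 3 x 4 x 2 x 3 = 72 combinations.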

THESE PARAMETERS MEAN THE FOLLOWING:

First, we need to create the simulated genome.
For this purpose we use SimulaTE.
See the SimulaTE walkthrough (https://sourceforge.net/p/simulates/wiki/Home/) for a detailed explanation of the software.
In this application, a 4 MB chassis sequence was used (FILENAME, WHERE PROVIDED?).
In this chassis, we randomly insert 1 copy of each SCG and KRAB-ZNF and 5 copies of each TE from the reference sequence library using a modified version of the ‘define-landscape’ script (‘define-landscape-flo’).

Commands:
Define landscape:

python2.7 define-landscape-flo.py --chassis chassis_4MB.fasta --te-seqs reflibrary_humans_v6.2.fasta --output genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1.pgd --count-te 5 --count-krab 1

Build genome:

python2.7 build-population-genome.py --pgd genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1.pgd --chassis chassis_4MB.fasta --te-seqs reflibrary_humans_v6.2.fasta --output genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1.pg

Create artificial reads:

python2.7 create-reads-for-human.py --fasta genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1.pg --coverage 35 --read-length 150 --method uniform --error-rate 0 --output genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1_cov35_rl150_uniform_error0.fastq
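
How coverage, read length and sampling method relate to the generated reads can be illustrated with a short sketch. It is not the code of 'create-reads-for-human.py'; it assumes that 'uniform' means evenly spaced read starts and 'random' means randomly drawn read starts, and all function names are hypothetical.

```python
import random


def number_of_reads(genome_length, coverage, read_length):
    """Reads needed so that reads * read_length is about coverage * genome_length."""
    return (genome_length * coverage) // read_length


def uniform_starts(genome_length, n_reads, read_length):
    """Evenly spaced read start positions across the genome."""
    last_start = genome_length - read_length
    if n_reads == 1:
        return [0]
    return [round(i * last_start / (n_reads - 1)) for i in range(n_reads)]


def random_starts(genome_length, n_reads, read_length, seed=0):
    """Randomly drawn read start positions; local coverage varies by chance."""
    rng = random.Random(seed)
    return [rng.randint(0, genome_length - read_length) for _ in range(n_reads)]


# A 4 Mb sequence at 35x coverage with 150 bp reads (chassis only, ignoring insertions).
n = number_of_reads(4000000, 35, 150)
```

For the chassis alone this gives 933333 reads; the simulated genome is somewhat longer because of the inserted sequences, so the real number is higher.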

Map artificial reads to reference library:

bwa mem -t 20 reflibrary_humans_v6.2.fasta genome_simulate_4MBchassis_reflib_v6.2_countte5_countkrab1_cov35_rl150_uniform_error0.fastq >map_cov35_rl150_uniform_error0_to_reflib_v6.2.sam

The majority of the artificial reads and mapping files are omitted from the SourceForge repository due to their size. However, one of each is included for demonstration and troubleshooting purposes.

Make mapstat file:

python3 human-te-dynamics-svn/humante-mapstat-weight.py --min-mq 10 --fai reflibrary_humans_v6.2.fasta.fai --sam map_cov35_rl150_uniform_error0_to_reflib_v6.2.sam >mapstat_cov35_rl150_uniform_error0_to_reflib_v6.2_weighted.txt
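
The general idea behind a copy number estimate from a SAM file can be sketched as follows. This is not the implementation of 'humante-mapstat-weight.py', whose weighting scheme is not described here. The sketch assumes that coverage per reference sequence is the number of mapped bases divided by the sequence length, and that copy numbers are obtained by normalizing to the mean coverage of the SCGs; function and sequence names are hypothetical.

```python
from collections import defaultdict


def mean_coverage(sam_lines, lengths, min_mq=10):
    """Mean coverage per reference sequence from SAM records.

    lengths: dict reference name -> sequence length (columns 1 and 2 of a .fai file).
    Reads that are unmapped or have a mapping quality below min_mq are skipped.
    """
    bases = defaultdict(int)
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, ref, mapq, seq = int(fields[1]), fields[2], int(fields[4]), fields[9]
        if flag & 4 or mapq < min_mq:
            continue
        bases[ref] += len(seq)  # approximation: ignores clipping and indels
    return {ref: bases[ref] / lengths[ref] for ref in lengths}


def copy_numbers(coverage, scg_names):
    """Normalize coverage by the mean coverage of the single copy genes."""
    scg_mean = sum(coverage[name] for name in scg_names) / len(scg_names)
    return {ref: cov / scg_mean for ref, cov in coverage.items()}


def sam_record(ref, mapq, length=50):
    """Build a minimal SAM record for the example below."""
    return "\t".join(["read", "0", ref, "1", str(mapq), "%dM" % length,
                      "*", "0", "0", "A" * length, "I" * length])


lengths = {"scg_hypothetical": 100, "te_hypothetical": 100}
sam = ["@HD\tVN:1.6"]
sam += [sam_record("scg_hypothetical", 60) for _ in range(2)]
sam += [sam_record("te_hypothetical", 60) for _ in range(10)]
sam += [sam_record("te_hypothetical", 0)]  # filtered out by min_mq
estimates = copy_numbers(mean_coverage(sam, lengths), ["scg_hypothetical"])
```

In this toy example the TE receives five times the coverage of the SCG, so its estimated copy number is 5.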

To see how we analyzed the data from the simulation, see R markdown documentation (PATH).

Human reference genome

In these simulations, we no longer use an artificial genome with a known number of TE, SCG and KRAB insertions to create the artificial reads. Instead, we use the actual human genome with its respective number of insertions. The main focus here was to identify SCGs that are not consistently estimated with a copy number of ~1.
All artificial reads were created with a coverage of 5x only, as the previous simulations showed no influence of coverage on copy number estimates.
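
Identifying SCGs that are not consistently estimated at ~1 can be sketched as a simple filter over several simulation runs. The tolerance of 0.2, the run names and the example values are hypothetical and only serve as an illustration.

```python
def inconsistent_scgs(runs, tolerance=0.2):
    """Return SCGs whose estimate deviates from 1 by more than tolerance in any run.

    runs: dict run name -> dict SCG name -> estimated copy number.
    """
    flagged = set()
    for estimates in runs.values():
        for name, value in estimates.items():
            if abs(value - 1.0) > tolerance:
                flagged.add(name)
    return sorted(flagged)


# Made-up estimates from two runs with different read lengths.
runs = {
    "rl50": {"scg_a": 1.02, "scg_b": 1.60, "scg_c": 0.95},
    "rl150": {"scg_a": 0.98, "scg_b": 1.05, "scg_c": 0.70},
}
flagged = inconsistent_scgs(runs)
```

Here 'scg_b' and 'scg_c' are flagged because each deviates from 1 by more than the tolerance in at least one run, while 'scg_a' stays close to 1 in both.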

Actual HGDP Datasets

Third step of simulation studies: Using a few actual datasets
ADD TO SOURCEFORGE REPOSITORY