Here we validate the scripts for generating Illumina and PacBio reads and show the entire protocol used for the validation.
Results of the validation are highlighted in yellow.
Download the chassis sequence: https://sourceforge.net/projects/simulates/files/SanMiguel/chasis.fa/download
Simulate Illumina paired-end reads; read length = 100; inner distance = 100 (outer distance = 300); standard deviation of the inner distance = 20; error rate = 0.01; number of reads = 50000 (the resulting coverage should be 100);
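Expected coverage: --reads 50000 yields 50,000 read pairs (100,000 reads, cf. the Picard output below), i.e. 50,000 × 2 × 100 bp = 10 Mb of simulated sequence over the 100 kb chassis, which corresponds to roughly 100-fold coverage.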
python ~/dev/simulate/read_pool-seq_illumina-PE.py --pg chasis.fa --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 50000 --fastq1 reads1.fq --fastq2 reads2.fq
# bwa 0.7.15-r1140
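# note: the reference must be indexed for bwa (bwa index chasis.fa) before mapping, if not already done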
bwa mem chasis.fa reads1.fq reads2.fq|samtools view -Sb - | samtools sort - mapped.sort
We estimate the error rate using Picard (v2.9.4)
java -jar ~/programs/picard-2.9.4/picard.jar CollectAlignmentSummaryMetrics I=mapped.sort.bam R=chasis.fa O=summary.metrics.txt
and obtain:
## METRICS CLASS picard.analysis.AlignmentSummaryMetrics
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS PF_READS_IMPROPER_PAIRS PCT_PF_READS_IMPROPER_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER SAMPLE LIBRARY READ_GROUP
FIRST_OF_PAIR 50000 50000 1 0 50000 1 4999527 50000 4999527 4999527 1 0.009905 0.009905 0 100 50000 1 0 0 0 1 0 0
SECOND_OF_PAIR 50000 50000 1 0 50000 1 4999544 50000 4999544 4999544 1 0.00993 0.00993 0 100 50000 1 0 0 0 0 0 0
PAIR 100000 100000 1 0 100000 1 9999071 100000 9999071 9999071 1 0.009917 0.009917 0 100 100000 1 0 0 0 0.5 0 0
error rate simulated = 0.01; obtained = 0.0099
Next we investigated the coverage. In particular, we were interested in whether we obtain the correct mean coverage and whether all regions of the chassis are used as templates for reads, i.e. whether the coverage is even within the boundaries of a Poisson distribution.
samtools mpileup mapped.sort.bam > cha.mpileup
cat cha.mpileup|awk '{print $2,$4}'> coveragedistribution.forR
## the following results were then obtained in R
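For example, with a minimal sketch along these lines (the exact R commands used may have differed):
cov <- read.table("coveragedistribution.forR", col.names = c("pos", "coverage"))
mean(cov$coverage)   # mean coverage along the chassis
sd(cov$coverage)     # standard deviation of the coverage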
mean coverage simulated = 100; obtained = 99.9947
standard deviation coverage simulated (Poisson) = sqrt(100) = 10; obtained = 10.94
The following graph shows the coverage distribution along the chassis (100kb length).
Next we tested whether the inner distance of Illumina paired-ends is simulated correctly.
We estimated the fragment size (insert size) distribution using Picard (v2.9.4)
java -jar ~/programs/picard-2.9.4/picard.jar CollectInsertSizeMetrics I=mapped.sort.bam O=is.txt H=histo.pdf
which yields the following output
## METRICS CLASS picard.analysis.InsertSizeMetrics
MEDIAN_INSERT_SIZE MEDIAN_ABSOLUTE_DEVIATION MIN_INSERT_SIZE MAX_INSERT_SIZE MEAN_INSERT_SIZE STANDARD_DEVIATION READ_PAIRS PAIR_ORIENTATION WIDTH_OF_10_PERCENT WIDTH_OF_20_PERCENT WIDTH_OF_30_PERCENT WIDTH_OF_40_PERCENT WIDTH_OF_50_PERCENT WIDTH_OF_60_PERCENT WIDTH_OF_70_PERCENT WIDTH_OF_80_PERCENT WIDTH_OF_90_PERCENT WIDTH_OF_99_PERCENT SAMPLE LIBRARY READ_GROUP
299 14 217 381 299.52654 20.084087 50000 FR 7 11 17 23 29 35 43 53 67 105
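Note that Picard reports the insert size, i.e. the outer distance: with an inner distance of 100 and two reads of 100 bp each, the expected insert size is 100 + 2 × 100 = 300 bp.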
mean fragment size simulated = 300; obtained = 299.526
standard deviation of fragment size simulated = 20; obtained = 20.084
To test whether the observed fragment size distribution matches the expected one, we extracted the fragment sizes (samtools view mapped.sort.bam|awk '{print $9}' > obs-rld.txt) and used the following Chi-square test:
https://sourceforge.net/projects/simulates/files/validation_reads/chisquare.R/download
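A minimal sketch of such a test is shown below (the exact binning in chisquare.R may differ; expected probabilities are taken from the simulated normal distribution with mean 300 and standard deviation 20):
tlen <- scan("obs-rld.txt")                             # TLEN (column 9); positive for one mate of each pair
obs <- tlen[tlen > 0]                                   # one fragment size per read pair
breaks <- seq(min(obs) - 0.5, max(obs) + 0.5, by = 1)   # one bin per integer fragment size
obs.counts <- table(cut(obs, breaks))
exp.p <- diff(pnorm(breaks, mean = 300, sd = 20))       # expected bin probabilities
chisq.test(obs.counts, p = exp.p, rescale.p = TRUE)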
Pearson's Chi-squared test; X-squared = 91.399, df = 164, p-value = 1
Hence the observed fragment size distribution does not deviate significantly from the expected one.
The following graph (generated by Picard) shows the fragment size distribution of the simulated paired-end reads:
PacBio reads frequently show a bimodal length distribution. SimulaTE therefore allows the user to specify a read length distribution, and the simulated reads will follow this distribution.
In this validation we use the following read length distribution:
the file can be obtained here https://sourceforge.net/projects/simulates/files/validation_reads/rld.txt/download
We simulate PacBio reads:
python ~/dev/simulate/read_pool-seq_pacbio.py --pg chasis.fa --rld-file rld.txt --error-rate 0.1 --reads 10000 --fasta reads.fa
We simulated 10k reads with an error rate of 0.1 (50% insertions and 50% deletions; this value may be adjusted by the user).
We computed the read length distribution:
samtools faidx reads.fa
cat reads.fa.fai|awk '{print $2}' > reads-lengs.txt
We then compared the expected (line) and observed (histogram) read length distributions in R (ggplot2).
We used the following R script to test whether the expected and observed distributions are identical: https://sourceforge.net/projects/simulates/files/validation_reads/rld-ks.R/download
The simulated and the observed read length distributions are not significantly different (two-sample Kolmogorov-Smirnov test; D = 0.0074, p = 0.63), indicating that SimulaTE accurately reproduces a given read length distribution.
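A minimal sketch of such a test (the actual script is rld-ks.R, linked above; here we assume rld.txt holds two whitespace-separated columns, read length and weight):
rld <- read.table("rld.txt", col.names = c("length", "weight"))
obs <- scan("reads-lengs.txt")                          # observed read lengths
exp <- sample(rld$length, size = length(obs), replace = TRUE, prob = rld$weight)   # sample lengths from the expected distribution
ks.test(obs, exp)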
Next we validated the error rate of the simulated PacBio reads. First we need to map the reads; we use sensitive search parameters and force the reads to map end-to-end (semi-global alignments)
bwa mem -k14 -W30 -r10 -A1 -B1 -O1 -E1 -L100,100 chasis.fa reads-n4.fa |samtools view -Sb - |samtools sort - map-pacbio.sort.bam
Unfortunately, Picard failed to process these data (it crashed with an array-index-out-of-bounds error), possibly because it is not yet adapted to BAM files generated from very long reads.
Therefore we used a different approach, i.e. a custom Perl script: https://sourceforge.net/projects/simulates/files/validation_reads/count-error.pl/download
samtools mpileup -x -AB -Q 0 -f chasis.fa map-pacbio.sort.bam.bam > pacbio.mpileup
perl count-error.pl pacbio.mpileup
# Which yields:
# Deletions 4370823
# Insertions 4369478
# Basesubs 1655095
# total 107448200
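These counts translate into the following rates (a quick check in R):
4370823 / 107448200                                # deletion rate
4369478 / 107448200                                # insertion rate
1655095 / 107448200                                # base substitution rate
(4370823 + 4369478 + 1655095) / 107448200          # total error rate, ~0.097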
deletions simulated=0.05; obtained= 0.0406
insertions simulated=0.05; obtained=0.0406
The small difference between the simulated and the obtained indel rates may be due to difficulties in aligning reads with indels. This is, for example, demonstrated by the considerable number of observed base substitutions (1,655,095; see the counts above): since no base substitutions were simulated, the observed ones must be an artefact of incorrect alignments. In agreement with this, the total error rate (indels + base substitutions) of approximately 9.7% is almost identical to the simulated 10%.
Finally we investigated the coverage.
cat pacbio.mpileup|awk '{print $2,$4}' > pacbio.coverage
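The mean coverage can then be computed in R (a minimal sketch; the expected value presumably follows from 10,000 reads times the mean of the provided read length distribution, divided by the 100 kb chassis length):
cov <- read.table("pacbio.coverage", col.names = c("pos", "coverage"))
mean(cov$coverage)   # mean coverage along the chassis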
mean coverage expected= 1118.18; observed= 1118.358
This shows that, except for the ends of the chassis, the coverage of the PacBio reads is uniform. The lower coverage at the ends is expected, as reads do not extend beyond the sequence ends. We therefore recommend excluding the sequence ends when testing the performance of tools with PacBio reads.