Menu

Validate_reads

Robert Kofler
Attachments
coverage_sizeadjust.png (79571 bytes)
histo_size_adj.png (60973 bytes)
pacbio.coverage.png (23686 bytes)
rld-histo.png (10726 bytes)
rld.png (7650 bytes)

Introduction

Here, we validate the scripts generating Illumina and PacBio reads. We show the entire protocol used for validating the scripts.

Results of the validation are highlighted in yellow

Illumina

Download https://sourceforge.net/projects/simulates/files/SanMiguel/chasis.fa/download

Simulate Illumina paired-end reads; read length = 100; inner distance = 100 (outer distance = 300); standard deviation of the inner distance = 20; error rate = 0.01; number of reads = 50000 (the resulting coverage should be 100);

python ~/dev/simulate/read_pool-seq_illumina-PE.py --pg chasis.fa --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 50000 --fastq1 reads1.fq --fastq2 reads2.fq  
# bwa 0.7.15-r1140
bwa mem chasis.fa reads1.fq reads2.fq|samtools view -Sb - | samtools sort - mapped.sort

Error rate

We estimate the error rate using Picard (v2.9.4)

java -jar ~/programs/picard-2.9.4/picard.jar CollectAlignmentSummaryMetrics I= mapped.sort.bam R= chasis.fa O= sumary.metrics.txt

and obtain:

## METRICS CLASS    picard.analysis.AlignmentSummaryMetrics
CATEGORY    TOTAL_READS PF_READS    PCT_PF_READS    PF_NOISE_READS  PF_READS_ALIGNED    PCT_PF_READS_ALIGNED    PF_ALIGNED_BASES    PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE    PF_HQ_ERROR_RATE    PF_INDEL_RATE   MEAN_READ_LENGTH    READS_ALIGNED_IN_PAIRS  PCT_READS_ALIGNED_IN_PAIRS  PF_READS_IMPROPER_PAIRS PCT_PF_READS_IMPROPER_PAIRS BAD_CYCLES  STRAND_BALANCE  PCT_CHIMERAS    PCT_ADAPTER SAMPLE  LIBRARY READ_GROUP
FIRST_OF_PAIR   50000   50000   1   0   50000   1   4999527 50000   4999527 4999527 1   0.009905    0.009905    0   100 50000   1   0   0   0   1   0   0   
SECOND_OF_PAIR  50000   50000   1   0   50000   1   4999544 50000   4999544 4999544 1   0.00993 0.00993 0   100 50000   1   0   0   0   0   0   0           
PAIR    100000  100000  1   0   100000  1   9999071 100000  9999071 9999071 1   0.009917    0.009917    0   100 100000  1   0   0   0   0.5 0   0       
  • error rate simulated = 0.01; obtained = 0.099

Coverage and read distribution

Next we investigated the coverage. In particular we were interested whether we find the correct mean coverage and if all regions of the chassis are used as templates for reads, i.e. whether we find an even coverage within the boundaries of a Poission distribution.

samtools mpileup mapped.sort.bam > cha.mpileup
cat cha.mpileup|awk '{print $2,$4}'> coveragedistribution.forR 
## the following results were than obtained in R
  • mean coverage simulated = 100; obtained = 99.9947

  • standard deviation coverage simulated (Poisson) = sqrt(100) = 10; obtained = 10.94

The following graph shows the coverage distribution along the chassis (100kb length).

inner distance

Next we tested whether the inner distance of Illumina paired-ends is simulated correctly.
We estimated the inner distance using Picard (v2.9.4)

java -jar ~/programs/picard-2.9.4/picard.jar CollectInsertSizeMetrics I= mapped.sort.bam O= is.txt H= histo.pdf

which yields the following output

## METRICS CLASS        picard.analysis.InsertSizeMetrics
MEDIAN_INSERT_SIZE      MEDIAN_ABSOLUTE_DEVIATION       MIN_INSERT_SIZE MAX_INSERT_SIZE MEAN_INSERT_SIZE        STANDARD_DEVIATION      READ_PAIRS      PAIR_ORIENTATION        WIDTH_OF_10_PERCENT     WIDTH_OF_20_PERCENT     WIDTH_OF_30_PERCENT     WIDTH_OF_40_PERCENT     WIDTH_OF_50_PERCENT     WIDTH_OF_60_PERCENT     WIDTH_OF_70_PERCENT     WIDTH_OF_80_PERCENT     WIDTH_OF_90_PERCENT     WIDTH_OF_99_PERCENT     SAMPLE  LIBRARY READ_GROUP
299     14      217     381     299.52654       20.084087       50000   FR      7       11      17      23      29      35      43      53      67      105                     
  • mean fragment size simulated = 300; obtained = 299.526

  • standard deviation of fragment size simulated = 20; obtained = 20.084

To test if the observed and expected fragment size distributions are identical, we extracted the fragment size (samtools view mapped.sort.bam|awk '{print $9}'> obs-rld.txt) and used the following Chi-square test:
https://sourceforge.net/projects/simulates/files/validation_reads/chisquare.R/download

Pearson's Chi-squared test; X-squared = 91.399, df = 164, p-value = 1

This demonstrates that the observed and expected fragment size distributions are identical

The following graph (generated by Picard) shows the fragment size distribution of the simulated paired-end reads:

PacBio

PacBio reads frequently show a bimodal length distribution. SimulaTE allows to specify a read length distribution, with simulated reads following this distribution.
In this validation we use the following read length distribution:

the file can be obtained here https://sourceforge.net/projects/simulates/files/validation_reads/rld.txt/download

We simulate PacBio reads:

python ~/dev/simulate/read_pool-seq_pacbio.py --pg chasis.fa --rld-file rld.txt --error-rate 0.1 --reads 10000 --fasta reads.fa

We simulated 10k reads with an error rate of 0.1 (50% insertions and 50% deletions; this value may be adjusted by the user).

Read length distribution

We computed the read length distribution:

samtools faidx reads.fa
cat reads.fa.fai|awk '{print $2}' > reads-lengs.txt

And compared the expected (line) and observed (histogram) read length distribution in R (ggplot2)

We used the following R-script to test whether the expected and observed distribution are identical https://sourceforge.net/projects/simulates/files/validation_reads/rld-ks.R/download

The simulated and the observed read length distribution are not significantly different (Two-sample Kolmogorov-Smirnov test; D=0.0074, p=0.63), demonstrating that SimulaTE accurately reproduces a given read length distribution

Error rate

First we need to map the reads. We use sensitive search parameters and force the reads to map end-to-end (semi-global alignments)

bwa mem -k14 -W30 -r10 -A1 -B1 -O1 -E1 -L100,100 chasis.fa reads-n4.fa |samtools view -Sb - |samtools sort - map-pacbio.sort.bam

Unfortunatelly Picard decided not to process these data (it crashed with ArrayIndex out of boundaries). Possibly it is not yet adapted to bam files generated from very long reads.

Therefore we used a different approach, i.e. a custom perl script: https://sourceforge.net/projects/simulates/files/validation_reads/count-error.pl/download

samtools mpileup -x -AB -Q 0 -f chasis.fa map-pacbio.sort.bam.bam > pacbio.mpileup
perl count-error.pl pacbio.mpileup
# Which yields:
# Deletions 4370823
# Insertions 4369478
# Basesubs 1655095
# total 107448200
  • deletions simulated=0.05; obtained= 0.0406

  • insertions simulated=0.05; obtained=0.0406

The small difference between the simulated and obtained indel rate may be due to difficulties of aligning reads with indels; This is for example demonstrated by the vast number of observed base substitutions (1019677). Since no base substitions were simulated the obtained ones must be an artefact of incorrect alignments. In agreement with this, the total error rate (indels + base substitutions) is with 9.6% almost identical to the simulated 10%.

Read distribution and coverage

Finally we investigated the coverage.

cat pacbio.mpileup|awk '{print $2,$4}' > pacbio.coverage
  • mean coverage expected= 1118.18; observed= 1118.358

This shows that, except for the ends of the chassis, the coverage distribution of PacBio reads is uniform. The lower coverage at the ends of the chassis is expected as reads do not extend beyond the sequence ends. We thus recommend not to consider sequence ends when testing the performance of tools using PacBio reads


Related

Wiki: Home

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.