GeneticThesaurus Wiki

Annotation of genetic variants in repetitive regions

Brought to you by: tkonopka

Example

To see the GeneticThesaurus in action, you can try it out on an example. The example zip file contains almost all files necessary to carry out a complete thesaurus annotation workflow (you will need to provide your own fasta file for the hg19 genome).

File example.truth.vcf is a definition of a true site of variation on chromosome 1.

File example.PE.bam contains 124 SAM records mapped at two genomic regions on chromosome 1. The reads are synthetically generated so that the true variant is encoded into the read sequences. You can verify that some reads are misaligned by comparing each read's mapping position with the position of origin encoded in its read name.

File example.thesaurus.tsv contains a small thesaurus table that describes the two genomic regions covered in the alignment.

The remaining files represent the output of a variant discovery workflow. These are files you can reproduce using the software.

Initial variant calling

To begin, call variants from the bam file. At this stage, it is important to instruct the variant caller to try to call variants even in low mappability regions. Using Bamformatics, this can be achieved as follows (you need to provide your own hg19.fa file):

java -jar Bamformatics.jar callvariants \
    --genome hg19.fa \
    --bam example.PE.bam \
    --output calls.vcf \
    --minmapqual 1

Note that it is important to set the --minmapqual option here because reads with low mapping quality are ignored by default.

After variant calling, you should have a VCF file with a raw set of calls. Compare with the raw calls (two variants) provided in the zip file (example.PE.vcf).
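The comparison with the provided raw calls can be done by eye, but a small script makes it repeatable. The following is a minimal sketch, not part of GeneticThesaurus or Bamformatics: it assumes both files are plain-text VCFs with the standard tab-separated CHROM, POS, ID, REF, ALT columns, and the helper names (vcf_sites, compare_calls, compare_files) are made up for illustration.

```python
def vcf_sites(lines):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF text lines."""
    sites = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header lines (both "##" meta lines and "#CHROM").
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt = fields[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def compare_calls(mine, reference):
    """Split two site sets into shared, only-in-mine, and only-in-reference."""
    return {
        "shared": mine & reference,
        "only_mine": mine - reference,
        "only_reference": reference - mine,
    }


def compare_files(my_path, reference_path):
    """Compare the variant sites in two plain-text VCF files."""
    with open(my_path) as a, open(reference_path) as b:
        return compare_calls(vcf_sites(a), vcf_sites(b))
```

Calling compare_files("calls.vcf", "example.PE.vcf") should, if the run reproduced the provided output, report two shared sites and nothing in the other two sets.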

Filtering with the thesaurus

Then, annotate the variants using the thesaurus, e.g.

java -jar GeneticThesaurus.jar filter \
    --genome hg19.fa \
    --thesaurus small. \
    --bam example.PE.bam \
    --vcf calls.vcf \
    --output calls.thesaurus

This should run very quickly and produce three output files with endings vcf.gz, vtf.gz, and baf.tsv.gz. You can decompress them with gzip if you like. Compare the results with the similarly named files provided in the zip file.
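The output files can also be inspected without decompressing them on disk. The sketch below uses only Python's standard gzip module. The function name preview_gz is hypothetical, and the file name in the usage note is an assumption formed by joining the --output prefix with the vcf.gz ending mentioned above.

```python
import gzip


def preview_gz(path, max_lines=5):
    """Return up to max_lines lines from a gzip file, skipping '#' header lines."""
    preview = []
    # Mode "rt" makes gzip decode the stream to text line by line.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            preview.append(line.rstrip("\n"))
            if len(preview) >= max_lines:
                break
    return preview
```

For example, preview_gz("calls.thesaurus.vcf.gz") would return the first few annotated records, assuming that is the name the filter step wrote.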

Interpretation

Consider the output files provided in the zip file. The raw calls consist of two variants, of which one is true and the other is false.

After annotation, we know that the two raw calls come from reads that are ambiguously mapped. We also know the alternative sites for each variant and can deduce that they are in fact linked together. We can thus interpret the raw calls as a cluster of evidence for a single variant, which reduces the number of false positives.
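The reasoning behind treating linked calls as one cluster can be illustrated with a connected-components grouping. This is only a sketch of the idea, not the algorithm GeneticThesaurus implements: the input dictionary is a hypothetical in-memory structure (this page does not describe the vtf file layout), and the function name cluster_calls is made up.

```python
def cluster_calls(alt_sites):
    """Group called loci into clusters linked through their alternative sites.

    alt_sites maps each called locus, e.g. (chrom, pos), to the list of loci
    reported as its alternatives. This input structure is hypothetical.
    """
    parent = {}

    def find(x):
        # Locate the cluster representative, shortening paths along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for call, alternatives in alt_sites.items():
        find(call)  # register calls that have no alternatives
        for alt in alternatives:
            union(call, alt)

    # Report only loci that were actually called, grouped by representative.
    clusters = {}
    for call in alt_sites:
        clusters.setdefault(find(call), []).append(call)
    return sorted(sorted(members) for members in clusters.values())
```

With two calls that list each other as alternative sites, the function returns a single cluster containing both, which corresponds to counting them as evidence for one variant instead of two.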

After annotation, we still cannot infer whether the true variant is located at the first locus or the second. For that, we would need to perform a different sequencing experiment.

Further examples

The data in this example is a subset of a larger test dataset described in the manuscript.
