GeneticThesaurus Wiki

Annotation of genetic variants in repetitive regions

Brought to you by: tkonopka

Miscellaneous

Tips and Tricks

Index your genome

Many of the programs that make up GeneticThesaurus need the reference genome, which you provide through the --genome argument. Index your genome for better performance; see the samtools page for help.
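As a minimal sketch, assuming the index meant here is the FASTA index written by samtools faidx (the tip above only points to samtools in general), the helper below builds the index when it is missing. The function name index_genome is made up for illustration and is not part of GeneticThesaurus.

```shell
# Hypothetical helper: build a FASTA index next to the genome file
# unless one already exists. "samtools faidx" writes <genome>.fai.
index_genome() {
    genome="$1"
    if [ ! -f "${genome}.fai" ]; then
        samtools faidx "$genome"
    fi
}

# Usage (file name taken from the examples on this page):
#   index_genome your.genome.fa
```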

Target regions

Instead of using a whole-genome thesaurus, you may want to study only selected regions. To avoid manipulating very large files, you can create a smaller version of the thesaurus containing only the entries that are pertinent to you, e.g.

java -jar GeneticThesaurus.jar subset \
    --thesaurus thesaurus.tsv.gz \
    --output small.thesaurus.tsv.gz \
    --region chr1:1000000-2000000 \
    --bed mybed.bed

This should create a small thesaurus file covering a one-megabase section of chr1 plus whatever regions you specify in the bed file.
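For reference, a bed file like the one passed to --bed can be written by hand. The sketch below assumes the standard three-column BED layout (chromosome, start, end, separated by tabs); the two intervals are arbitrary placeholders, not regions of any particular interest.

```shell
# Write two placeholder intervals to mybed.bed.
# In BED, columns are tab-separated, starts are 0-based and ends are exclusive.
printf 'chr2\t5000000\t5100000\n'  > mybed.bed
printf 'chrX\t150000\t250000\n'   >> mybed.bed
```

Using printf rather than typing the lines in an editor guards against tabs being silently replaced by spaces, which many BED parsers do not accept.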

Summarize thesaurus regions

You can obtain a summary of all the regions that can be annotated with the thesaurus resource:

java -jar GeneticThesaurus.jar summarize \
    --genome your.genome.fa \
    --thesaurus thesaurus.tsv.gz \
    --output thesaurus.align.bed \
    --what align

This should create a bed file containing the intervals that are described in the thesaurus. More precisely, the bed file contains the intervals onto which reads were mapped during thesaurus generation.
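To get a quick sense of how much of the genome such a bed file covers, you can add up the interval lengths. This is only a sketch: the function name bed_total_bases is hypothetical, and the total is exact only if the intervals in the file do not overlap, which is an assumption about the output rather than something stated here.

```shell
# Hypothetical helper: sum (end - start) over all intervals in a BED file.
# Header and comment lines (track, browser, #) are skipped.
bed_total_bases() {
    awk -F'\t' '
        !/^(#|track|browser)/ { total += $3 - $2 }
        END { printf "%.0f\n", total }
    ' "$1"
}

# Usage:
#   bed_total_bases thesaurus.align.bed
```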

You can also obtain a similar track showing the intervals from which reads originated before being mapped elsewhere. Ideally, this track should be equivalent to the one described above (mapping symmetry). Unfortunately, this is not always the case in practice because of imperfect mapping. However, discrepancies affect only small repetitive segments; similar fragments substantially longer than the read length should be captured correctly.
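One rough way to look for such discrepancies is to list intervals that appear in one track but not in the other. The sketch below is an illustration only: the function name bed_only_in_first and the second file name thesaurus.origin.bed are hypothetical, and the comparison is by exact interval match, so two intervals that overlap but differ in their endpoints are both reported.

```shell
# Hypothetical helper: print intervals present in the first BED file
# but absent from the second. Lines are compared verbatim
# (chrom, start, end), not by base-level overlap.
bed_only_in_first() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"
    comm -23 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}

# Usage (second file name is a placeholder for the origin track):
#   bed_only_in_first thesaurus.align.bed thesaurus.origin.bed
#   bed_only_in_first thesaurus.origin.bed thesaurus.align.bed
```

Running the comparison in both directions shows which intervals break the symmetry on either side.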