Creating a set of thesaurus annotations for a variant set is just the start of an analysis. The GeneticThesaurus software provides several utilities to help with follow-up work. These are described below.
An associated R package, RGeneticThesaurus, provides additional tools for working with thesaurus annotations.
To evaluate the performance of thesaurus annotation, you can compare sets of variant calls,
java -jar GeneticThesaurus.jar compare \
  --genome your.genome.fa \
  --ref true.variants.vcf.gz \
  --vcf called.variants.vcf.gz \
  --synonyms called.variants.vtf.gz
This program requires two VCF files: one is treated as the ground truth (--ref) and the other as the set of called variants (--vcf). The program also accepts a VTF file with alternate loci (--synonyms). The output consists of multiple files that separate the variants into groups.
Note: This type of comparison is also possible via the R package RGeneticThesaurus.
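To make the idea of comparing a call set against a ground truth concrete, here is a minimal Python sketch. It is not the GeneticThesaurus implementation: it ignores thesaurus synonyms entirely, and the function names (read_variant_keys, compare_calls) and group names are assumptions chosen for illustration. The only format detail it relies on is the standard VCF column order (CHROM, POS, ID, REF, ALT).

```python
def read_variant_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # A VCF record may list several comma-separated alternate alleles.
        for alt in alts.split(","):
            keys.add((chrom, pos, ref, alt))
    return keys


def compare_calls(truth, called):
    """Split variants into three groups (hypothetical group names)."""
    return {
        "true_positives": truth & called,
        "false_positives": called - truth,
        "false_negatives": truth - called,
    }


# Tiny in-memory example; the records are made up for illustration.
truth_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t250\t.\tC\tT\t.\t.\t.",
]
called_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr2\t75\t.\tG\tA\t30\tPASS\t.",
]

groups = compare_calls(read_variant_keys(truth_lines),
                       read_variant_keys(called_lines))
for name, members in groups.items():
    print(name, sorted(members))
```

For real files, the same functions can be fed lines from gzip.open(path, "rt"). A thesaurus-aware comparison would additionally treat a called variant as matched when one of its alternate loci coincides with a true variant; that step is left out of this sketch.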
Thesaurus annotation creates conceptual links between multiple sites in the genome. Sometimes, annotation can link multiple sites in a VCF file together; this can arise if evidence for a true variant is distributed across, say, two genomic sites that are identified separately during variant calling. To identify such clusters of variants, run the network program,
java -jar GeneticThesaurus.jar network \
  --genome your.genome.fa \
  --vcf variants.thesaurus.vcf.gz \
  --vtf variants.thesaurus.vtf.gz \
  --output variants.clusters
The output is a table with one entry per variant line in the VCF file. Each variant is labeled with a cluster number. Variants that share a cluster number are linked by thesaurus annotation.
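The cluster labeling can be pictured as finding connected components in a graph whose nodes are genomic sites and whose edges are thesaurus links. The Python sketch below illustrates that idea with a small union-find structure. It is an assumption about how such clustering could work, not a description of the network program's internals; the function name cluster_variants and the (chrom, pos) tuples are hypothetical, and no VTF file layout is assumed. In this sketch, two variants that both link to the same third site end up in one cluster, which may or may not match the tool's behavior.

```python
def cluster_variants(variants, links):
    """Assign a cluster number to each variant based on pairwise site links."""
    parent = {}

    def find(site):
        # Register unseen sites lazily, then walk to the root with path halving.
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]
            site = parent[site]
        return site

    # Merge the components of every linked pair of sites.
    for site_a, site_b in links:
        root_a, root_b = find(site_a), find(site_b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number clusters in order of first appearance among the variants.
    cluster_numbers = {}
    labels = {}
    for variant in variants:
        root = find(variant)
        if root not in cluster_numbers:
            cluster_numbers[root] = len(cluster_numbers) + 1
        labels[variant] = cluster_numbers[root]
    return labels


# Made-up example: four called variants and three thesaurus links.
variants = [("chr1", 100), ("chr1", 5000), ("chr2", 75), ("chr3", 10), ("chr4", 42)]
links = [
    (("chr1", 100), ("chr1", 5000)),  # two called variants linked directly
    (("chr2", 75), ("chr7", 900)),    # both link to the same uncalled site
    (("chr3", 10), ("chr7", 900)),
]

labels = cluster_variants(variants, links)
for variant in variants:
    print(variant, labels[variant])
```

Variants without any links, such as the last one in the example, receive a cluster number of their own.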
Sometimes it is useful to know whether links point to previously known sites of variation. To integrate variant IDs from a database (e.g. dbSNP) into a thesaurus analysis, run the VTF annotation tool,
java -jar GeneticThesaurus.jar annotatevtf \
  --vtf variants.thesaurus.vtf.gz \
  --database dbSNP.vcf.gz \
  --output variants.thesaurus.dbSNP.vtf.gz
The output will consist of a new VTF file that incorporates variant IDs from the database. The IDs are visible in the output VTF file and can be viewed in any text editor. The R package also provides some support for reading database-annotated links.
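Conceptually, this kind of annotation is a lookup from genomic position to a known variant ID. The following Python sketch shows such a lookup under stated assumptions: it reads only the standard CHROM, POS and ID columns of a database VCF, it represents linked sites as plain (chrom, pos) tuples instead of parsing a VTF file, and the names load_variant_ids and annotate_sites are hypothetical. The IDs in the example are placeholders, not real dbSNP entries.

```python
def load_variant_ids(vcf_lines):
    """Map (chrom, pos) to the ID column of a database VCF, skipping headers."""
    ids = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, variant_id = line.rstrip("\n").split("\t")[:3]
        # In VCF, "." in the ID column means that no identifier is available.
        if variant_id != ".":
            ids[(chrom, int(pos))] = variant_id
    return ids


def annotate_sites(sites, ids):
    """Attach a database ID to each site, or "." when the site is not known."""
    return [(chrom, pos, ids.get((chrom, pos), ".")) for chrom, pos in sites]


# Made-up database records with placeholder IDs.
database_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t5000\trsEXAMPLE1\tA\tG\t.\t.\t.",
    "chr2\t75\t.\tG\tA\t.\t.\t.",
]
link_sites = [("chr1", 5000), ("chr2", 75), ("chr7", 900)]

annotated = annotate_sites(link_sites, load_variant_ids(database_lines))
for row in annotated:
    print(*row, sep="\t")
```

This sketch matches on position only. Matching on the reference and alternate alleles as well would be stricter; which rule the annotatevtf tool applies is not stated here.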