Followup

Tomasz

Creating a set of thesaurus annotations for a variant set is just the start of an analysis. The GeneticThesaurus software provides several utilities to help with the followup. These are described below.

An associated R package, RGeneticThesaurus, provides additional tools for working with thesaurus annotations.

Evaluating thesaurus output

To evaluate the performance of thesaurus annotation, you can compare a set of called variants against a ground-truth set:

java -jar GeneticThesaurus.jar compare \
    --genome your.genome.fa \
    --ref true.variants.vcf.gz \
    --vcf called.variants.vcf.gz \
    --synonyms called.variants.vtf.gz

This program requires two VCF files: one is treated as the ground truth (--ref) and the other as the set of called variants (--vcf). The program also accepts a VTF file with alternate loci (--synonyms). The output consists of multiple files that separate variants into groups:

  • True Positives (TP) - sites in the VCF that match the ground truth;
  • Thesaurus True Positives (TTP) - sites in the VCF that do not match the ground truth, but for which one of the annotated alternate sites matches the ground truth;
  • False Positives (FP) - sites in the VCF which do not match the ground truth and for which none of the alternate sites matches the ground truth;
  • False Negatives (FN) - sites in the ground truth set that are not recorded in the VCF or in the synonyms file.
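
The four groups above can be illustrated with a short Python sketch. This is not the GeneticThesaurus implementation: the function name classify_calls, the representation of a site as a (chromosome, position) pair, and the toy data are all assumptions made for illustration, and alleles are ignored for simplicity.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical simplification: a site is (chromosome, position); alleles are ignored.
Site = Tuple[str, int]


def classify_calls(
    truth: Set[Site],
    called: Iterable[Site],
    synonyms: Dict[Site, List[Site]],
) -> Dict[str, List[Site]]:
    """Split called sites into TP / TTP / FP and list the truth sites that are FN."""
    groups: Dict[str, List[Site]] = {"TP": [], "TTP": [], "FP": [], "FN": []}
    covered: Set[Site] = set()  # sites recorded in the calls or among their alternates

    for site in called:
        alternates = synonyms.get(site, [])
        if site in truth:
            groups["TP"].append(site)
        elif any(alt in truth for alt in alternates):
            # The called site is wrong, but one annotated alternate site is right.
            groups["TTP"].append(site)
        else:
            groups["FP"].append(site)
        covered.add(site)
        covered.update(alternates)

    # Truth sites that appear neither in the calls nor in the alternate sites.
    groups["FN"] = sorted(truth - covered)
    return groups


# Toy data, invented for this example.
truth = {("chr1", 100), ("chr1", 500), ("chr2", 50)}
called = [("chr1", 100), ("chr1", 300), ("chr3", 10)]
synonyms = {("chr1", 300): [("chr1", 500)]}

groups = classify_calls(truth, called, synonyms)
for name, sites in groups.items():
    print(name, sites)
```

In this toy example, the call at chr1:300 misses the ground truth but is rescued by its alternate site chr1:500, so it is counted as a TTP rather than an FP.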

Note: This type of comparison is also possible via the R package RGeneticThesaurus.

Network analysis

Thesaurus annotation creates conceptual links between multiple sites in the genome. Sometimes, annotation links several sites within the same VCF file. This can arise if evidence for a true variant is distributed onto, say, two genomic sites that are identified separately during variant calling. To identify such clusters of variants, run the network program:

java -jar GeneticThesaurus.jar network \
    --genome your.genome.fa \
    --vcf variants.thesaurus.vcf.gz \
    --vtf variants.thesaurus.vtf.gz \
    --output variants.clusters

The output is a table with one entry per variant line in the VCF file. Each variant is labeled with a cluster number. Variants that share a cluster number are linked by thesaurus annotations.
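
The idea of labeling linked variants with a shared cluster number can be sketched as a connected-components computation. The Python below is a minimal illustration using union-find, not the algorithm used by the network program. The function name cluster_sites, the (chromosome, position) site representation, and the rule that links pointing outside the VCF are ignored are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int]  # hypothetical (chromosome, position) representation


def cluster_sites(sites: List[Site], links: Iterable[Tuple[Site, Site]]) -> Dict[Site, int]:
    """Assign a cluster number to each site; linked sites share a number."""
    parent: Dict[Site, Site] = {s: s for s in sites}

    def find(x: Site) -> Site:
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        # Assumption: only links between two sites present in the VCF are used.
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Number clusters in order of first appearance, starting from 1.
    labels: Dict[Site, int] = {}
    clusters: Dict[Site, int] = {}
    for s in sites:
        root = find(s)
        if root not in labels:
            labels[root] = len(labels) + 1
        clusters[s] = labels[root]
    return clusters


# Toy data, invented for this example.
sites = [("chr1", 100), ("chr1", 900), ("chr2", 40), ("chr5", 7)]
links = [
    (("chr1", 100), ("chr2", 40)),  # two called sites linked to each other
    (("chr1", 900), ("chr9", 1)),   # link to a site that is not in the VCF
]

clusters = cluster_sites(sites, links)
for site in sites:
    print(site, clusters[site])
```

Here chr1:100 and chr2:40 receive the same cluster number, while the other two sites each form a cluster of their own.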

Using a variant database

Sometimes it is useful to know whether links point to previously known sites of variation. To integrate variant ids from a database (e.g. dbSNP) into a thesaurus analysis, run the VTF annotation tool:

java -jar GeneticThesaurus.jar annotatevtf \
    --vtf variants.thesaurus.vtf.gz \
    --database dbSNP.vcf.gz \
    --output variants.thesaurus.dbSNP.vtf.gz

The output is a new VTF file that incorporates variant ids from the database. The ids are visible in the output VTF file and can be viewed in any text editor. The R package also provides some support for reading database-annotated links.
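
Conceptually, this kind of annotation is a lookup of each linked site in the database. The Python sketch below illustrates the idea only. It reads ids from the standard VCF columns (CHROM, POS, ID), but it does not parse or write the VTF format. The function names, the in-memory representation of links as pairs of (chromosome, position) sites, the position-only matching, and the placeholder id rsEXAMPLE1 are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int]  # hypothetical (chromosome, position) representation


def load_variant_ids(vcf_lines: Iterable[str]) -> Dict[Site, str]:
    """Map (chromosome, position) to the ID column of VCF-formatted text lines."""
    ids: Dict[Site, str] = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete record
        chrom, pos, variant_id = fields[0], int(fields[1]), fields[2]
        if variant_id != ".":  # "." marks a missing id in VCF
            ids.setdefault((chrom, pos), variant_id)
    return ids


def annotate_links(
    links: Iterable[Tuple[Site, Site]], ids: Dict[Site, str]
) -> List[Tuple[Site, Site, str]]:
    """Attach the database id of each alternate site, or "." if none is known."""
    return [(origin, alt, ids.get(alt, ".")) for origin, alt in links]


# Toy database records; the id below is a made-up placeholder, not a real entry.
database_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t500\trsEXAMPLE1\tA\tG\t.\t.\t.",
    "chr2\t40\t.\tC\tT\t.\t.\t.",
]
links = [
    (("chr1", 300), ("chr1", 500)),
    (("chr1", 300), ("chr2", 40)),
]

ids = load_variant_ids(database_lines)
annotated = annotate_links(links, ids)
for row in annotated:
    print(row)
```

Matching on position alone is a simplification; a stricter comparison would also check the reference and alternate alleles.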

