A thesaurus resource is used to detect and annotate single nucleotide variants in repetitive genomic regions. This process, here called variant filtering, requires a thesaurus table, a set of called variants, and a matching alignment file.
To annotate an existing variant call file, use the command:
    java -jar GeneticThesaurus.jar filter \
        --genome your.genome.fa \
        --bam alignment.bam \
        --vcf variants.vcf \
        --thesaurus thesaurus.tsv.gz \
        --output variants.thesaurus
This should produce three output files: variants.thesaurus.vcf.gz, variants.thesaurus.vtf.gz, and variants.thesaurus.baf.tsv.gz. All files are compressed automatically.
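When the filter is run from a script, it can help to assemble the command and check for the three output files programmatically. The following is a minimal sketch, not part of GeneticThesaurus itself: the helper names build_filter_command, expected_outputs, and missing_outputs are hypothetical, and the file suffixes are taken from the output names listed above.

```python
from pathlib import Path


def build_filter_command(genome, bam, vcf, thesaurus, output_prefix,
                         jar="GeneticThesaurus.jar"):
    """Return the argument list for the filter command shown above."""
    return [
        "java", "-jar", jar, "filter",
        "--genome", genome,
        "--bam", bam,
        "--vcf", vcf,
        "--thesaurus", thesaurus,
        "--output", output_prefix,
    ]


def expected_outputs(output_prefix):
    """Return the three output paths derived from the output prefix."""
    suffixes = (".vcf.gz", ".vtf.gz", ".baf.tsv.gz")
    return [Path(output_prefix + suffix) for suffix in suffixes]


def missing_outputs(output_prefix):
    """Return the expected output paths that do not exist on disk."""
    return [path for path in expected_outputs(output_prefix) if not path.exists()]


# Example usage (commented out because it launches a long-running job):
# import subprocess
# cmd = build_filter_command("your.genome.fa", "alignment.bam", "variants.vcf",
#                            "thesaurus.tsv.gz", "variants.thesaurus")
# subprocess.run(cmd, check=True)
# print(missing_outputs("variants.thesaurus"))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.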
The first file will be in conventional VCF format. It will preserve all information from the original (including filter status), but will also update each variant's status in the filter column. The new filter codes are
In addition to the filter codes, each variant in the VCF file will be annotated with a new numeric tag TS (Thesaurus Synonyms) in the sample genotype column. The tag will show the number of alternate sites associated with the variant (0 for variants in non-repetitive regions, greater than 0 for annotated variants).
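The TS tag can be read back with a few lines of code. The sketch below assumes the standard VCF layout, in which the ninth column (FORMAT) lists colon-separated keys and each following sample column lists the matching values. The function name thesaurus_synonyms and the example line are hypothetical.

```python
def thesaurus_synonyms(vcf_line, sample_index=0):
    """Return the TS value for one sample on a VCF data line.

    Returns None for header lines, lines without sample columns,
    or lines where the TS key is absent or has a missing value.
    """
    if vcf_line.startswith("#"):
        return None
    fields = vcf_line.rstrip("\n").split("\t")
    sample_column = 9 + sample_index
    if len(fields) <= sample_column:
        return None
    keys = fields[8].split(":")
    values = fields[sample_column].split(":")
    if "TS" not in keys:
        return None
    position = keys.index("TS")
    if position >= len(values) or values[position] == ".":
        return None
    return int(values[position])


# Hypothetical data line: the FORMAT column is GT:TS, the sample column is 0/1:3.
example = "chr1\t12345\t.\tA\tG\t50\t.\t.\tGT:TS\t0/1:3"
print(thesaurus_synonyms(example))
```

A result of 0 indicates a variant in a non-repetitive region, and a positive value gives the number of alternate sites.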
The second file will be labeled VTF (Variant Thesaurus File). It will consist of one line per variant annotated with the thesaurus filter. Each line will contain two or more tab-separated elements. The first will be the locus of a called variant, and the subsequent elements will be the alternate variant sites.
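Given the line layout described above, a VTF file can be loaded into a lookup table from each called locus to its alternate sites. This is a minimal sketch under two assumptions: the file contains no header or comment lines, and loci are treated as opaque strings because their exact text format is not specified here. The function names and the example loci are hypothetical.

```python
import gzip


def parse_vtf_line(line):
    """Split one VTF line into (called locus, list of alternate sites)."""
    elements = line.rstrip("\n").split("\t")
    return elements[0], elements[1:]


def read_vtf_lines(lines):
    """Build a dict mapping each called locus to its alternate sites."""
    synonyms = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        locus, alternates = parse_vtf_line(line)
        synonyms[locus] = alternates
    return synonyms


def read_vtf(path):
    """Read a gzip-compressed VTF file such as variants.thesaurus.vtf.gz."""
    with gzip.open(path, "rt") as handle:
        return read_vtf_lines(handle)


# Hypothetical content: one called locus with two alternate sites.
table = read_vtf_lines(["chr1:100\tchr5:200\tchr7:300\n"])
print(table)
```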
The third output file will be labeled baf (B-allele frequencies). It will contain a table with one entry per variant. Next to the variant position, the file will convey the B-allele frequency as evaluated by naive read counting at the locus. For variants annotated with the thesaurus, there will also be a second B-allele frequency estimate that is based on read counting on all alternate loci.
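The difference between the two estimates can be illustrated with a small calculation. The sketch below is one plausible reading of the description above, not the program's documented method: it assumes the naive estimate is the fraction of reads supporting the alternate allele at the called locus, and that the thesaurus estimate pools reference and alternate read counts over the called locus and all its alternate loci. The function names and read counts are hypothetical.

```python
def baf(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate (B) allele, or None if no reads."""
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return alt_reads / total


def pooled_baf(counts):
    """Pool (ref_reads, alt_reads) pairs over several loci into one estimate."""
    counts = list(counts)  # allow generators, which can only be consumed once
    ref_total = sum(ref for ref, _ in counts)
    alt_total = sum(alt for _, alt in counts)
    return baf(ref_total, alt_total)


# Hypothetical read counts: the called locus first, then two alternate loci.
called = (18, 2)
alternates = [(10, 0), (12, 8)]

naive_estimate = baf(*called)
thesaurus_estimate = pooled_baf([called] + alternates)
print(naive_estimate, thesaurus_estimate)
```

In this made-up example the two estimates differ because reads carrying the alternate allele are spread unevenly across the similar loci.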
The filtering procedure can take several hours and use considerable memory (allow roughly 10 hours and 24 GB of RAM for processing a 30x whole-genome dataset).
The thesaurus annotation program can be tuned in a number of ways. You can run the program without any other settings to see a listing of all the available options.
In brief, they can be used as follows: