Tomasz

Annotating variants using a thesaurus

A thesaurus resource is designed to detect and annotate single nucleotide variants in repetitive genomic regions. This process, here called variant filtering, requires a thesaurus table, a set of called variants, and a matching alignment file.

Filtering variants

To annotate an existing variant call file, use the command

java -jar GeneticThesaurus.jar filter \
    --genome your.genome.fa \
    --bam alignment.bam \
    --vcf variants.vcf \
    --thesaurus thesaurus.tsv.gz \
    --output variants.thesaurus
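
When the filter step is launched from a script, it can help to assemble the argument list programmatically. The following is a minimal Python sketch under stated assumptions: the helper name build_filter_command and its extra_options parameter are hypothetical and not part of GeneticThesaurus; only the flags shown in the command above are taken from this page.

```python
import shlex


def build_filter_command(genome, bam, vcf, thesaurus, output,
                         jar="GeneticThesaurus.jar", extra_options=None):
    """Assemble the argument list for the 'filter' command (hypothetical helper)."""
    cmd = [
        "java", "-jar", jar, "filter",
        "--genome", genome,
        "--bam", bam,
        "--vcf", vcf,
        "--thesaurus", thesaurus,
        "--output", output,
    ]
    # Optional tuning flags given as {name: value}; values are converted to strings.
    for name, value in (extra_options or {}).items():
        cmd += ["--" + name, str(value)]
    return cmd


cmd = build_filter_command("your.genome.fa", "alignment.bam", "variants.vcf",
                           "thesaurus.tsv.gz", "variants.thesaurus")
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

To execute the command rather than print it, the list can be passed to subprocess.run(cmd, check=True).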

This should produce three output files: variants.thesaurus.vcf.gz, variants.thesaurus.vtf.gz, and variants.thesaurus.baf.tsv.gz (all files are automatically compressed).

The first file will be in conventional VCF format. It will preserve all information from the original file, including any existing filter status, and will additionally update each variant's filter column with thesaurus codes. The new filter codes are:

  • thesaurus - the variant can be linked with another locus;
  • thesaurusmany - the variant can be linked to a large number of loci;
  • thesaurushard - the variant lies in a region present in the thesaurus, but the read structure is complex and the alternate loci could not be computed.
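
As a quick check of the annotated VCF, the filter codes above can be tallied. The sketch below is illustrative only: the function name is hypothetical and the example records are invented toy data, not real output. It relies on the standard VCF layout, in which the filter status is the seventh tab-separated column and multiple codes are separated by semicolons.

```python
from collections import Counter

# The three filter codes described in the list above.
THESAURUS_CODES = {"thesaurus", "thesaurusmany", "thesaurushard"}


def count_thesaurus_filters(vcf_lines):
    """Count how many variant records carry each thesaurus filter code."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed records
        # FILTER is column 7; several codes may be joined with ';'.
        for code in fields[6].split(";"):
            if code in THESAURUS_CODES:
                counts[code] += 1
    return counts


# Invented toy records for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr1\t2000\t.\tC\tT\t50\tthesaurus\t.",
    "chr2\t3000\t.\tG\tA\t50\tLowQual;thesaurusmany\t.",
]
print(count_thesaurus_filters(example))
```

For a real file, the same function can be given an open handle, for example gzip.open("variants.thesaurus.vcf.gz", "rt").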

In addition to the filter codes, each variant in the VCF file will be annotated with a new numeric tag TS (Thesaurus Synonyms) in the sample genotype column. The tag will show the number of alternate sites associated with the variant (0 for variants in non-repetitive regions, larger than 0 for annotated variants).
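
The TS value can be read from a record with a few lines of Python. This is a sketch rather than a reference parser: the function name is hypothetical and the example record is invented. It assumes the usual VCF layout, where the ninth column lists the genotype keys and the sample values start in the tenth column.

```python
def thesaurus_synonyms(vcf_line, sample_index=0):
    """Return the TS value for one sample, or None if the tag is absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10 + sample_index:
        return None  # record has no genotype columns for this sample
    keys = fields[8].split(":")                   # FORMAT column, e.g. GT:TS
    values = fields[9 + sample_index].split(":")  # sample column, e.g. 0/1:3
    if "TS" not in keys:
        return None
    idx = keys.index("TS")
    if idx >= len(values):
        return None  # VCF allows trailing genotype fields to be omitted
    return int(values[idx])


# Invented toy record: a variant with three alternate sites.
record = "chr1\t2000\t.\tC\tT\t50\tthesaurus\t.\tGT:TS\t0/1:3"
print(thesaurus_synonyms(record))
```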

The second file will be labelled VTF (Variant Thesaurus File). It will consist of one line per variant annotated with the thesaurus filter. Each line will contain two or more tab-separated elements. The first will be the locus of a called variant, subsequent elements the alternate variant sites.
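
Based on the description above, a VTF file can be loaded into a dictionary that maps each called variant to its alternate sites. The sketch below makes several assumptions: the function name is hypothetical, loci are treated as opaque strings because their exact notation is not specified here, the "chr:position" values in the example are invented, and skipping lines that start with "#" is a precaution rather than a documented feature of the format.

```python
def read_vtf(lines):
    """Map each called variant locus to the list of its alternate sites."""
    links = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: ignore blank lines and possible comment lines
        fields = line.split("\t")
        # First element: called variant; remaining elements: alternate sites.
        links[fields[0]] = fields[1:]
    return links


# Invented toy lines; the locus notation is hypothetical.
example = [
    "chr1:2000\tchr5:70000",
    "chr2:3000\tchr2:91000\tchr7:4500",
]
for variant, alternates in read_vtf(example).items():
    print(variant, len(alternates), alternates)
```

For a real file, the lines can come from gzip.open("variants.thesaurus.vtf.gz", "rt").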

The third output file will be labelled baf (B-allele frequencies). It will contain a table with one entry per variant. Next to the variant position, the file will convey the B-allele frequency as evaluated by naive read counting at the locus. For variants annotated with the thesaurus, there will also be a second B-allele frequency estimate based on read counting across all alternate loci.
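
To illustrate why the two estimates can differ, the following sketch computes a naive B-allele frequency at one locus and a pooled estimate over several loci. It is an illustration under assumptions, not the program's actual method: the function names are hypothetical, the read counts are invented, and summing reads over the variant locus and its alternate loci is only one plausible reading of "read counting on all alternate loci".

```python
def baf(alt_reads, total_reads):
    """Naive B-allele frequency: fraction of reads carrying the alternate allele."""
    if total_reads == 0:
        return float("nan")  # no coverage, frequency undefined
    return alt_reads / total_reads


def pooled_baf(counts):
    """BAF pooled over several loci; counts is a list of (alt_reads, total_reads)."""
    alt = sum(a for a, _ in counts)
    total = sum(t for _, t in counts)
    return baf(alt, total)


# Invented counts: 10 of 20 reads support the variant at the called locus,
# while 0 of 20 reads support it at a single alternate locus.
called = (10, 20)
alternate = (0, 20)
print(baf(*called))                     # naive estimate at the called locus
print(pooled_baf([called, alternate]))  # estimate pooled over both loci
```

In this toy case the naive estimate is 0.5 while the pooled estimate is 0.25, which shows how a variant that looks heterozygous at one locus may correspond to a lower allelic fraction once all copies of the repeat are considered.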

The filtering procedure can take several hours and use considerable memory (allow roughly 10 hours and 24 GB of RAM to process a 30x whole-genome dataset).

Tuning filtering

The thesaurus annotation program can be tuned in a number of ways. You can run the program without any other settings to see a listing of all the available options.

In brief, they can be used as follows:

  • --insertsize <int> - expected insert size for paired reads. This is important because the positions of paired mates are used to eliminate thesaurus link candidates.
  • --clip <int> - number of base pairs at read ends that are temporarily clipped. Use this parameter to ignore errors near read extremities.
  • --minmapqual <int> - minimum mapping quality. Set this to be consistent with the variant calling workflow.
  • --readlen <int> - length of reads.
  • --tolerance <int> - number of tolerated mismatches between a read and the reference sequence. Use this parameter to increase or decrease the number of thesaurus links proposed in the output.
  • --maxtolerance <int> - number of tolerated mismatches between a read and the reference sequence, including called variants.
  • --hitproportion <int> - proportion of reads containing the variant that must be consistent with a thesaurus link.
  • --many <int> - number of thesaurus links proposed with the standard tolerance settings. When a variant is linked with more sites, the tolerance parameters are automatically made stricter.
  • --toomany <int> - maximum number of thesaurus links allowed in the output. Variants with more links are labelled 'thesaurusmany'.
  • --validate <string> - parameter passed on to the parser of SAM records.
