Recent changes to Filter

Filter modified by Tomasz

Tomasz — Wed, 19 Aug 2015 09:40:46 -0000

--- v7
+++ v8
@@ -33,50 +33,6 @@

-####Evaluating thesaurus output####
-
-To evaluate the performance of thesaurus annotation, you can compare sets of variant calls,
-
-   java -jar GeneticThesaurus.jar compare
-       --genome your.genome.fa
-       --ref true.variants.vcf.gz
-       --vcf called.variants.vcf.gz
-       --synonyms called.variants.vtf.gz
-
-This program requires two vcf files, one of which is treated as a ground truth and the other as a set of called variants. The program also accepts a VTF file with alternate loci. The output consists of multiple files separating variants into groups:
-
-* True Positives (TP) - sites in the VCF that match the ground truth; 
-* Thesaurus True Positives (TTP) - sites in the VCF that do not match the ground truth, but for which one of the annotated alternate sites matches the ground truth;
-* False Positives (FP) - sites in the VCF which do not match the ground truth and for which none of the alternate sites matches the ground truth;
-* False Negatives (FN) - sites in the ground truth set that are not recorded in the VCF or in the synonyms file.
-
-*Note:* Also check out the R package [RGeneticThesaurus](https://github.com/tkonopka/RGeneticThesaurus). This package can load variants and thesaurus annotations into R data frames and thus enables several types of variant analyses.
-
-####Network analysis####
-
-Thesaurus annotation creates conceptual links between multiple sites in the genome. Sometimes, annotation can link multiple sites in a VCF file together - this can arise if evidence for a true variant is distributed onto, say, two genomic sites which are identified separately during variant calling. To identify such clusters of variants, run the network program,
-
-   java -jar GeneticThesaurus.jar network
-       --genome your.genome.fa
-       --vcf variants.thesaurus.vcf.gz
-       --vtf variants.thesaurus.vtf.gz
-       --output variants.clusters
-
-Output will consist of a table with one entry per line in the VCF file. Each variant will also be labeled with a cluster number. Variants with identical cluster number are linked with a thesaurus annotation.
-
-
-
-####Using a variant database####
-
-Sometimes it is useful to know whether links point to previously known sites of variation. To integrate variant ids into a thesaurus analysis, run the vtf annotation tools,
-
-    java -jar GeneticThesaurus.jar annotatevtf
-        --vtf variants.thesaurus.vtf.gz
-        --database dbSNP.vcf.gz
-        --output variants.thesaurus.dbSNP.vtf.gz
-
-The output will consist of a new vtf file, but this new file will incorporate variant ids from the database. 
-
 ####Tuning filtering####

 The thesaurus annotation program can be tuned in a number of ways. You can run the program without any other settings to see a listing of all the available options.

Filter modified by Tomasz

Tomasz — Wed, 19 Aug 2015 09:40:00 -0000

Annotate modified by Tomasz

Tomasz — Wed, 19 Aug 2015 09:37:28 -0000

--- v5
+++ v6
@@ -50,7 +50,7 @@
 * False Positives (FP) - sites in the VCF which do not match the ground truth and for which none of the alternate sites matches the ground truth;
 * False Negatives (FN) - sites in the ground truth set that are not recorded in the VCF or in the synonyms file.

-
+*Note:* Also check out the R package [RGeneticThesaurus](https://github.com/tkonopka/RGeneticThesaurus). This package can load variants and thesaurus annotations into R data frames and thus enables several types of variant analyses.

 ####Network analysis####

Annotate modified by Tomasz

Tomasz — Wed, 19 Aug 2015 09:33:28 -0000

--- v4
+++ v5
@@ -1,6 +1,8 @@
 ##Annotating variants using a thesaurus##

 A thesaurus resource is meant to be used to detect and annotate single nucleotide variants in repetitive genomic regions. This process, here called variant filtering, requires access to a thesaurus table, a set of called variants, and a matching alignment file. 
+
+

 ####Filtering variants####

@@ -28,6 +30,7 @@
 The third output file will be labeled *baf* (B-allele frequencies). It will contain a table with one entry per variant. Next to the variant position, the file will convey the B-allele frequency as evaluated by naive read counting at the locus. For variants annotated with the thesaurus, there will also be another B-allele frequency estimate that is based on read counting on all alternate loci.

 The filtering procedure can take several hours and use considerable memory (allow, say, 10 hours and 24 GB of RAM for processing a 30x whole genome dataset). 
+

 ####Evaluating thesaurus output####
@@ -60,8 +63,19 @@
        --output variants.clusters

 Output will consist of a table with one entry per line in the VCF file. Each variant will also be labeled with a cluster number. Variants with identical cluster number are linked with a thesaurus annotation.
-   

+
+
+####Using a variant database####
+
+Sometimes it is useful to know whether links point to previously known sites of variation. To integrate variant ids into a thesaurus analysis, run the vtf annotation tools,
+
+    java -jar GeneticThesaurus.jar annotatevtf
+        --vtf variants.thesaurus.vtf.gz
+        --database dbSNP.vcf.gz
+        --output variants.thesaurus.dbSNP.vtf.gz
+
+The output will consist of a new vtf file, but this new file will incorporate variant ids from the database. 

 ####Tuning filtering####

Annotate modified by Tomasz

Tomasz — Fri, 19 Jun 2015 13:51:41 -0000

--- v3
+++ v4
@@ -78,4 +78,4 @@
 * --hitproportion <INT> - proportion of reads containing the variant that must be consistent with a thesaurus link. 
 * --many <INT> - number of thesaurus links proposed with the standard tolerance settings. When a variant is linked with more sites, then the tolerance parameters are automatically made more strict.
 * --toomany <INT> - maximum number of thesaurus links allowed in the output. Variants with more links are labeled as 'thesaurusmany'
-* --validate <STRING> - parameter passed on to 
+* --validate <STRING> - parameter passed on to parser of SAM records

Annotate modified by Tomasz

Tomasz — Fri, 19 Jun 2015 13:37:47 -0000

--- v2
+++ v3
@@ -63,3 +63,19 @@

+####Tuning filtering####
+
+The thesaurus annotation program can be tuned in a number of ways. You can run the program without any other settings to see a listing of all the available options. 
+
+In brief, they can be used as follows:
+
+* --insertsize <INT> - expected insert size for paired reads. This is important because the positions of read mates are used to eliminate thesaurus link candidates. 
+* --clip <INT> - number of base pairs at read ends that are temporarily clipped. Use this parameter to ignore errors near read extremities. 
+* --minmapqual <INT> - minimum mapping quality. Use this setting to set consistency with the variant calling workflow.
+* --readlen <INT> - length of reads.
+* --tolerance <INT> - number of tolerated mismatches between a read and the reference sequence. Use this parameter to increase/decrease the number of thesaurus links proposed in the output.
+* --maxtolerance <INT> - number of tolerated mismatches between a read and the reference sequence, including called variants. 
+* --hitproportion <INT> - proportion of reads containing the variant that must be consistent with a thesaurus link. 
+* --many <INT> - number of thesaurus links proposed with the standard tolerance settings. When a variant is linked with more sites, then the tolerance parameters are automatically made more strict.
+* --toomany <INT> - maximum number of thesaurus links allowed in the output. Variants with more links are labeled as 'thesaurusmany'
+* --validate <STRING> - parameter passed on to

Annotate modified by Tomasz

Tomasz — Mon, 12 Jan 2015 10:25:08 -0000

--- v1
+++ v2
@@ -1,7 +1,6 @@
 ##Annotating variants using a thesaurus##

-A thesaurus resource is meant to be used to detect and annotate single nucleotide variants in repetitive genomic regions. This process, here called
-variant filtering, requires access to a thesaurus table, a set of called variants, and a matching alignment file. 
+A thesaurus resource is meant to be used to detect and annotate single nucleotide variants in repetitive genomic regions. This process, here called variant filtering, requires access to a thesaurus table, a set of called variants, and a matching alignment file. 

 ####Filtering variants####

Annotate modified by Tomasz

Tomasz — Sat, 03 May 2014 06:45:44 -0000

Annotating variants using a thesaurus

A thesaurus resource is meant to be used to detect and annotate single nucleotide variants in repetitive genomic regions. This process, here called
variant filtering, requires access to a thesaurus table, a set of called variants, and a matching alignment file.

Filtering variants

To annotate an existing variant call file, use the command

java -jar GeneticThesaurus.jar filter
    --genome your.genome.fa
    --bam alignment.bam 
    --vcf variants.vcf
    --thesaurus thesaurus.tsv.gz
    --output variants.thesaurus

This should produce three output files called variants.thesaurus.vcf.gz, variants.thesaurus.vtf.gz, and variants.thesaurus.baf.tsv.gz (all files are automatically compressed).

The first file will be in conventional VCF format. It will preserve all information from the original (including filter status), but will also update each variant's status in the filter column. The new filter codes are

thesaurus - the variant can be linked with another locus;
thesaurusmany - the variant can be linked to a large number of loci;
thesaurushard - the variant lies in a region present in the thesaurus, but the read structure is complex and the alternate loci could not be computed.

In addition to the filter codes, each variant in the VCF file will be annotated with a new numeric tag TS (Thesaurus Synonyms) in the sample genotype column. The tag will show the number of alternate sites associated with the variant (0 for variants in non-repetitive regions, larger than 0 for annotated variants).
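As a rough illustration of how the filter codes and the TS tag could be read back from the annotated VCF, here is a minimal Python sketch. It relies only on the standard VCF column layout (FILTER is the 7th column, FORMAT the 9th, the first sample the 10th). The function name `thesaurus_fields` and the example line are hypothetical and not part of GeneticThesaurus.

```python
def thesaurus_fields(vcf_line):
    """Return (filter_codes, ts) for one tab-separated VCF data line.

    filter_codes is the list of entries in the FILTER column;
    ts is the integer TS value of the first sample, or None if absent.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    filter_codes = cols[6].split(";")      # FILTER entries are ';'-separated
    keys = cols[8].split(":")              # FORMAT keys, e.g. GT:TS
    values = cols[9].split(":")            # first sample's values
    sample = dict(zip(keys, values))
    ts = int(sample["TS"]) if "TS" in sample else None
    return filter_codes, ts


# Hypothetical example line, written here only for illustration
example = "chr1\t100\t.\tA\tG\t50\tthesaurus\t.\tGT:TS\t0/1:2"
codes, ts = thesaurus_fields(example)
```

Header lines (those starting with '#') would need to be skipped before calling this function.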

The second file will be labeled VTF (Variant Thesaurus File). It will consist of one line per variant annotated with the thesaurus filter. Each line will contain two or more tab-separated elements: the first is the locus of a called variant, and the subsequent elements are the alternate variant sites.
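The line layout described above can be read with a few lines of Python. This is only a sketch: it treats each locus as an opaque string because the exact locus syntax is not specified here, and the name `parse_vtf_line` and the example loci are hypothetical.

```python
def parse_vtf_line(line):
    """Split one VTF line into (called_locus, [alternate_loci])."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least two tab-separated elements")
    # First element: locus of the called variant; the rest: alternate sites
    return fields[0], fields[1:]


# Hypothetical locus strings, used only to show the shape of the result
locus, alternates = parse_vtf_line("chr1:100\tchr2:200\tchr3:300\n")
```

Because the file is gzip-compressed, the lines could be obtained with `gzip.open(path, "rt")` from the Python standard library.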

The third output file will be labeled baf (B-allele frequencies). It will contain a table with one entry per variant. Next to the variant position, the file will convey the B-allele frequency as evaluated by naive read counting at the locus. For variants annotated with the thesaurus, there will also be another B-allele frequency estimate that is based on read counting on all alternate loci.
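To make the two estimates concrete, the sketch below computes a B-allele frequency from read counts at one locus and a pooled estimate over several loci. It assumes that the pooled estimate simply sums the counts over the called locus and its alternate loci; the exact formula used by the program is not stated here, so treat this as an illustration only. The function names are hypothetical.

```python
def baf(alt_reads, total_reads):
    """Naive B-allele frequency: fraction of reads carrying the variant."""
    if total_reads == 0:
        return float("nan")
    return alt_reads / total_reads


def pooled_baf(counts):
    """Pooled estimate over several loci.

    counts is a list of (alt_reads, total_reads) pairs, one per locus.
    Assumption: counts are summed across loci before dividing.
    """
    alt = sum(a for a, _ in counts)
    total = sum(t for _, t in counts)
    return baf(alt, total)


naive = baf(5, 20)                       # one locus only
pooled = pooled_baf([(5, 20), (5, 30)])  # called locus plus one alternate
```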

The filtering procedure can take several hours and use considerable memory (allow, say, 10 hours and 24 GB of RAM for processing a 30x whole genome dataset).

Evaluating thesaurus output

To evaluate the performance of thesaurus annotation, you can compare sets of variant calls,

java -jar GeneticThesaurus.jar compare
    --genome your.genome.fa
    --ref true.variants.vcf.gz
    --vcf called.variants.vcf.gz
    --synonyms called.variants.vtf.gz

This program requires two VCF files, one of which is treated as a ground truth and the other as a set of called variants. The program also accepts a VTF file with alternate loci. The output consists of multiple files separating variants into groups:

True Positives (TP) - sites in the VCF that match the ground truth;
Thesaurus True Positives (TTP) - sites in the VCF that do not match the ground truth, but for which one of the annotated alternate sites matches the ground truth;
False Positives (FP) - sites in the VCF which do not match the ground truth and for which none of the alternate sites matches the ground truth;
False Negatives (FN) - sites in the ground truth set that are not recorded in the VCF or in the synonyms file.
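The four groups above can be expressed as a small classification routine. The sketch below is not the program's implementation: it assumes that a "match" means identical locus strings (ignoring alleles), and the function name `classify` and the toy inputs are hypothetical.

```python
def classify(truth, called, synonyms):
    """Sort sites into TP, TTP, FP and FN groups.

    truth    - iterable of ground-truth sites
    called   - iterable of sites in the called VCF
    synonyms - dict mapping a called site to its alternate sites (from the VTF)
    """
    truth = set(truth)
    groups = {"TP": [], "TTP": [], "FP": [], "FN": []}
    covered = set()  # every site recorded in the VCF or the synonyms file
    for site in called:
        alternates = synonyms.get(site, [])
        covered.add(site)
        covered.update(alternates)
        if site in truth:
            groups["TP"].append(site)
        elif any(alt in truth for alt in alternates):
            groups["TTP"].append(site)
        else:
            groups["FP"].append(site)
    # Ground-truth sites that appear nowhere in the calls or their synonyms
    groups["FN"] = sorted(truth - covered)
    return groups


# Toy input with made-up site names
result = classify(
    truth=["a", "b", "c"],
    called=["a", "x", "y"],
    synonyms={"x": ["b"], "y": ["z"]},
)
```

In this toy input, "x" is rescued as a thesaurus true positive because its alternate site "b" is in the ground truth, while "c" is missed entirely.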

Network analysis

Thesaurus annotation creates conceptual links between multiple sites in the genome. Sometimes, annotation can link multiple sites in a VCF file together - this can arise if evidence for a true variant is distributed onto, say, two genomic sites which are identified separately during variant calling. To identify such clusters of variants, run the network program,

java -jar GeneticThesaurus.jar network
    --genome your.genome.fa
    --vcf variants.thesaurus.vcf.gz
    --vtf variants.thesaurus.vtf.gz
    --output variants.clusters

Output will consist of a table with one entry per line in the VCF file. Each variant will also be labeled with a cluster number. Variants with the same cluster number are linked by a thesaurus annotation.
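One way to picture the cluster labels is as connected components of a graph whose edges join each called variant to its alternate sites. The Python sketch below uses a union-find structure under that assumption; the actual clustering rule of the network program may differ, and the name `cluster_variants` and the toy inputs are hypothetical.

```python
def cluster_variants(variants, links):
    """Assign a cluster number to each variant.

    variants - list of called sites (one per VCF line)
    links    - dict mapping a called site to its alternate sites
    Sites connected through any chain of links share a cluster number.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for locus, alternates in links.items():
        for alt in alternates:
            union(locus, alt)

    labels = {}    # component root -> cluster number
    clusters = {}  # variant -> cluster number
    for variant in variants:
        root = find(variant)
        clusters[variant] = labels.setdefault(root, len(labels) + 1)
    return clusters


# Toy input: "a" and "b" point at each other, "c" has no links
clusters = cluster_variants(["a", "b", "c"], {"a": ["b"], "b": ["a"]})
```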