<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to Example</title><link>https://sourceforge.net/p/geneticthesaurus/wiki/Example/</link><description>Recent changes to Example</description><atom:link href="https://sourceforge.net/p/geneticthesaurus/wiki/Example/feed" rel="self"/><language>en</language><lastBuildDate>Sat, 03 May 2014 06:52:26 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/geneticthesaurus/wiki/Example/feed" rel="self" type="application/rss+xml"/><item><title>Example modified by Tomasz</title><link>https://sourceforge.net/p/geneticthesaurus/wiki/Example/</link><description>&lt;div class="markdown_content"&gt;&lt;h2 id="example"&gt;Example&lt;/h2&gt;
&lt;p&gt;To see the GeneticThesaurus in action, you can try it out on an example. The &lt;a class="" href="http://sourceforge.net/projects/geneticthesaurus/files/Example/example.PE.zip/download"&gt;example zip file&lt;/a&gt; contains almost all files necessary to carry out a complete thesaurus annotation workflow (you will need to provide your own fasta file for the hg19 genome).&lt;/p&gt;
&lt;p&gt;File &lt;em&gt;example.truth.vcf&lt;/em&gt; is a definition of a true site of variation on chromosome 1.&lt;/p&gt;
&lt;p&gt;File &lt;em&gt;example.PE.bam&lt;/em&gt; contains 124 SAM records mapped at two genomic regions on chromosome 1. The reads are synthetically generated so that the true variant is encoded into the read sequences. You can verify that some reads are misaligned: their mapping position does not correspond to the position of origin encoded in the read name.&lt;/p&gt;
&lt;p&gt;File &lt;em&gt;example.thesaurus.tsv&lt;/em&gt; contains a small thesaurus table that describes the two genomic regions covered in the alignment.&lt;/p&gt;
&lt;p&gt;The remaining files represent the output of a variant discovery workflow. These are files you can reproduce using the software.&lt;/p&gt;
&lt;h4 id="initial-variant-calling"&gt;Initial variant calling&lt;/h4&gt;
&lt;p&gt;To begin, call variants from the bam file. At this stage, it is important to instruct the variant caller to attempt calls even in low-mappability regions. With Bamformatics, this can be done as follows (you need to provide your own hg19.fa file):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span class="n"&gt;java&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt; &lt;span class="n"&gt;Bamformatics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt; &lt;span class="n"&gt;callvariants&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;genome&lt;/span&gt; &lt;span class="n"&gt;hg19&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fa&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;bam&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bam&lt;/span&gt;    
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vcf&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;minmapqual&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Setting the --minmapqual option is important here because reads with low mapping quality are ignored by default.&lt;/p&gt;
&lt;p&gt;After variant calling, you should have a VCF file with a raw set of calls. Compare it with the raw calls (two variants) provided in the zip file (example.PE.vcf).&lt;/p&gt;
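&lt;p&gt;The comparison can be done by eye for two variants, but a small script may help on larger call sets. The following is a minimal Python sketch, not part of GeneticThesaurus or Bamformatics; the function names read_sites and compare_sites are hypothetical. It assumes standard tab-separated VCF records and compares only the chromosome, position, reference and alternate columns.&lt;/p&gt;

```python
import gzip


def read_sites(path):
    """Return the set of (chrom, pos, ref, alt) tuples found in a VCF file."""
    # Assumption: files ending in .gz are gzip-compressed text.
    opener = gzip.open if path.endswith(".gz") else open
    sites = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header lines and blank lines
            fields = line.rstrip("\n").split("\t")
            # VCF columns: CHROM, POS, ID, REF, ALT, ...
            sites.add((fields[0], int(fields[1]), fields[3], fields[4]))
    return sites


def compare_sites(mine, reference):
    """Return (only in mine, only in reference, shared) as sorted lists."""
    a = read_sites(mine)
    b = read_sites(reference)
    return sorted(a - b), sorted(b - a), sorted(a.intersection(b))
```

&lt;p&gt;Calling compare_sites("calls.vcf", "example.PE.vcf") would return three lists; if the run reproduced the provided calls, the first two lists should be empty.&lt;/p&gt;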
&lt;h4 id="filtering-with-the-thesaurus"&gt;Filtering with the thesaurus&lt;/h4&gt;
&lt;p&gt;Next, annotate the variants using the thesaurus, for example:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span class="n"&gt;java&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt; &lt;span class="n"&gt;GeneticThesaurus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jar&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;genome&lt;/span&gt; &lt;span class="n"&gt;hg19&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fa&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;thesaurus&lt;/span&gt; &lt;span class="n"&gt;small&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;bam&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bam&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;vcf&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vcf&lt;/span&gt;
    &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thesaurus&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This should run very quickly and produce three output files ending in vcf.gz, vtf.gz, and baf.tsv.gz. You can decompress them with gzip if you like. Compare the results with the similarly named files provided in the zip file.&lt;/p&gt;
&lt;h4 id="interpretation"&gt;Interpretation&lt;/h4&gt;
&lt;p&gt;Consider the output files provided in the zip file. The raw calls consist of two variants, of which one is true and the other is false.&lt;/p&gt;
&lt;p&gt;After annotation, we know that the two raw calls come from reads that are ambiguously mapped. We also know the alternative sites for each variant and can deduce that they are in fact linked together. We can thus interpret the raw calls as a cluster of evidence for a single variant, which reduces the number of false positives.&lt;/p&gt;
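&lt;p&gt;The idea of treating linked calls as one cluster can be sketched in a few lines of Python. This is an illustration only and not the GeneticThesaurus implementation: the function cluster_calls is hypothetical, and it assumes the annotation has already been reduced to an in-memory mapping from each called site to its list of alternative sites (it does not parse the vtf format). Two called sites fall into the same cluster when one lists the other as an alternative.&lt;/p&gt;

```python
def cluster_calls(alternatives):
    """Group called sites that are linked through their alternative sites.

    alternatives maps each called site to the alternative sites reported
    for it. Returns a sorted list of clusters, each a sorted list of sites.
    """
    # Union-find structure: every called site starts as its own cluster.
    parent = {site: site for site in alternatives}

    def find(site):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[site] != site:
            parent[site] = parent[parent[site]]
            site = parent[site]
        return site

    for site, alts in alternatives.items():
        for alt in alts:
            if alt in parent:  # link only to sites that were also called
                parent[find(alt)] = find(site)

    clusters = {}
    for site in alternatives:
        clusters.setdefault(find(site), []).append(site)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical coordinates, chosen for illustration only.
calls = {
    ("chr1", 1000): [("chr1", 5000)],
    ("chr1", 5000): [("chr1", 1000)],
    ("chr1", 9000): [],
}
clusters = cluster_calls(calls)
```

&lt;p&gt;In this made-up input the first two calls point at each other and form one cluster, while the third call stands alone, so three raw calls would be read as evidence for two variants.&lt;/p&gt;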
&lt;p&gt;After annotation, we still cannot infer whether the true variant is located at the first locus or the second. For that, we would need to perform a different sequencing experiment.&lt;/p&gt;
&lt;h4 id="further-examples"&gt;Further examples&lt;/h4&gt;
&lt;p&gt;The data in this example is a subset of a larger test dataset described in the manuscript. &lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tomasz</dc:creator><pubDate>Sat, 03 May 2014 06:52:26 -0000</pubDate><guid>https://sourceforge.netb32a0c36825dd8906c3b90f303457c52fd2207e7</guid></item></channel></rss>