Recent changes to Genome Building

Genome Building modified by Mark_HIlls

Mark_HIlls — Fri, 05 Jul 2013 17:14:00 -0000

--- v8
+++ v9
@@ -48,12 +48,12 @@
 Table of fragment order
 -----------------------

-A table is generated of all the fragments for each linkage group. This is in the form of a list, with a dissimilarity value (a measure of 'mitotic distance') and the number of libraries in which this fragment is present.
+A table is generated of all the fragments for each linkage group. This is in the form of a BED file, with the fragment name, start and finish, direction, a dissimilarity value (a measure of 'mitotic distance') and the number of libraries in which this fragment is present. This file can be fed into the BAIT fastq generator to create a draft assembly based on BAIT predictions.

 Table of orphans
 ----------------

-For all fragments that do not cluster with any other fragments, a record is kept and printed as a table of orphan fragments.  It is possible that once genomes have been built using the genome building function of BAIT, the orphan fragment localization function may further refine the location of these fragments.
+For all fragments that do not cluster with any other fragments, a record is kept and printed as a table of orphan fragments.  It is possible that once genomes have been built using the genome building function of BAIT, the [orphan fragment localization function](https://sourceforge.net/p/bait/wiki/Orphan%20Fragment%20Localization/ "link to orphan fragment tutorial") may further refine the location of these fragments.

 Future Updates
 ==============
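The fragment-order table described in this revision can be read with a few lines of Python. This is only a sketch: the tab-separated layout, the column order (name, start, end, direction, dissimilarity, number of libraries) and the example rows are assumptions made for illustration, not BAIT's documented output format.

```python
import csv
import io
from typing import NamedTuple


class Fragment(NamedTuple):
    name: str
    start: int
    end: int
    strand: str           # direction, assumed to be "+" or "-"
    dissimilarity: float  # the 'mitotic distance' measure
    n_libraries: int      # libraries in which the fragment is present


def read_fragment_table(handle):
    """Parse a tab-separated fragment-order table (hypothetical column order)."""
    fragments = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record or record[0].startswith("#"):
            continue  # skip blank lines and comment lines
        name, start, end, strand, dissimilarity, n_libraries = record[:6]
        fragments.append(
            Fragment(name, int(start), int(end), strand,
                     float(dissimilarity), int(n_libraries))
        )
    return fragments


# Made-up rows, for illustration only.
example = io.StringIO(
    "contig_12\t0\t48210\t+\t0.00\t47\n"
    "contig_7\t0\t15300\t-\t0.04\t50\n"
)
fragments = read_fragment_table(example)
for fragment in fragments:
    print(fragment.name, fragment.strand, fragment.end - fragment.start)
```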

Genome Building modified by Mark_HIlls

Mark_HIlls — Fri, 05 Jul 2013 00:15:08 -0000

--- v7
+++ v8
@@ -43,6 +43,8 @@

 Since scaffolds that are present on different chromosomes may still affect the order of scaffolds on the same chromosomes (by chance some fragments may be more concordant than others), BAIT splits the cluster tree into a discrete number of clusters and recomputes the order of fragments without the influence of other linkage groups.  If it finds multiple sub-clusters it will further divide these linkage groups.

+![Sub-clustered heat map for chrY](http://i295.photobucket.com/albums/mm141/rareaquaticbadger/BAIT/BAIT_HEATMAP_2013-03-14_Page_70_zpsb793976e.jpg "chrY-cluster heat map")
+
 Table of fragment order
 -----------------------

Genome Building modified by Mark_HIlls

Mark_HIlls — Thu, 04 Jul 2013 23:58:36 -0000

--- v6
+++ v7
@@ -34,7 +34,9 @@
 Heatmap; global
 ---------------

-The global heatmap gives an overall view of the clusters generated by the analysis.  If Strand-seq has been successful, each cluster should represent fragments derived from the same chromosome. The fragment names are often too small to be read on the heatmap, but are printed separately.
+The global heatmap gives an overall view of the clusters generated by the analysis.  If Strand-seq has been successful, each cluster should represent fragments derived from the same chromosome. The fragment names are often too small to be read on the heatmap, but are printed separately into a table.
+
+![Sample heat map from mm9 contigs](http://i295.photobucket.com/albums/mm141/rareaquaticbadger/BAIT/BAIT_HEATMAP_2013-02-25_Page_1_zps804305e0.jpg "heat map for all contigs of mm9")

 Heatmap; per linkage group
 --------------------------

Genome Building modified by Mark_HIlls

Mark_HIlls — Thu, 04 Jul 2013 23:21:53 -0000

--- v5
+++ v6
@@ -1,7 +1,19 @@
 Introduction
 =============

-The common model organisms have been extensively sequenced and can be considered 'complete' (aside from [misorientations](https://sourceforge.net/p/bait/wiki/Identification%20of%20misorients/ "link to misorientation tutorial") and [orphan fragments](https://sourceforge.net/p/bait/wiki/Orphan%20Fragment%20Localization/ "link to orphan fragment tutorial")).
+The common model organisms have been extensively sequenced and can be considered 'complete' (aside from [misorientations](https://sourceforge.net/p/bait/wiki/Sister%20Chromatid%20Exchange/ "link to SCE tutorial") and [orphan fragments](https://sourceforge.net/p/bait/wiki/Orphan%20Fragment%20Localization/ "link to orphan fragment tutorial")).  However, a large number of early-stage builds are in varying stages of development for many organisms.  Generally, genome builds can be classified into three main categories:
+
+1. **Scaffold stage.**  Scaffold stage genomes tend to have many thousands of contigs which have yet to be built into full chromosomes.  These fragments can be considered unplaced or structured into linkage groups.
+
++ **Chromosome stage.** Chromosome stage genomes have a chromosomal scaffold, and a combination of unplaced and unlocalized contigs, where the former are contigs that are completely unknown, and the latter are contigs that have been mapped to a particular chromosome but not a particular location.  The contigs that are ordered into the chromosomal scaffold are often separated by unbridged gaps, and can be incorrectly oriented.
+
++ **Complete.** A complete genome build has mostly finished chromosomes with few sequence gaps and few orphan scaffolds.  These genomes tend not to need rebuilding using this function of BAIT, but can benefit from mis-orientation and orphan fragment analysis. 
+
+BAIT differs from typical scaffolders in that it does not look for sequence overlap to order contigs; it uses strand inheritance as a signature instead.  There are nevertheless parallels between BAIT and an assembler.  A regular scaffolding algorithm searches each contig for a particular signature, sequence overlap, and stitches together any contigs with enough overlap to form a supercontig.  BAIT takes the inherited template strand as its signature.  If there are 100 contigs that make up chromosome 1, and the cell being sequenced has inherited both Watson templates for chr1, then all 100 contigs should be WW, and if a second cell inherited both Crick templates for chr1, then all 100 contigs should be CC.  In this way, all 100 contigs should always have the same state if they are derived from the same chromosome (i.e. their concordance will be 100%).  In an organism with multiple chromosomes, each chromosome has a 25% chance of being WW, a 50% chance of being WC and a 25% chance of being CC, and therefore on average any two contigs chosen at random will, in the libraries where both are WW or CC, show the same inheritance pattern 50% of the time (i.e. their concordance will be 50%; if WC calls are counted as well, the expected figure is 37.5%). By incorporating multiple libraries into the analysis, all the contigs that are present on the same chromosome will tend toward 100% concordance, forming a linkage group. Each chromosome should form its own linkage group, with the concordance **within** each group ~100%, and the concordance **between** each group ~50%.
+
+Scaffolding software will look for identical overlapping sequences to stitch contigs together.  It will also look for reverse-complement matching sequences, as it is possible that two contigs are mis-oriented with respect to each other.  In this case, one of the contigs is flipped to make the overlapping sequence identical and the two are then stitched together.  BAIT uses a similar strategy.  If a contig is mis-oriented with respect to its neighbour, the strand inheritance pattern will be reversed; WW will become CC, CC will become WW, but WC will remain WC. By excluding the WC fragments, mis-oriented contigs will have the same inheritance pattern 0% of the time.  Therefore, we have a situation where correctly oriented contigs have 100% concordance, incorrectly oriented contigs have 0% concordance, and random contigs have 50% concordance.  Using this, BAIT can flip mis-oriented contigs and cluster them correctly.
+
+Once formed into linkage groups, contigs can be considered as going from 'unplaced' to 'localized', at least with respect to each other.  They can be further hierarchically clustered using sister chromatid exchange (SCE).  SCE events will reshuffle template strands within a particular library.  For example, if there are 100 chr1 contigs, and analysis is performed on 50 libraries, and in one library there is an SCE between contigs 80 and 81, then contigs 1 to 80 will have 100% concordance with each other, and contigs 81 to 100 will have 100% concordance with each other, but contigs 1 to 80 will only have 98% concordance (49/50) with contigs 81 to 100.  Without prior knowledge of contig order, it is possible to infer distance based on the concordance.  In this way, these analyses can be considered similar to genetic mapping using linkage analysis, where meiotic recombination is responsible for reshuffling a signature (minisatellites) and, assuming a constant rate of recombination, a distance measured in centimorgans per megabase can be estimated. Here mitotic recombination is responsible for the reshuffling of a signature (template state) and, assuming a constant rate, a distance can also be estimated.

 Typical Run
 ===========
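The concordance argument in the Introduction added by this revision can be checked with a small Python sketch. The function names, the list-of-states representation and the example data are assumptions for illustration; this is not BAIT's own code.

```python
from fractions import Fraction


def concordance(a, b, exclude_wc=True):
    """Fraction of libraries in which two contigs share a template state.

    a and b are per-library state lists ("WW", "WC", "CC" or None for no call).
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if exclude_wc:
        pairs = [(x, y) for x, y in pairs if x != "WC" and y != "WC"]
    if not pairs:
        return None
    return sum(x == y for x, y in pairs) / len(pairs)


def flip(states):
    """States of the same contig if it were assembled in the opposite orientation."""
    swap = {"WW": "CC", "CC": "WW", "WC": "WC"}
    return [swap.get(s) for s in states]


# Same chromosome, same orientation: 100%. Mis-oriented, WC excluded: 0%.
a = ["WW", "CC", "WC", "CC", "WW"]
print(concordance(a, a))        # 1.0
print(concordance(a, flip(a)))  # 0.0

# One SCE in 50 libraries between two contigs: 49/50 libraries still agree.
# WC calls are kept here because the SCE turns one side into WC.
left = ["WW"] * 25 + ["CC"] * 25
right = list(left)
right[0] = "WC"
print(concordance(left, right, exclude_wc=False))  # 0.98

# Expected agreement for two contigs on different chromosomes.
p = {"WW": Fraction(1, 4), "WC": Fraction(1, 2), "CC": Fraction(1, 4)}
p_same_all = sum(v * v for v in p.values())
p_same_homozygous = (p["WW"] ** 2 + p["CC"] ** 2) / (p["WW"] + p["CC"]) ** 2
print(p_same_all, p_same_homozygous)  # 3/8 1/2
```

The last two numbers show why WC calls are set aside when looking for linkage groups and orientation: only among WW and CC calls does an unrelated pair agree exactly half the time.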

Genome Building modified by Mark_HIlls

Mark_HIlls — Thu, 04 Jul 2013 21:44:09 -0000

--- v4
+++ v5
@@ -47,13 +47,12 @@
 + A similar plotting function for completed genomes in which there are >100 orphan fragments is also planned.

-
 **Jump to:**

 [Wiki Main Page](https://sourceforge.net/p/bait/wiki/Home/ "Wiki main page")
 [What is Strand-seq and how does it work?](https://sourceforge.net/p/bait/wiki/Introduction%20to%20Strand-seq%20and%20BAIT/ "A brief outline of Strand-seq and the BAIT pipeline")
 [Tutorial for strand inheritance studies](https://sourceforge.net/p/bait/wiki/Strand%20Inheritance/ "For immortal strand theory / silent sister hypothesis / epigenetic projects")
 [Tutorial for sister chromatid exchange studies](https://sourceforge.net/p/bait/wiki/Sister%20Chromatid%20Exchange/ "For localization and counts of SCE, and comparison of SCE locations to genomic landscapes")
-[Tutorial for identifying misorientations](https://sourceforge.net/p/bait/wiki/Identification%20of%20misorients/ "For finding large sequence mis-assemblies, correcting completed genomes and bridging unbridged gap regions")
+[Tutorial for identifying genomic rearrangements](https://sourceforge.net/p/bait/wiki/Identification%20of%20genomic%20rearrangements/ "For finding large genomic alterations including translocations, inversions and deletions")
 [Tutorial for localization of orphan fragments](https://sourceforge.net/p/bait/wiki/Orphan%20Fragment%20Localization/ "For localizing unplaced and unlocalized scaffolds in chromosome/complete-stage genomes")
 [Tutorial for building early stage genomes](https://sourceforge.net/p/bait/wiki/Genome%20Building/ "For clustering contigs from scaffold/chromosome-stage genomes into chromosomes and inferring relative orders")

Genome Building modified by Mark_HIlls

Mark_HIlls — Wed, 03 Jul 2013 17:59:29 -0000

--- v3
+++ v4
@@ -1,5 +1,7 @@
 Introduction
 =============
+
+The common model organisms have been extensively sequenced and can be considered 'complete' (aside from [misorientations](https://sourceforge.net/p/bait/wiki/Identification%20of%20misorients/ "link to misorientation tutorial") and [orphan fragments](https://sourceforge.net/p/bait/wiki/Orphan%20Fragment%20Localization/ "link to orphan fragment tutorial")).

 Typical Run
 ===========
@@ -7,10 +9,35 @@
     BAIT -A 2 -kv

 -A 2
->The Assembly option triggers BAIT to specifically count contigs and attempt to order scaffolds correctly. This option bypasses most BAIT functions and simply calculates the frequency of Watson and Crick reads for each fragment for each library.
+>The Assembly option triggers BAIT to specifically count contigs and attempt to order scaffolds correctly. This option bypasses most BAIT functions and simply calculates the frequency of Watson and Crick reads for each fragment for each library. These data are then filtered in two directions.  First, any library in which all fragments are WC (indicating unsuccessful Strand-seq) or NA (indicating a low-read library) is excluded.  Second, any fragment in which all reads are WC (indicating simple sequence in the fragment) or NA (indicating a hard-to-sequence or small fragment) is excluded.  A further check of background is made by comparing the ratio of Watson to Crick reads in each library.  This ratio should be 1.0 for WW, 0 for WC and -1 for CC. Background is measured by assessing the deviation away from those numbers, and any library with a background above 10% is excluded.
+
+-k
+>The "keep" option keeps all intermediary files.  Since the genome building pipeline is still in beta, it is recommended to use this option so that the time-consuming analysis is not lost in the unlikely event of a crash or bug.
+
+![BAIT pipeline for building scaffold-stage genomes](http://i295.photobucket.com/albums/mm141/rareaquaticbadger/BAIT/Slide3_zps4c7c9376.jpg "BAIT pipeline for building scaffold-stage genomes")

 Output Files
 ============
+
+Heatmap; global
+---------------
+
+The global heatmap gives an overall view of the clusters generated by the analysis.  If Strand-seq has been successful, each cluster should represent fragments derived from the same chromosome. The fragment names are often too small to be read on the heatmap, but are printed separately.
+
+Heatmap; per linkage group
+--------------------------
+
+Since scaffolds that are present on different chromosomes may still affect the order of scaffolds on the same chromosomes (by chance some fragments may be more concordant than others), BAIT splits the cluster tree into a discrete number of clusters and recomputes the order of fragments without the influence of other linkage groups.  If it finds multiple sub-clusters it will further divide these linkage groups.
+
+Table of fragment order
+-----------------------
+
+A table is generated of all the fragments for each linkage group. This is in the form of a list, with a dissimilarity value (a measure of 'mitotic distance') and the number of libraries in which this fragment is present.
+
+Table of orphans
+----------------
+
+For all fragments that do not cluster with any other fragments, a record is kept and printed as a table of orphan fragments.  It is possible that once genomes have been built using the genome building function of BAIT, the orphan fragment localization function may further refine the location of these fragments.

 Future Updates
 ==============
@@ -18,3 +45,15 @@
 + A new version of this program is in beta, in which the software feeds in only 500 contigs at a time to overcome a bug where genomes with many fragments (>20,000) crash the program because all the data are stored in RAM. The new version computes dissimilarities in batches.
 + The new version of this program takes a different approach to collating and ordering contigs. It first 'collapses' all clusters into primary linkage groups, then looks for dissimilarities to see whether any of the primary linkage groups are on the same chromosome but oriented in a different direction. This strategy involves less computing than trying to identify the orientation of each contig with respect to all other contigs, and should be more robust. After reorientation, primary linkage clusters will be 'uncollapsed' and the relative order of each contig will be computed both by heatmap clustering (using hclust) and by a travelling salesperson approach.
 + A similar plotting function for completed genomes in which there are >100 orphan fragments is also planned.
+
+
+
+**Jump to:**
+
+[Wiki Main Page](https://sourceforge.net/p/bait/wiki/Home/ "Wiki main page")
+[What is Strand-seq and how does it work?](https://sourceforge.net/p/bait/wiki/Introduction%20to%20Strand-seq%20and%20BAIT/ "A brief outline of Strand-seq and the BAIT pipeline")
+[Tutorial for strand inheritance studies](https://sourceforge.net/p/bait/wiki/Strand%20Inheritance/ "For immortal strand theory / silent sister hypothesis / epigenetic projects")
+[Tutorial for sister chromatid exchange studies](https://sourceforge.net/p/bait/wiki/Sister%20Chromatid%20Exchange/ "For localization and counts of SCE, and comparison of SCE locations to genomic landscapes")
+[Tutorial for identifying misorientations](https://sourceforge.net/p/bait/wiki/Identification%20of%20misorients/ "For finding large sequence mis-assemblies, correcting completed genomes and bridging unbridged gap regions")
+[Tutorial for localization of orphan fragments](https://sourceforge.net/p/bait/wiki/Orphan%20Fragment%20Localization/ "For localizing unplaced and unlocalized scaffolds in chromosome/complete-stage genomes")
+[Tutorial for building early stage genomes](https://sourceforge.net/p/bait/wiki/Genome%20Building/ "For clustering contigs from scaffold/chromosome-stage genomes into chromosomes and inferring relative orders")
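The two-direction filtering and the background check described under `-A 2` in this revision can be sketched in Python as follows. Several details are assumptions rather than BAIT's actual behaviour: the ratio is taken to be (W - C) / (W + C) so that it runs from 1.0 (WW) through 0 (WC) to -1 (CC), the calling tolerance `tol` is hypothetical, a library is dropped when it has no WW or CC call at all, and library background is taken as the mean deviation across fragments.

```python
def strand_ratio(w, c):
    """Assumed ratio: 1.0 for all-Watson, 0 for balanced, -1 for all-Crick."""
    total = w + c
    return None if total == 0 else (w - c) / total


def call_state(ratio, tol=0.2):
    """Call WW/WC/CC from the ratio; anything unclear is NA (tol is hypothetical)."""
    if ratio is None:
        return "NA"
    if ratio >= 1 - tol:
        return "WW"
    if ratio <= -1 + tol:
        return "CC"
    if abs(ratio) <= tol:
        return "WC"
    return "NA"


def background(ratio):
    """Deviation of a ratio from the nearest expected value (1, 0 or -1)."""
    return min(abs(ratio - target) for target in (1.0, 0.0, -1.0))


def filter_counts(counts, max_background=0.10):
    """counts maps library -> fragment -> (watson_reads, crick_reads)."""
    kept = {}
    for library, fragments in counts.items():
        ratios = {f: strand_ratio(w, c) for f, (w, c) in fragments.items()}
        states = {f: call_state(r) for f, r in ratios.items()}
        if not any(s in ("WW", "CC") for s in states.values()):
            continue  # every fragment is WC or NA: uninformative library
        deviations = [background(r) for r in ratios.values() if r is not None]
        if sum(deviations) / len(deviations) > max_background:
            continue  # library background above the 10% threshold
        kept[library] = states
    # Second direction: drop fragments that are WC or NA in every kept library.
    informative = {
        f for states in kept.values() for f, s in states.items() if s in ("WW", "CC")
    }
    return {
        library: {f: s for f, s in states.items() if f in informative}
        for library, states in kept.items()
    }


# Made-up read counts, for illustration only.
counts = {
    "lib1": {"f1": (100, 2), "f2": (3, 97), "f3": (50, 50)},
    "lib2": {"f1": (50, 50), "f2": (48, 52), "f3": (51, 49)},  # all WC
    "lib3": {"f1": (92, 8), "f2": (9, 91), "f3": (58, 42)},    # high background
    "lib4": {"f1": (1, 99), "f2": (98, 2), "f3": (49, 51)},
}
result = filter_counts(counts)
print(result)
```

With these counts, lib2 and lib3 are excluded, and f3 is dropped because it is WC in both remaining libraries.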

Genome Building modified by Mark_HIlls

Mark_HIlls — Wed, 03 Jul 2013 00:25:20 -0000

--- v2
+++ v3
@@ -15,6 +15,6 @@
 Future Updates
 ==============

-+A new version of this program is in beta, in which the software feeds in only 500 contigs at a time to overcome a bug where genomes with many fragments (>20,000) crash the program because all the data are stored in RAM. The new version computes dissimilarities in batches.
-+The new version of this program takes a different approach to collating and ordering contigs. It first 'collapses' all clusters into primary linkage groups, then looks for dissimilarities to see whether any of the primary linkage groups are on the same chromosome but oriented in a different direction. This strategy involves less computing than trying to identify the orientation of each contig with respect to all other contigs, and should be more robust. After reorientation, primary linkage clusters will be 'uncollapsed' and the relative order of each contig will be computed both by heatmap clustering (using hclust) and by a travelling salesperson approach.
-+A similar plotting function for completed genomes in which there are >100 orphan fragments is also planned.
++ A new version of this program is in beta, in which the software feeds in only 500 contigs at a time to overcome a bug where genomes with many fragments (>20,000) crash the program because all the data are stored in RAM. The new version computes dissimilarities in batches.
++ The new version of this program takes a different approach to collating and ordering contigs. It first 'collapses' all clusters into primary linkage groups, then looks for dissimilarities to see whether any of the primary linkage groups are on the same chromosome but oriented in a different direction. This strategy involves less computing than trying to identify the orientation of each contig with respect to all other contigs, and should be more robust. After reorientation, primary linkage clusters will be 'uncollapsed' and the relative order of each contig will be computed both by heatmap clustering (using hclust) and by a travelling salesperson approach.
++ A similar plotting function for completed genomes in which there are >100 orphan fragments is also planned.
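The idea of computing dissimilarities in batches, mentioned in the first Future Updates item, might look like the Python sketch below. It is an illustration under assumptions, not the beta implementation: dissimilarity is taken to be 1 minus the fraction of shared libraries with matching states, and the function names and example data are hypothetical. Each yielded block covers at most `batch_size` x `batch_size` contig pairs, so a caller could write it to disk before requesting the next one.

```python
def dissimilarity(a, b):
    """1 - fraction of libraries (with calls for both contigs) whose states match."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return 1 - sum(x == y for x, y in shared) / len(shared)


def batched_dissimilarities(states, batch_size=500):
    """Yield pairwise dissimilarities block by block; each pair appears once."""
    names = list(states)
    starts = range(0, len(names), batch_size)
    for s1 in starts:
        for s2 in starts:
            if s2 < s1:
                continue  # the mirror-image block was already produced
            block = {}
            for i in range(s1, min(s1 + batch_size, len(names))):
                # Within a diagonal block, start at i + 1 to skip self and repeats.
                for j in range(max(s2, i + 1), min(s2 + batch_size, len(names))):
                    a, b = names[i], names[j]
                    block[(a, b)] = dissimilarity(states[a], states[b])
            if block:
                yield block


# Made-up template states for five contigs across four libraries.
states = {
    "c1": ["WW", "CC", "WW", "CC"],
    "c2": ["WW", "CC", "WW", "CC"],
    "c3": ["WW", "CC", "CC", "CC"],
    "c4": ["CC", "WW", "CC", "WW"],
    "c5": ["WW", None, "WW", "CC"],
}
blocks = list(batched_dissimilarities(states, batch_size=2))
merged = {pair: value for block in blocks for pair, value in block.items()}
print(len(blocks), len(merged))
```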

Genome Building modified by Mark_HIlls

Mark_HIlls — Wed, 03 Jul 2013 00:25:01 -0000

--- v1
+++ v2
@@ -0,0 +1,20 @@
+Introduction
+=============
+
+Typical Run
+===========
+
+    BAIT -A 2 -kv
+
+-A 2
+>The Assembly option triggers BAIT to specifically count contigs and attempt to order scaffolds correctly. This option bypasses most BAIT functions and simply calculates the frequency of Watson and Crick reads for each fragment for each library.
+
+Output Files
+============
+
+Future Updates
+==============
+
++A new version of this program is in beta, in which the software feeds in only 500 contigs at a time to overcome a bug where genomes with many fragments (>20,000) crash the program because all the data are stored in RAM. The new version computes dissimilarities in batches.
++The new version of this program takes a different approach to collating and ordering contigs. It first 'collapses' all clusters into primary linkage groups, then looks for dissimilarities to see whether any of the primary linkage groups are on the same chromosome but oriented in a different direction. This strategy involves less computing than trying to identify the orientation of each contig with respect to all other contigs, and should be more robust. After reorientation, primary linkage clusters will be 'uncollapsed' and the relative order of each contig will be computed both by heatmap clustering (using hclust) and by a travelling salesperson approach.
++A similar plotting function for completed genomes in which there are >100 orphan fragments is also planned.

Genome Building modified by Mark_HIlls

Mark_HIlls — Tue, 25 Jun 2013 17:43:35 -0000