BAIT Wiki

Software to help analyse Strand-Seq data

Status: Beta

Brought to you by: oneillkza, rabadger

Genome Building

Introduction

The common model organisms have been extensively sequenced and can be considered 'complete' (aside from misorientations and orphan fragments). However, for many other organisms a large number of early-stage builds are in varying stages of development. Generally, genome builds can be classified into three main categories:

Scaffold stage. Scaffold stage genomes tend to have many thousands of contigs which have yet to be built into full chromosomes. These fragments can be considered unplaced, or they may be structured into linkage groups.
Chromosome stage. Chromosome stage genomes have a chromosomal scaffold and a combination of unplaced and unlocalized contigs, where the former are contigs whose location is completely unknown, and the latter are contigs that have been mapped to a particular chromosome but not to a particular location. The contigs that are ordered into the chromosomal scaffold are often separated by unbridged gaps, and can be incorrectly oriented.
Complete. A complete genome build has mostly finished chromosomes with few sequence gaps and few orphan scaffolds. These genomes tend not to need rebuilding using this function of BAIT, but can benefit from mis-orientation and orphan fragment analysis.

BAIT differs from typical scaffolders in that it does not look for sequence overlap to order contigs. Instead, it uses strand inheritance as a signature. There are nevertheless parallels between BAIT and an assembler. A regular scaffolding algorithm searches each contig for a particular signature (sequence overlap), and any contigs with enough overlap are stitched together to form a supercontig. BAIT takes the inherited template strand as its signature. If 100 contigs make up chromosome 1, and the cell being sequenced has inherited both Watson templates for chr1, then all 100 contigs should be WW; if a second cell inherited both Crick templates for chr1, then all 100 contigs should be CC. In this way, all 100 contigs should always have the same state if they are derived from the same chromosome (i.e. their concordance will be 100%). In an organism with multiple chromosomes, each chromosome independently has a 25% chance of being WW, a 50% chance of being WC and a 25% chance of being CC. Two contigs chosen at random from different chromosomes will therefore agree only by chance: when their WW and CC calls are compared, they show the same inheritance pattern about 50% of the time (i.e. their concordance will be ~50%). By incorporating multiple libraries into the analysis, all the contigs that are present on the same chromosome will tend toward 100% concordance, forming a linkage group. Each chromosome should form its own linkage group, with the concordance within each group ~100% and the concordance between groups ~50%.
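The concordance argument above can be illustrated with a small simulation. This is a minimal sketch rather than BAIT's own code: the function names, the number of simulated libraries and the choice to compare only WW and CC calls (following the WC exclusion this page describes for orientation) are all assumptions made for illustration.

```python
import random

# Each chromosome is WW 25%, WC 50% and CC 25% of the time.
STATES = ["WW", "WC", "WC", "CC"]


def simulate_library(n_chromosomes, rng):
    """Draw one template state per chromosome for a single cell (library)."""
    return [rng.choice(STATES) for _ in range(n_chromosomes)]


def concordance(calls_a, calls_b):
    """Fraction of libraries in which two contigs agree.

    Assumption: only libraries where both contigs are WW or CC are counted.
    """
    informative = [(a, b) for a, b in zip(calls_a, calls_b)
                   if a != "WC" and b != "WC"]
    if not informative:
        return float("nan")
    return sum(a == b for a, b in informative) / len(informative)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
libraries = [simulate_library(2, rng) for _ in range(2000)]

# Hypothetical contigs: 1 and 2 lie on chromosome 0, contig 3 on chromosome 1.
contig_1 = [lib[0] for lib in libraries]
contig_2 = [lib[0] for lib in libraries]
contig_3 = [lib[1] for lib in libraries]

same_chromosome = concordance(contig_1, contig_2)       # always 1.0
different_chromosome = concordance(contig_1, contig_3)  # close to 0.5

print(same_chromosome, round(different_chromosome, 2))
```

With real data the within-group value would sit somewhat below 100% because of background reads and SCEs, which is why the text speaks of contigs tending toward 100% as libraries are added.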

Scaffolding software looks for identical overlapping sequences to stitch contigs together. It also looks for reverse-complement matches, as two contigs may be mis-oriented with respect to each other. In that case, one of the contigs is flipped so that the overlapping sequences are identical, and the two are then stitched together. BAIT uses a similar strategy. If a contig is mis-oriented with respect to its neighbour, its strand inheritance pattern will be reversed: WW will become CC and CC will become WW, but WC will remain WC. Once the WC calls are excluded, mis-oriented contigs will have the same inheritance pattern 0% of the time. Therefore, correctly oriented contigs from the same chromosome have 100% concordance, incorrectly oriented contigs have 0% concordance, and unrelated contigs have 50% concordance. Using this, BAIT can flip mis-oriented contigs and cluster them correctly.
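The orientation test described above can be sketched as follows. This is an illustrative sketch, not BAIT's implementation: the `orient` helper and its 0.2 and 0.8 cut-offs are hypothetical choices used only to separate the ~0%, ~50% and ~100% cases.

```python
# Reversing a contig swaps Watson and Crick; WC is unchanged.
FLIP = {"WW": "CC", "CC": "WW", "WC": "WC"}


def flip(calls):
    """Return the calls a contig would show if it were reverse-oriented."""
    return [FLIP.get(call, call) for call in calls]


def concordance(calls_a, calls_b):
    """Agreement between two contigs, excluding any library with a WC call."""
    informative = [(a, b) for a, b in zip(calls_a, calls_b)
                   if a != "WC" and b != "WC"]
    if not informative:
        return None
    return sum(a == b for a, b in informative) / len(informative)


def orient(reference, candidate, low=0.2, high=0.8):
    """Classify a candidate contig against a reference contig.

    The low/high cut-offs are assumptions for this sketch only.
    """
    score = concordance(reference, candidate)
    if score is None:
        return "unknown", candidate
    if score >= high:
        return "same", candidate
    if score <= low:
        return "flipped", flip(candidate)
    return "unlinked", candidate


# One call per library for a reference contig and a mis-oriented neighbour.
reference = ["WW", "CC", "WC", "CC", "WW", "WC"]
misoriented = flip(reference)

label, corrected = orient(reference, misoriented)
print(label, corrected == reference)
```

After flipping, the corrected calls match the reference in every informative library, so the contig can be clustered with its true neighbours.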

Once formed into linkage groups, contigs can be considered as going from 'unplaced' to 'localized', at least with respect to each other. They can be further hierarchically clustered using sister chromatid exchange (SCE) events, which reshuffle template strands within a particular library. For example, suppose there are 100 chr1 contigs and the analysis is performed on 50 libraries. If one library has an SCE between contigs 80 and 81, then contigs 1 to 80 will have 100% concordance with each other, and contigs 81 to 100 will have 100% concordance with each other, but contigs 1 to 80 will have only 98% concordance (49/50) with contigs 81 to 100. Without prior knowledge of contig order, it is therefore possible to infer distance from concordance. In this way, these analyses can be considered similar to genetic mapping using linkage analysis, where meiotic recombination is responsible for reshuffling a signature (minisatellites) and, assuming a constant rate of recombination, a genetic distance measured in centimorgans can be estimated. Here, mitotic recombination is responsible for reshuffling a signature (template state) and, assuming a constant rate, a distance can likewise be estimated.
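The 49/50 figure in the example above can be reproduced with a short worked calculation. This sketch makes several assumptions for illustration: template states are assigned to libraries in a fixed cycle, the SCE library starts as WW and becomes WC beyond the exchange, and concordance counts every library (including WC calls), which is how the 49/50 figure arises.

```python
N_CONTIGS = 100
N_LIBRARIES = 50
SCE_LIBRARY = 0        # hypothetical library carrying the exchange
SCE_AFTER_CONTIG = 80  # exchange lies between contigs 80 and 81 (1-based)


def state_for(contig, library):
    """Template state of a chr1 contig in one library."""
    base = ["WW", "WC", "CC", "WC"][library % 4]  # library 0 is WW
    if library == SCE_LIBRARY and contig > SCE_AFTER_CONTIG:
        return "WC"  # one template strand has been exchanged
    return base


calls = {
    contig: [state_for(contig, lib) for lib in range(N_LIBRARIES)]
    for contig in range(1, N_CONTIGS + 1)
}


def concordance(calls_a, calls_b):
    """Fraction of all libraries in which two contigs share a state."""
    return sum(a == b for a, b in zip(calls_a, calls_b)) / len(calls_a)


within_left = concordance(calls[1], calls[80])     # 50/50
within_right = concordance(calls[81], calls[100])  # 50/50
across = concordance(calls[1], calls[100])         # 49/50
dissimilarity = 1 - across                         # a 'mitotic distance'

print(within_left, within_right, across)
```

With more libraries and more SCEs, contigs that are further apart accumulate more discordant libraries, so the dissimilarity grows with physical distance.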

Typical Run

BAIT -A 2 -kv

-A 2

The Assembly option triggers BAIT to specifically count contigs and attempt to order scaffolds correctly. This option bypasses most BAIT functions and simply calculates the frequency of Watson and Crick reads for each fragment in each library. These data are then filtered in two directions. First, any library in which all fragments are WC (indicating unsuccessful Strand-seq) or NA (indicating a low-read library) is excluded. Second, any fragment in which all libraries are WC (indicating simple sequence in the fragment) or NA (indicating a hard-to-sequence or small fragment) is excluded. A further check of background is made by comparing the ratio of Watson to Crick reads in each library. This ratio should be 1.0 for WW, 0 for WC and -1 for CC. Background is measured by assessing the deviation away from those numbers, and any library with a background above 10% is excluded.
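The library filter described above can be sketched as follows. This is not BAIT's code. It assumes the ratio is the normalised difference (W - C) / (W + C), which yields the 1, 0 and -1 values quoted above, and that background is the mean deviation from the nearest expected value. The `tolerance` used for calling states is a hypothetical parameter; only the 10% background cut-off comes from the text.

```python
EXPECTED = (("WW", 1.0), ("WC", 0.0), ("CC", -1.0))


def strand_ratio(watson, crick):
    """Assumed ratio: (W - C) / (W + C); None when there are no reads."""
    total = watson + crick
    if total == 0:
        return None
    return (watson - crick) / total


def call_state(ratio, tolerance=0.2):
    """Call WW, WC or CC when the ratio is near an expected value, else NA."""
    if ratio is None:
        return "NA"
    for state, expected in EXPECTED:
        if abs(ratio - expected) <= tolerance:
            return state
    return "NA"


def library_background(ratios):
    """Mean deviation of each fragment's ratio from its nearest expected value."""
    deviations = [min(abs(r - e) for _, e in EXPECTED)
                  for r in ratios if r is not None]
    if not deviations:
        return None
    return sum(deviations) / len(deviations)


def keep_library(counts, max_background=0.10):
    """counts: one (watson_reads, crick_reads) pair per fragment."""
    ratios = [strand_ratio(w, c) for w, c in counts]
    states = [call_state(r) for r in ratios]
    # Assumption: a library with only WC and/or NA calls is uninformative.
    if all(state in ("WC", "NA") for state in states):
        return False
    background = library_background(ratios)
    return background is not None and background <= max_background


clean = [(98, 2), (3, 97), (50, 50), (0, 0)]  # low background
noisy = [(185, 15), (15, 185), (115, 85)]     # roughly 15% background
all_wc = [(50, 50), (48, 52)]                 # Strand-seq likely failed

print(keep_library(clean), keep_library(noisy), keep_library(all_wc))
```

The same pattern, applied across libraries for a single fragment, would give the second filtering direction described above.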

-k

The "keep" option keeps all intermediary files. Since the genome building pipeline is still in beta, it is recommended to use this option so that the time-consuming analysis is not lost in the unlikely event of a crash or bug.

BAIT pipeline for building scaffold-stage genomes

Output Files

Heatmap; global

The global heatmap gives an overall view of the clusters generated by the analysis. If Strand-seq has been successful, each cluster should represent fragments derived from the same chromosome. The fragment names are often too small to be read on the heatmap, but are printed separately into a table.

Sample heat map from mm9 contigs

Heatmap; per linkage group

Since scaffolds that are present on different chromosomes may still affect the order of scaffolds on the same chromosomes (by chance some fragments may be more concordant than others), BAIT splits the cluster tree into a discrete number of clusters and recomputes the order of fragments without the influence of other linkage groups. If it finds multiple sub-clusters it will further divide these linkage groups.
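The idea of splitting fragments into linkage groups and then re-examining each group in isolation can be sketched as follows. Note that this sketch swaps in a simpler technique than the cluster tree described above: it joins any two fragments whose dissimilarity falls below a threshold (single linkage via union-find). The function names, the 0.25 threshold and the example values are all hypothetical.

```python
def linkage_groups(names, dissimilarity, threshold=0.25):
    """Group fragments connected by pairwise dissimilarities below threshold."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing paths.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dissimilarity[a][b] < threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


def sub_matrix(group, dissimilarity):
    """Dissimilarities restricted to one group, ignoring all other groups."""
    return {a: {b: dissimilarity[a][b] for b in group} for a in group}


# Hypothetical fragments: f1/f2 are linked, f3/f4 are linked, the rest are not.
names = ["f1", "f2", "f3", "f4"]
linked = {("f1", "f2"): 0.02, ("f3", "f4"): 0.04}
dissimilarity = {
    a: {b: 0.0 if a == b else linked.get((a, b), linked.get((b, a), 0.5))
        for b in names}
    for a in names
}

groups = linkage_groups(names, dissimilarity)
per_group = [sub_matrix(group, dissimilarity) for group in groups]
print(groups)
```

Each restricted matrix can then be re-clustered on its own, so that chance concordance with fragments from other chromosomes no longer influences the order.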

Sub-clustered heat map for chrY

Table of fragment order

A table of all the fragments is generated for each linkage group. This is in the form of a BED file, with the fragment name, start and finish, direction, a dissimilarity value (a measure of 'mitotic distance') and the number of libraries in which the fragment is present. This file can be fed into the BAIT fastq generator to create a draft assembly based on BAIT predictions.

Table of orphans

A record is kept of all fragments that do not cluster with any other fragments, and these are printed as a table of orphan fragments. Once a genome has been built using the genome building function of BAIT, the orphan fragment localization function may be able to further refine the location of these fragments.

Future Updates

A new version of this program is in beta, in which the software feeds in only 500 contigs at a time. This overcomes a bug where genomes with a large number of fragments (>20,000) crash the program because all the data are stored in RAM. The new version computes dissimilarities in batches.
The new version of this program takes a different approach to collating and ordering contigs. It first 'collapses' all clusters into primary linkage groups, then looks at dissimilarities to see whether any of the primary linkage groups are on the same chromosome but oriented in a different direction. This strategy involves less computing than trying to identify the orientation of each contig with respect to all other contigs, and should be more robust. After reorientation, primary linkage clusters will be 'uncollapsed' and the relative order of each contig will be computed both by heatmap clustering (using hclust) and by a travelling salesperson approach.
A similar plotting function is also planned for completed genomes in which there are >100 orphan fragments.
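To give a feel for what ordering contigs with a travelling salesperson approach might involve, here is a minimal sketch using the nearest-neighbour heuristic. It is an assumption-laden illustration and not the planned BAIT implementation: the `greedy_order` function, the choice of starting contig and the example dissimilarities are all hypothetical.

```python
def greedy_order(names, dissimilarity):
    """Order contigs by repeatedly stepping to the nearest unvisited contig.

    Nearest-neighbour heuristic for a travelling salesperson path. It starts
    from a contig belonging to the most dissimilar pair, on the assumption
    that such a contig lies at one end of the chromosome.
    """
    start = max(names, key=lambda a: max(dissimilarity[a][b] for b in names))
    order = [start]
    remaining = set(names) - {start}
    while remaining:
        last = order[-1]
        # sorted() makes tie-breaking deterministic.
        nearest = min(sorted(remaining), key=lambda b: dissimilarity[last][b])
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Hypothetical contigs whose dissimilarity grows with their true separation.
true_order = ["c1", "c2", "c3", "c4"]
position = {name: index for index, name in enumerate(true_order)}
dissimilarity = {
    a: {b: 0.02 * abs(position[a] - position[b]) for b in true_order}
    for a in true_order
}

shuffled = ["c3", "c1", "c4", "c2"]
order = greedy_order(shuffled, dissimilarity)
print(order)
```

The recovered order may be the true order or its reverse, since dissimilarities alone cannot distinguish the two ends of a chromosome. A nearest-neighbour path is also only a heuristic and can be misled by noisy dissimilarities.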
