Sample Scripts - Personal Use Wiki

Collection of scripts for personal use

Status: Planning

Brought to you by: xpnc

Home

Authors:

Maximum Common Genome Phylogeny (MCGP) Analysis Tool

MCGP is a bioinformatics analysis tool written in Python that generates phylogenetic trees from the sequence variation present within the ‘core’, or conserved, genomic content of bacterial isolates. Constructing phylogenetic trees for bacterial isolates from whole genome sequences is usually challenging because of extensive lateral or horizontal gene transfer (LGT or HGT) between species. Such transfers can seriously bias evolutionary analyses because different regions of an isolate's genome may then have different evolutionary histories. To overcome this issue, it is recommended to construct phylogenetic trees from the genomic sequences that are conserved in all the isolates to be analysed, i.e. the core genome. This part of the genome is not greatly affected by the LGT or HGT events that reshuffle gene content, although small-scale legitimate genetic recombination can still occur between homologous sequences.
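As a rough illustration of what ‘core’ means here, the core genome can be viewed as the intersection of the gene content of every isolate in the dataset. The sketch below is not taken from MCGP itself: the function name `core_genome`, the input format (a dictionary of gene sets) and the isolate and gene labels are all assumptions chosen for the example.

```python
def core_genome(gene_content):
    """Return the genes present in every isolate.

    gene_content maps an isolate name to the set of gene identifiers
    found in that isolate (a hypothetical input format for this sketch).
    """
    if not gene_content:
        return set()
    return set.intersection(*(set(genes) for genes in gene_content.values()))


# Toy data: isolate labels and gene names are made up for illustration.
isolates = {
    "isolate_A": {"gyrB", "recA", "rpoB", "tetM"},
    "isolate_B": {"gyrB", "recA", "rpoB"},
    "isolate_C": {"gyrB", "recA", "rpoB", "ermB"},
}

core = core_genome(isolates)
# Genes found in at least one isolate but not in all of them.
accessory = set.union(*isolates.values()) - core

print(sorted(core))       # ['gyrB', 'recA', 'rpoB']
print(sorted(accessory))  # ['ermB', 'tetM']
```

In this toy dataset the genes carried by only some isolates fall outside the core and would be excluded from the tree-building step.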

Common approaches used to infer phylogenies from whole genome sequences include: (a) mapping nucleotide sequence reads against a closely related, completed reference sequence to produce consensus sequences that are then used to infer phylogenetic trees; (b) using a selected single locus, e.g. 16S rRNA, or multiple loci, e.g. the housekeeping genes used for Multilocus Sequence Typing (MLST) in Streptococcus pneumoniae and other related bacteria; (c) using the core genome content universally present in all the bacterial isolates in the dataset. The first approach has been widely used, especially for bacterial isolates within the same species, which have relatively low levels of divergence. However, when working with highly divergent isolates, e.g. representatives of a genus, the inferred phylogenies are not robust enough. The second approach is similar to the third, except that it uses far fewer conserved loci. Although this is very useful when attempting to infer evolutionary histories among a diverse set of bacterial species spanning multiple genera, it has very low discriminatory power when analysing isolates within a single species, i.e. it cannot reliably distinguish bacterial isolates at the serovar, serotype or pathotype level. This problem is overcome by using all the variation present in the core genome, i.e. approach (c). Although a wide range of tools exists for performing the individual steps involved in generating a phylogenetic tree from the ‘core’ genome, the process is usually very tedious because no simple pipeline exists that glues these computational tools together and generates useful statistics.
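To make "using all the variation present in the core genome" more concrete, the following minimal sketch finds the variable columns in a core-genome alignment and counts pairwise differences at those columns. It is an illustration under stated assumptions, not MCGP's actual method: the function names `variable_sites` and `pairwise_differences`, the choice to skip columns containing gaps or ambiguous bases, and the toy sequences are all hypothetical.

```python
def variable_sites(alignment):
    """Return 0-based column indices at which the aligned sequences differ.

    alignment maps an isolate name to its aligned core-genome sequence.
    Columns containing anything other than A, C, G or T (for example
    gaps or ambiguous bases) are skipped; this is an assumption of the
    sketch, not a documented MCGP behaviour.
    """
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    sites = []
    for i in range(length):
        column = {s[i] for s in seqs}
        if column <= set("ACGT") and len(column) > 1:
            sites.append(i)
    return sites


def pairwise_differences(alignment, sites):
    """Count differences between each pair of isolates at the given sites."""
    names = sorted(alignment)
    return {
        (a, b): sum(alignment[a][i].upper() != alignment[b][i].upper() for i in sites)
        for idx, a in enumerate(names)
        for b in names[idx + 1:]
    }


# Toy alignment: sequences are made up for illustration.
toy = {
    "isolate_A": "ATGCCGTA-A",
    "isolate_B": "ATGTCGTA-A",
    "isolate_C": "ATGTCGAACA",
}

sites = variable_sites(toy)
print(sites)                             # [3, 6]
print(pairwise_differences(toy, sites))
```

A difference count like this could be turned into a distance matrix for a distance-based tree method; in practice the variable sites would more likely be passed to a dedicated phylogenetics program.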

Here we present a novel tool, Maximum Common Genome Phylogeny (MCGP), that attempts to simplify the steps involved in constructing phylogenetic trees based on the core genome. The tool should be valuable both to biologists with limited computational skills and to advanced bioinformatics experts, who can incorporate the script into their own analysis pipelines.

Below are some useful links on how to run the program. Please leave any comments, suggestions and reviews to help improve the program (this is currently the first version).

Project Members:

cc (admin)