Phylogenetics
GeneSupport – Maximum Gene-Support Tree Approach to Inferring a Species Tree from Gene Trees
Yunfeng Shan1,* and Xiu-Qing Li2
1School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, ON N9B 3P4
2Molecular Genetics Laboratory, Potato Research Centre, Agriculture and Agri-Food Canada, 850 Lincoln Rd, P.O. Box 20280, Fredericton, New Brunswick, E3B 4Z7, Canada.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Summary: GeneSupport implements a genome-scale algorithm, the maximum gene-support tree approach, for estimating a species tree from gene trees inferred from multilocus sequences. It provides a new option for inferring a species tree from multiple genes. It is incorporated into the popular phylogenetic package PHYLIP, with the same usage and user interface. It is suitable for gene trees produced by Bayesian, maximum parsimony, maximum likelihood, neighbour-joining and other phylogenetic methods. Single-gene trees are first reconstructed separately with any of a variety of phylogenetic inference programs.
1 INTRODUCTION
Sequences from genes, proteins, and genomes are increasing rapidly with the progress of genome projects. Phylogeny reconstruction has started to use large data sets involving hundreds of genes, up to one thousand orthologous genes. Concatenation is used to infer molecular phylogenies from multilocus data with phylogenetic methods developed for single genes. It works in some cases ( ), but not in others (). This paper introduces a computer program, GeneSupport, a tool for estimating a species tree from gene trees by comparing many gene trees and computing the gene support of each unique gene tree. GeneSupport is implemented in C and is based on the maximum gene-support tree approach of Shan and Li (2008), who proposed it as an alternative way to evaluate the reliability of species phylogenies inferred from gene trees. The approach also reflects an intuitively obvious biological phenomenon: closely related species share similarities in a higher number of orthologous genes than distantly related species. An anonymous reviewer wrote in 2008: “I am very sure that many people working on phylogenetics will find it helpful, as an alternative to the mere concatenation of separate gene sequences”.
2 METHODS
Computational procedure: Distances between all pairs of trees are computed. Two trees with identical topology have a tree distance of zero, so the number of trees that share each unique topology can be counted. Distances are based on the widely known Symmetric Difference of Robinson and Foulds (1981). The Symmetric Difference ignores branch-length information and uses only the tree topologies. It is the minimum number of steps required to convert one tree into the other, that is, the number of branches that differ between the pair of trees (Robinson and Foulds, 1981). The Robinson and Foulds topological distance is an important and frequently used tool for comparing phylogenetic tree structures (Makarenkov and Leclerc, 1999; 2000), and it is widely used in the PHYLIP (Felsenstein, 1989) and PAUP* (Swofford, 2002) packages. We reused code from the PHYLIP program treedist, in particular the extensively tested function that computes the symmetric distance between trees, with the kind permission of Dr. Joseph Felsenstein. Input trees need not be fully bifurcating; they can contain multifurcations. With the Symmetric Difference, multifurcations can lead to distances that are odd numbers.
Restriction: However, one strong restriction must be noted. The trees should all have the same list of species. If you use one set of species in the first two trees, and another in the second two, the distances will be incorrect and will depend on the order of these pairs in the input tree file, in odd ways.
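The Symmetric Difference and the shared-species restriction can be illustrated with a minimal Python sketch. It is not the treedist code that GeneSupport reuses; the function names parse_newick, splits and symmetric_difference are hypothetical, and the sketch assumes simple Newick input with unquoted taxon names. Each tree is reduced to its set of non-trivial bipartitions, and the distance is the number of bipartitions found in exactly one of the two trees.

```python
import re

def parse_newick(text):
    """Parse a Newick string into nested lists of leaf names (topology only)."""
    s = re.sub(r":[^,();]*", "", text)       # drop branch lengths such as ":0.12"
    s = re.sub(r"\s+", "", s).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] != "(":                      # leaf: read its name
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            return s[start:pos]
        pos += 1                               # consume "("
        children = [node()]
        while s[pos] == ",":
            pos += 1
            children.append(node())
        pos += 1                               # consume ")"
        while pos < len(s) and s[pos] not in ",()":
            pos += 1                           # skip an internal label or support value
        return children

    return node()

def splits(tree):
    """Return (taxa, non-trivial bipartitions) of a tree treated as unrooted."""
    clades = []

    def walk(n):
        if isinstance(n, str):
            return frozenset([n])
        clade = frozenset().union(*[walk(child) for child in n])
        clades.append(clade)
        return clade

    taxa = walk(tree)
    ref = min(taxa)          # each split is stored as the side lacking this taxon
    found = set()
    for clade in clades:
        side = taxa - clade if ref in clade else clade
        if 1 < len(side) < len(taxa) - 1:      # ignore trivial splits
            found.add(side)
    return taxa, found

def symmetric_difference(newick_a, newick_b):
    """Number of bipartitions present in exactly one of the two trees."""
    taxa_a, splits_a = splits(parse_newick(newick_a))
    taxa_b, splits_b = splits(parse_newick(newick_b))
    if taxa_a != taxa_b:
        raise ValueError("trees must share the same list of species")
    return len(splits_a ^ splits_b)

print(symmetric_difference("((A,B),(C,D));", "((D,C),(B,A));"))  # 0, same topology
print(symmetric_difference("((A,B),(C,D));", "((A,C),(B,D));"))  # 2
print(symmetric_difference("((A,B),(C,D));", "(A,B,C,D);"))      # 1, multifurcation
```

The last call shows how a multifurcating input tree produces an odd distance. Trees with different species lists are rejected instead of returning a misleading value.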
Gene-support and maximum gene-support tree: The gene support of a unique topology is the number of genes that infer it, which equals the tree frequency when each gene tree is inferred from a single gene. Gene support is calculated for every unique gene tree in the phylogenetic reconstruction results. The maximum gene-support tree is defined as the unique tree inferred by the highest number of genes among all the gene trees generated. Users first infer the gene trees separately with popular phylogenetic methods such as maximum parsimony (MP), maximum likelihood (ML), Bayesian inference or neighbour-joining (NJ), using phylogenetic analysis packages such as PHYLIP, PAUP*, MrBayes and PAML.
The usage and interface of GeneSupport are similar to those of the PHYLIP package, for the convenience of users who have experience with PHYLIP.
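The definition of gene support can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the C implementation: each gene tree is assumed to have already been reduced to a hashable, order-independent topology key (here a frozenset of bipartitions), and the names gene_support, maximum_gene_support_trees and the example keys are hypothetical.

```python
from collections import Counter

def gene_support(topologies):
    """Count how many genes infer each unique topology, highest support first."""
    return Counter(topologies).most_common()

def maximum_gene_support_trees(topologies):
    """Return (trees, support); several trees are returned when supports tie."""
    ranked = gene_support(topologies)
    if not ranked:
        return [], 0
    best = ranked[0][1]
    return [tree for tree, count in ranked if count == best], best

# Hypothetical keys: one four-taxon topology per gene, written as its single split.
AB_CD = frozenset([frozenset("AB")])
AC_BD = frozenset([frozenset("AC")])
AD_BC = frozenset([frozenset("AD")])

genes = [AB_CD, AC_BD, AB_CD, AD_BC, AB_CD]      # five single-gene trees
trees, support = maximum_gene_support_trees(genes)
print(trees == [AB_CD], support)                 # True 3

tied = [AB_CD, AC_BD, AB_CD, AC_BD]              # no single best-supported tree
print(len(maximum_gene_support_trees(tied)[0]))  # 2
```

Returning every tied topology, not an arbitrary one, makes it visible when the sampled genes do not single out one tree.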
3 DEMONSTRATION
The GeneSupport analysis is demonstrated on a tetrapod-origin data set of 43 genes from 7- and 6-taxon sets (Supplementary Materials). As shown in Table 1, the gene supports of the four tree types were not evidently different, so the 43 genes could not resolve the phylogenetic relationships of the 7-taxon set, whichever phylogenetic method was used. We conclude that 43 genes do not reach the minimum number of genes required to infer the species tree from gene trees for the 7-taxon set.
Currently, the number of sampled genes seems to be arbitrary. When a reliable tree is not known, determining the minimum number of genes required is difficult. A bootstrap support of 100% does not mean that the branch is 100% correct: 100% bootstrap support may occur for an alternative branch (Phillips et al., 2004), and high bootstrap support does not necessarily signify ‘the truth’ (Soltis et al., 2004). When the maximum gene-support value is not evidently different from the others, as in the 7-taxon case, it can be recognized that the number of genes used does not meet the minimum requirement, and sequence data from more genes are required. This is the outstanding advantage of the maximum gene-support tree approach.
More recently, when sequence data for 1001 genes became available, the approach inferred the lungfish as the closest relative of land vertebrates, with the gene support of the maximum gene-support tree differing from that of the less supported trees at the 0.01 significance level (chi-square test) for both ML and Bayesian methods (Shan et al., 2014). Tree 1 (lungfish hypothesis) was the maximum gene-support tree among the 1001 ML single-gene trees (Fig. 2), with a gene support of 92. The second most supported tree was tree 3 (lungfish-coelacanth sister hypothesis), with a gene support of 59. The chi-square value is 11.836957, a significant difference in gene support between tree 1 and tree 3 at the 0.01 level. The third most supported tree was tree 2 (coelacanth hypothesis), with a gene support of 51. These results show that the lungfish is the closest living relative of land vertebrates, the hypothesis supported by the most genes among the 1001. Tree 1 was also inferred from the 1001 BL consensus gene trees built with the 50% majority rule, with a gene support of 39. Tree 2 had the second highest gene support, 25. The chi-square test again shows a significant difference in gene support between tree 1 and tree 2 at the 0.01 level (chi-square value: 7.84).
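The text does not state how the chi-square values were formed. As an assumption, the Python sketch below uses the single-term form (observed - expected)^2 / expected, with one of the two gene-support counts taken as the expected value; the function name chi_square_term is hypothetical. Under that reading the arithmetic reproduces both printed values, with 92 as the denominator for the ML trees and 25 for the BL trees. This is an observation about the numbers, not a statement of the authors' procedure.

```python
def chi_square_term(observed, expected):
    """Single-term chi-square statistic: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected

# Standard chi-square table value for 1 degree of freedom at alpha = 0.01.
CRITICAL_DF1_ALPHA_001 = 6.635

# Assumed pairing of counts; the source gives only the supports and the results.
ml = chi_square_term(59, 92)   # ML trees: tree 3 (59) against tree 1 (92)
bl = chi_square_term(39, 25)   # BL trees: tree 1 (39) against tree 2 (25)

print(round(ml, 6), round(bl, 2))                                   # 11.836957 7.84
print(ml > CRITICAL_DF1_ALPHA_001, bl > CRITICAL_DF1_ALPHA_001)     # True True
```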
Table 1. Gene supports for four tree types of 7 taxa inferred with three methods
           Type of Trees
Methods    Tree I    Tree II    Tree III    Tree IV
MP         2         2          2           2
ML         2         1          2           0
NJ         2         1          1           0
Notes: The 7 taxa are Mammal (M), Bird (B), Amphibian (A), Coelacanth (C), Lungfish (L), Ray-finned Fish (R), and Shark (S). + and * indicate chi-square test significance at P < 0.10 and P < 0.05, respectively, between the frequencies of tree II and tree I/III.
Three other demonstrations of this approach, on yeasts, plants and microorganisms, successfully identified the maximum gene-support trees as species trees (Shan and Li, 2008).
ACKNOWLEDGEMENTS
We thank Dr. J. Felsenstein for permitting the use of his publicly available code from the program treedist of the PHYLIP package.
REFERENCES
Felsenstein,J. (1989) PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics, 5, 164-166.
Felsenstein,J. (2004) Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts.
Makarenkov,V. and Leclerc,B. (1999) The fitting of a tree metric to a given dissimilarity with the weighted least squares criterion. Journal of Classification, 16, 3-26.
Makarenkov,V. and Leclerc,B. (2000) An optimal way to compare additive trees using circular orders. Journal of Computational Biology, 7, 731-744.
Phillips,M.J., Delsuc,F. and Penny,D. (2004) Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol., 21, 1455-1458.
Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Mathematical Biosciences, 53, 131-147.
Shan,Y. and Li,X.Q. (2008) Maximum Gene-Support Tree. Evolutionary Bioinformatics, 4, 181-191.
Soltis,D.E. et al. (2004) Genome-scale data, angiosperm relationships, and ending incongruence: a cautionary tale in phylogenetics. Trends Plant Sci., 9, 477-483.
Swofford,D.L. (2002) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4.0b10. Sinauer Associates, Sunderland, Massachusetts.