Files in src:
Name                     Modified     Size
TreeNode.java            2012-08-02   1.7 kB
SequenceFactory.java     2012-08-02   2.3 kB
TreeAnnotator.java       2012-08-02   9.7 kB
Sequence.java            2012-08-02   1.4 kB
NewickParser.java        2012-08-02   3.4 kB
PhyloTreePruner.java     2012-08-02   5.3 kB
SequenceFactory.class    2012-07-03   2.2 kB
TreeAnnotator.class      2012-07-03   6.4 kB
Sequence.class           2012-07-03   985 Bytes
README.txt               2012-07-03   6.0 kB
NewickParser.class       2012-07-03   3.3 kB
PhyloTreePruner.class    2012-07-03   5.3 kB
TreeNode.class           2012-07-03   1.2 kB
Totals: 13 items, 49.3 kB

PhyloTreePruner v. 1.0



Disclaimer:
PhyloTreePruner and the bundled scripts are distributed in the hope that they will be useful. However, we provide these tools WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.



Background:
In order to improve orthology determination for phylogenomics using a tree-based approach, we have developed PhyloTreePruner. The development of this software was inspired by Dunn et al. (2008) and Hejnol et al. (2009) who used a similar approach to refine orthology inferences achieved with a graph-based method. PhyloTreePruner starts with alignments of putatively homologous sequences (generated using any orthology inference method) and a Newick-format gene tree for each alignment (generated using any phylogenetic reconstruction method). The software then identifies subtrees and corresponding sequences that represent orthologs suitable for concatenation and species tree reconstruction. In order to demonstrate the utility of PhyloTreePruner for selection of orthologous groups of sequences, we assembled a dataset of protein-coding gene sequences derived from 11 taxa with completely sequenced genomes, made single-gene trees for each group, and applied the PhyloTreePruner pruning algorithm to these trees to identify and remove paralogous sequences (see the example_dataset folder).

PhyloTreePruner applies the following algorithm. First, nodes with support values below a user-selected cutoff are collapsed into polytomies. Next, the maximally inclusive subtree is identified and retained if it meets the following criteria: every taxon is represented by 0-1 sequences or, if >1 sequence is present for 1+ taxa, all sequences from each such taxon form a monophyletic clade or are part of the same polytomy. Notably, polytomies including >1 sequence from 2+ taxa are permitted. Preliminary runs of PhyloTreePruner on well-studied genes that are single-copy in most eukaryotes showed that this allowance decreased the number of sequences unnecessarily deleted because a weakly supported tree topology incorrectly recovered orthologs as paralogs. Putative paralogs (sequences falling outside of the maximally inclusive subtree identified above) are then deleted from the alignment. Where multiple sequences from the same taxon form a clade (in-paralogs) and are retained, all but the longest sequence are deleted ("u" option; see Usage). This feature can be disabled ("r" option) so that another program (e.g., SCaFoS; Roure et al. 2007) can select the ‘best’ sequence for each taxon using another metric (e.g., pairwise distance).
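
The first step of the algorithm (collapsing poorly supported nodes into polytomies) can be sketched in Java as follows. This is only an illustration under stated assumptions: SketchNode and CollapseSketch are hypothetical names invented for this example and are not the TreeNode or TreeAnnotator classes shipped with PhyloTreePruner, whose internals may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for illustration; not PhyloTreePruner's own TreeNode.
class SketchNode {
    final String label;      // leaf name, or null for internal nodes
    final double support;    // support value attached to this node
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String label, double support) {
        this.label = label;
        this.support = support;
    }

    boolean isLeaf() { return children.isEmpty(); }

    SketchNode add(SketchNode child) { children.add(child); return this; }
}

class CollapseSketch {
    // Replaces every internal child whose support is below the cutoff with
    // that child's own children, which turns the parent into a polytomy.
    static void collapse(SketchNode node, double cutoff) {
        List<SketchNode> result = new ArrayList<>();
        for (SketchNode child : node.children) {
            collapse(child, cutoff);                  // handle the subtree first
            if (!child.isLeaf() && child.support < cutoff) {
                result.addAll(child.children);        // lift grandchildren up
            } else {
                result.add(child);
            }
        }
        node.children.clear();
        node.children.addAll(result);
    }

    public static void main(String[] args) {
        // ((A,B)0.3,C) with a cutoff of 0.5 becomes the polytomy (A,B,C).
        SketchNode weak = new SketchNode(null, 0.3)
                .add(new SketchNode("A", 1.0))
                .add(new SketchNode("B", 1.0));
        SketchNode root = new SketchNode(null, 1.0)
                .add(weak)
                .add(new SketchNode("C", 1.0));
        collapse(root, 0.5);
        System.out.println(root.children.size());
    }
}
```

The subtree is processed before its parent so that a chain of weak nodes collapses into a single polytomy in one pass.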

PhyloTreePruner is implemented in Java SDK 6. It is freely available from sourceforge.net/projects/phylotreepruner/.


Installation:
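Copy all (seven) of the .class files to a single directory. We recommend /usr/local/bin/. Note that the java launcher locates classes through the class path rather than the shell PATH, so either run PhyloTreePruner from that directory or point java at it with the -cp option (for example, java -cp /usr/local/bin PhyloTreePruner followed by the usual arguments).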

Assuming you have already extracted the contents of the PhyloTreePruner.zip archive to a temporary location, simply type:
sudo cp *.class /usr/local/bin



Usage:
java PhyloTreePruner input_tree_file min_number_of_taxa input_fasta bootstrap_cutoff r/u
    #r = redundant, you can use SCaFoS to pick the best sequence for a given OTU
    #u = unique, pick the longest sequence for a given OTU

Example command:
java PhyloTreePruner 0001.tre 10 0001.fa 0.5 u
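
To illustrate what the "u" option is described as doing (keeping only the longest sequence for each OTU), here is a minimal Java sketch. LongestPerTaxonSketch is a hypothetical class written for this README, not part of PhyloTreePruner. It assumes that names follow the TAXON|ID convention and that length means the number of non-gap characters; how PhyloTreePruner itself measures length and breaks ties is not stated here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LongestPerTaxonSketch {
    // Counts residues, ignoring alignment gaps ('-'). Whether PhyloTreePruner
    // measures length this way is an assumption of this sketch.
    static int ungappedLength(String seq) {
        int n = 0;
        for (char c : seq.toCharArray()) {
            if (c != '-') {
                n++;
            }
        }
        return n;
    }

    // Input maps "TAXON|ID" to an aligned sequence; the result keeps one entry
    // per taxon. On a tie, the sequence seen first is kept.
    static Map<String, String> keepLongest(Map<String, String> alignment) {
        Map<String, String> bestName = new LinkedHashMap<>();  // taxon -> full name
        for (Map.Entry<String, String> e : alignment.entrySet()) {
            // Assumes every name contains a pipe separating taxon and ID.
            String taxon = e.getKey().substring(0, e.getKey().indexOf('|'));
            String current = bestName.get(taxon);
            if (current == null
                    || ungappedLength(e.getValue()) > ungappedLength(alignment.get(current))) {
                bestName.put(taxon, e.getKey());
            }
        }
        Map<String, String> kept = new LinkedHashMap<>();
        for (String name : bestName.values()) {
            kept.put(name, alignment.get(name));
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> aln = new LinkedHashMap<>();
        aln.put("Aedes|a1", "MK--L");
        aln.put("Aedes|a2", "MKAAL");
        aln.put("Culex|c1", "M---L");
        System.out.println(keepLongest(aln).keySet());
    }
}
```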

PhyloTreePruner has been tested on fasta files formatted like those in the OGs folder of the example dataset. Avoid spaces and non-alphanumeric characters other than underscores in headers (the pipe symbol may be used only to separate the two fields).
Input file fasta header format:
>TAXON_NAME|UNIQUE_SEQUENCE_ID
example header: >Meiomenia_swedmarki|Contig00001_Hsp90
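
A header in this format can be split into its two fields with a few lines of Java. The sketch below is illustrative only: HeaderSketch is a hypothetical helper, not code from PhyloTreePruner, and it checks just the two rules that are unambiguous (exactly one pipe, no whitespace) rather than every character restriction.

```java
class HeaderSketch {
    // Returns {taxon, sequenceId} for a header such as ">Taxon_name|Seq_id".
    static String[] parse(String header) {
        if (!header.startsWith(">")) {
            throw new IllegalArgumentException("Header must start with '>': " + header);
        }
        String body = header.substring(1);
        int pipe = body.indexOf('|');
        // Require exactly one pipe with text on both sides and no whitespace.
        if (pipe <= 0
                || pipe == body.length() - 1
                || pipe != body.lastIndexOf('|')
                || body.matches(".*\\s.*")) {
            throw new IllegalArgumentException(
                    "Expected >TAXON_NAME|UNIQUE_SEQUENCE_ID but got: " + header);
        }
        return new String[] { body.substring(0, pipe), body.substring(pipe + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parse(">Meiomenia_swedmarki|Contig00001_Hsp90");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```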
	
Tree files should be in Newick format with support values, such as the bipartition trees produced by RAxML or the trees produced by FastTree. Trees generated using other methods should also work as long as they are in Newick format.
example tree: (((Tribolium|974076.1:0.35482,(Tribolium|967226.1:0.15022,Tribolium|974050.1:0.14895)0.997:0.18337)0.994:0.14810,(Pediculus|009464-PA:0.50795,(Tribolium|966909.2:0.36446,(Tribolium|966820.2:0.33076,(Tribolium|968088.1:0.00751,Tribolium|968483.1:0.02263)1.000:0.27839)0.899:0.11795)1.000:0.33066)0.440:0.07992)0.911:0.09361,((Apis|394579.2:0.24895,Nasonia|001604903.1:0.29313)0.993:0.13683,(Aedes|AAEL000119-PA:0.41844,((Drosophila|FBpp0083935:0.34881,(Aedes|AAEL005062-PA:0.10814,Culex|CPIJ003656:0.04514)0.999:0.17177)0.953:0.06745,(Aedes|AAEL000127-PA:0.06643,Culex|CPIJ010716:0.11493)1.000:0.30672)0.544:0.05666)0.994:0.11491)0.614:0.03273,(((Daphnia|303557:0.46088,(Daphnia|311964:0.28357,(Daphnia|204305:0.21247,Daphnia|219439:0.29991)0.198:0.04739)1.000:0.34008)1.000:0.29920,((Ixodes|ISCW002964-PA:0.06051,Ixodes|ISCW018993-PA:0.05272)1.000:1.03567,(Bombyx|BGIBMGA003693-PA:0.44375,(Bombyx|BGIBMGA009675-PA:0.17554,Bombyx|BGIBMGA009676-PA:0.40442)1.000:0.29207)0.949:0.16772)0.697:0.10042)0.917:0.07372,(Acyrthosiphon|001948387.1:0.54787,Tribolium|973874.2:0.47141)0.605:0.05982)0.888:0.06349);
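
Before pruning, it can be useful to see how many sequences each taxon contributes to a tree like the one above. The Java sketch below counts them directly from the Newick string. NewickTaxaSketch is a hypothetical helper written for this README and is unrelated to the bundled NewickParser class. It assumes unquoted leaf labels in TAXON|ID form and relies on the fact that leaf labels follow "(" or "," while support values follow ")".

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NewickTaxaSketch {
    // A leaf label directly follows '(' or ','; support values follow ')'.
    private static final Pattern LEAF = Pattern.compile("[(,]([^(),:;]+)");

    // Returns the number of leaves per taxon, sorted by taxon name.
    static Map<String, Integer> sequencesPerTaxon(String newick) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = LEAF.matcher(newick);
        while (m.find()) {
            String label = m.group(1).trim();
            if (label.isEmpty()) {
                continue;                     // whitespace between tokens
            }
            int pipe = label.indexOf('|');
            String taxon = pipe < 0 ? label : label.substring(0, pipe);
            Integer old = counts.get(taxon);
            counts.put(taxon, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A small fragment taken from the example tree above.
        String tree = "((Apis|394579.2:0.24895,Nasonia|001604903.1:0.29313)0.993:0.13683,"
                + "(Aedes|AAEL000127-PA:0.06643,Culex|CPIJ010716:0.11493)1.000:0.30672);";
        System.out.println(sequencesPerTaxon(tree));
    }
}
```

Taxa with a count above 1 are the ones for which the monophyly or polytomy criteria described in the Background section come into play.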



Wrapper scripts:
Two shell wrapper scripts are provided to automate several steps involved in orthology refinement using PhyloTreePruner. These scripts assume you are starting with a folder of fasta files, each containing one group of unaligned, putatively homologous sequences generated using a program such as OrthoMCL or HaMStR. Each group is aligned using MAFFT, and each alignment is then trimmed using Gblocks. Finally, single-gene trees are generated for each group and PhyloTreePruner is used to further refine orthology inference. We urge users to examine these scripts carefully and ensure that the program-specific settings being employed are appropriate for their data. These scripts were written and tested in Ubuntu Linux but should work on most UNIX platforms.



Contact:
If you have any questions, comments, suggestions, or find any bugs, please do not hesitate to contact Kevin Kocot at kmkocot@auburn.edu.
Source: README.txt, updated 2012-07-03