Name                  Modified     Size
README                2013-01-31   3.3 kB
TopologyAnalysis.pl   2013-01-31   4.1 kB
RandomGeneration.pl   2013-01-31   4.0 kB
RandomTaxonomy.pl     2013-01-31   750 Bytes
ExtraOutput.pl        2013-01-31   1.9 kB
MIT-LICENSE           2013-01-31   1.1 kB
Totals: 6 items, 15.1 kB

NAME : RandomTaxonomy
AUTHORS : BARRE Aurélien (CBiB)
VERSION : 1.0
INSTALL : no installation procedure needed 
REQUIREMENT : requires perl v5.1, plus the Getopt::Long and List::Util (shuffle) modules
LICENSE : see license file
BUGS : report to abarre@u-bordeaux2.fr
----------------------------------------------


This program generates a random tree from an existing tree (typically a taxonomy such as Greengenes) by resampling. The global topology of the new tree (number of levels, number of nodes) is preserved.

INPUT

A fasta file in which each header line starts with the leaf id (numeric), followed by the complete lineage of the leaf. The lineage is ordered from root to leaves. A leaf can be linked to any level of the tree. Each level of the lineage is described as a pair (level name, node name); for example, k__bacteria is the node "bacteria" at the k level.

Example:
>14  k__Archae;  p__Euryarchaeota; c__Thermoplasmata; o__Thermoplasmatales; f__Aciduliprofundaceae;
lineage = Archae (level k) -> Euryarchaeota (level p) -> Thermoplasmata (level c) -> Thermoplasmatales (level o) -> Aciduliprofundaceae (level f)
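
To make the header format concrete, here is a minimal parsing sketch. It is written in Python for illustration only and is not taken from the Perl scripts. The function name parse_header is hypothetical, and the sketch assumes that the level name and the node name are always separated by a double underscore and that fields are separated by semicolons, as in the example above.

```python
def parse_header(header):
    """Split a header line into (leaf_id, [(level, node_name), ...])."""
    # Drop the leading '>' and separate the numeric id from the lineage.
    leaf_id, _, lineage_text = header.lstrip(">").strip().partition(" ")
    lineage = []
    for field in lineage_text.split(";"):
        field = field.strip()
        if not field:
            continue  # the trailing ';' leaves an empty field
        # Assumption: level and node name are joined by a double underscore.
        level, _, node = field.partition("__")
        lineage.append((level, node))
    return int(leaf_id), lineage


header = (">14  k__Archae;  p__Euryarchaeota; c__Thermoplasmata; "
          "o__Thermoplasmatales; f__Aciduliprofundaceae;")
leaf_id, lineage = parse_header(header)
print(leaf_id)
for level, node in lineage:
    print(level, node)
```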


COMPUTATION 

The goal is to generate a random tree from an existing tree. 

Step 1 (TopologyAnalysis.pl)
Analyse the topology of the input tree.
The fasta file is parsed to obtain an overview of the tree structure: the number of nodes at each level and the number of children of each of these nodes. The results for each level are stored in an individual file (.dat). The last level (leaves) is stored as a list of identifiers in a separate file.
The tree depth (number of levels) is determined and the list of levels is stored in a separate file.
At the end of the first step, the total number of nodes in the tree and the distribution of these nodes across the levels are known.
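
The kind of bookkeeping described in this step can be sketched as follows. This is an illustrative Python sketch, not the code of TopologyAnalysis.pl. The function name analyse_topology and the toy node names are hypothetical, and the sketch assumes that a node name is unique within its level and that lineages do not skip levels. Writing the .dat files is left out.

```python
from collections import defaultdict


def analyse_topology(lineages):
    """Return (levels, children) for a list of root-to-leaf lineages.

    levels   -- level names in root-to-leaf order
    children -- {level: {node: set of child node names at the next level}}
    """
    levels = []
    children = defaultdict(dict)
    for lineage in lineages:
        for depth, (level, node) in enumerate(lineage):
            if level not in levels:
                levels.append(level)
            kids = children[level].setdefault(node, set())
            # Record the next node in the lineage as a child, if there is one.
            if depth + 1 < len(lineage):
                kids.add(lineage[depth + 1][1])
    return levels, dict(children)


# Toy lineages (hypothetical names), each ordered from root to leaves.
lineages = [
    [("k", "A"), ("p", "A1"), ("c", "A1x")],
    [("k", "A"), ("p", "A2")],
    [("k", "B"), ("p", "B1"), ("c", "B1x")],
    [("k", "B"), ("p", "B1"), ("c", "B1y")],
]
levels, children = analyse_topology(lineages)
for level in levels:
    counts = sorted(len(kids) for kids in children[level].values())
    print(level, "nodes:", len(children[level]), "children per node:", counts)
```

The per-level list of child counts is the only information the generation step needs about the original topology.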

Step 2 (RandomGeneration.pl)
Generate the new tree
The tree is generated from root to leaves using the data from the previous step. In the resulting tree the number of nodes at each level is conserved (including the number of leaves) but the relations between nodes are different.
First, the highest level is attached to a new level (R level) containing a single node named "root". The levels are then processed from the highest to the lowest (leaves). The new tree is built by randomly sampling the distribution of the number of children. Consequently, at a given level of the newly generated tree, the number of nodes and the distribution of their numbers of children are similar to those of the original tree, but the links between nodes are different.
Leaves are generated last and can only be linked to terminal nodes (i.e., nodes without children).
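
One possible reading of this resampling is sketched below in Python. It is an assumption, not the algorithm of RandomGeneration.pl: the child counts observed at each level are shuffled and handed out to the nodes of the new tree, so the number of nodes per level and the multiset of child counts stay the same while the parent/child links change. The function name generate_tree and the generated node names (L0_0, L1_0, ...) are hypothetical, and attaching the leaf identifiers to terminal nodes is left out.

```python
import random


def generate_tree(child_counts, seed=None):
    """Build random parent -> child links from per-level child counts.

    child_counts -- one list per level, top to bottom; each list holds the
                    number of children of every node observed at that level.
    Returns a list of (parent, child) links starting from a single "root".
    """
    rng = random.Random(seed)
    # The highest level is attached to a new single node named "root".
    current = [f"L0_{i}" for i in range(len(child_counts[0]))]
    links = [("root", node) for node in current]
    for depth, counts in enumerate(child_counts):
        counts = counts[:]      # keep the caller's data untouched
        rng.shuffle(counts)     # same counts, assigned to different nodes
        next_level = []
        for parent, n_children in zip(current, counts):
            for _ in range(n_children):
                child = f"L{depth + 1}_{len(next_level)}"
                next_level.append(child)
                links.append((parent, child))
        current = next_level
    return links


# Toy input: 2 top-level nodes, 3 nodes at the next level, 3 at the last.
child_counts = [[2, 1], [1, 0, 2], [0, 0, 0]]
links = generate_tree(child_counts, seed=1)
for parent, child in links:
    print(parent, "->", child)
```

Because the counts at one level always sum to the number of nodes at the next level, shuffling them never changes the size of any level.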


OUTPUT

Two files are generated: a fasta file in the same format as the input and a file containing the list of links between nodes.
An extra output format is available using the script ExtraOutput.pl. This script needs the Taxonomy library of the Tango software and generates a Taxonomy object.


SYNOPSIS
perl RandomTaxonomy.pl --input=../../GGdb --name=test
This script first runs
perl TopologyAnalysis.pl --input=../../GGdb --name=test
and then
perl RandomGeneration.pl --name=test
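
The two-step sequence can also be driven from another program. The following Python sketch is only an illustration of the call order shown above. The helper names build_commands and run_pipeline are hypothetical, and it assumes that perl and the two scripts are reachable from the current directory.

```python
import subprocess


def build_commands(input_path, name):
    """Return the two command lines from the synopsis, in execution order."""
    return [
        ["perl", "TopologyAnalysis.pl", f"--input={input_path}", f"--name={name}"],
        ["perl", "RandomGeneration.pl", f"--name={name}"],
    ]


def run_pipeline(input_path, name):
    """Run both steps in order (requires perl and the scripts to be present)."""
    for command in build_commands(input_path, name):
        # check=True stops the sequence if a step exits with an error.
        subprocess.run(command, check=True)


# Print the commands instead of running them.
for command in build_commands("../../GGdb", "test"):
    print(" ".join(command))
```

Note that only the first step takes --input; the second step finds the results of the first through the shared --name value.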

The GGdb file is the GreenGenes fasta file available here
http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/current_GREENGENES_gg16S_unaligned.fasta.gz
Source: README, updated 2013-01-31