Re: [Gusdev-gusdev] GUS 3.5 Release


On Jan 26, 2005, at 3:22 PM, davila wrote:

> Could you share any further info/details on your CDAT initiative ?... 
> sure, tables should support (help to differentiate) gene trees from 
> species trees, (phylogenomics ?) etc.

I'll attempt to summarize CDAT (Character Data and Trees) for the group:

The basic premise of the CDAT schema is that you have a database of 
sequences (duh) and features on these sequences (using a very simple 
feature & feature_location schema, with the possibility of 
subfeatures).  Certain subsets of these sequences (colloquially named 
"Families") may be involved in one or more multiple sequence 
alignments (MSAs).  Each MSA is considered the result of a 
computational "experiment" (i.e. you might generate many different MSAs 
with a variety of approaches), and thus you may have many MSAs for a 
given Family (and a given sequence may participate in multiple 
Families).  Phylogenetic trees are then another layer of "experimental 
data" obtained from an MSA, which may generate many trees per MSA 
(think bootstrapping, in addition to alternative topology-estimation 
techniques).
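
To make the layering concrete, here is a minimal sketch of the 
Family -> MSA -> Tree hierarchy in Python.  The class names, field 
names and method labels are illustrative assumptions only; they are 
not the actual CDAT table or column definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """A located feature on a sequence; may carry nested subfeatures."""
    seq_id: str
    kind: str          # e.g. "domain", "intron" (hypothetical labels)
    start: int
    end: int
    subfeatures: List["Feature"] = field(default_factory=list)


@dataclass
class Tree:
    """One tree estimated from an MSA (one 'experiment' result)."""
    method: str        # hypothetical label for the estimation technique
    newick: str


@dataclass
class MSA:
    """One alignment 'experiment'; may yield many trees."""
    method: str        # hypothetical label for the alignment approach
    trees: List[Tree] = field(default_factory=list)


@dataclass
class Family:
    """A subset of sequences; may have many MSAs."""
    name: str
    member_ids: List[str] = field(default_factory=list)
    msas: List[MSA] = field(default_factory=list)


# One Family, two alternative alignments, several trees on the first one.
fam = Family("example_family", ["seq1", "seq2", "seq3"])
msa_a = MSA("approach_a")
msa_a.trees.append(Tree("topology_method_x", "((seq1,seq2),seq3);"))
msa_a.trees.append(Tree("bootstrap_replicate_1", "((seq1,seq3),seq2);"))
fam.msas.extend([msa_a, MSA("approach_b")])

# A sequence may participate in more than one Family.
other = Family("another_family", ["seq3", "seq4"])
shared = set(fam.member_ids) & set(other.member_ids)
```

The point of the sketch is only that MSAs hang off Families and trees 
hang off MSAs, each as a one-to-many relationship, while sequence 
membership in Families is many-to-many.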

CDAT spends much of its "schema space" dealing with various 
phylogenetic aspects of the multiple sequence alignment itself 
(represented as a state matrix, with either individual sequence 
characters or in fact any character/feature "state" in a given cell of 
the matrix, where columns of the matrix represent positions in the MSA, 
and rows of the matrix represent separate OTUs).  Interestingly, this 
arrangement allows ancestral sequence/character/feature states (in a 
probabilistic framework) to also be stored and queried.
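
As a rough illustration of the state matrix, the sketch below flattens 
an alignment into one row per (OTU, column) cell.  The names 
(StateCell, msa_to_cells, the prob field) are assumptions made for this 
example, not CDAT's real tables; the probability column is just one 
possible way to hold ancestral states "in a probabilistic framework".

```python
from typing import Dict, List, NamedTuple


class StateCell(NamedTuple):
    """One cell of the state matrix: row = OTU, column = MSA position."""
    msa_id: int
    otu: str
    column: int
    state: str
    prob: float = 1.0   # observed states are certain; ancestral ones need not be


def msa_to_cells(msa_id: int, aligned: Dict[str, str]) -> List[StateCell]:
    """Normalize an alignment (OTU name -> aligned string) into cells."""
    lengths = {len(row) for row in aligned.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned rows must have the same length")
    return [
        StateCell(msa_id, otu, col, state)
        for otu, row in aligned.items()
        for col, state in enumerate(row)
    ]


def column_states(cells: List[StateCell], column: int) -> Dict[str, str]:
    """Return the observed state of every OTU at one alignment column."""
    return {c.otu: c.state for c in cells if c.column == column and c.prob == 1.0}


cells = msa_to_cells(1, {"otu_a": "MK-L", "otu_b": "MKAL", "otu_c": "MR-L"})

# An ancestral node stored in the same shape, with two candidate states
# at column 1 and made-up illustrative probabilities.
ancestral = [
    StateCell(1, "node_1", 1, "K", 0.7),
    StateCell(1, "node_1", 1, "R", 0.3),
]
```

Because observed and ancestral states share one row shape, the same 
per-column query works for both; the price is one row per cell.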

CDAT does not deal with species trees as an experimental datatype; 
rather, the NCBI taxonomy tree is assumed to be fixed.  However, this 
could easily be extended using the same logic as above.

In short, this schema flavor allows one to perform (what I think are) 
very interesting analyses, including (but of course not limited to):

* site-specific analysis of positive/negative selection in the context 
of sequence features (domains, PROSITE motifs, introns, secondary 
structure predictions, etc.).
* ancestral feature state estimation (e.g. intron gain/loss events)
* measuring relationships between disparate sequence features (e.g. 
introns, domains, active sites, transcriptional binding sites, etc.), 
with the possibility of filtering these relative to various 
evolutionary properties (orthology/paralogy, mutational rates, 
taxonomic slices, etc.)

I must admit, there is a significant drawback to this schema: while 
powerful, the extreme normalization of character states yields 
thousands of rows for a given MSA (e.g. an MSA composed of 20 OTUs of 
500 character states yields 10,000 rows of state cells; any ancestral 
reconstruction will generate another 10,000 rows of data).  For tens of 
thousands of Families, this limits the utility of the database to only 
a very few (i.e. 1) experimental setups; conversely, for just a few 
thousand Families, it allows many alternative experiments to be 
considered.  So I tend to keep a separate instantiation of the database 
for each experimental study I perform.
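
The row-count arithmetic above can be sketched as a back-of-the-envelope 
calculation.  The function names are hypothetical, and the sketch 
assumes (following the figure quoted above) that each ancestral 
reconstruction adds about as many rows as the observed matrix itself.

```python
def rows_per_msa(n_otus: int, n_columns: int, n_reconstructions: int = 0) -> int:
    """State-cell rows for one MSA: observed cells plus one equally
    sized block per ancestral reconstruction (an assumption)."""
    observed = n_otus * n_columns
    return observed * (1 + n_reconstructions)


def total_rows(n_families: int, msas_per_family: int,
               n_otus: int, n_columns: int, n_reconstructions: int = 0) -> int:
    """Rows across all Families, assuming every MSA has the same shape."""
    return n_families * msas_per_family * rows_per_msa(
        n_otus, n_columns, n_reconstructions)


single = rows_per_msa(20, 500)              # the 20 OTU x 500 state example
with_ancestors = rows_per_msa(20, 500, 1)   # plus one ancestral reconstruction
many_families = total_rows(10_000, 1, 20, 500, 1)
few_families = total_rows(2_000, 5, 20, 500, 1)
```

Under these assumptions, 10,000 Families with a single experimental 
setup and 2,000 Families with five setups each land at the same total, 
which is the trade-off described above.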

-Aaron

--
Aaron J. Mackey, Ph.D.
Dept. of Biology, Goddard 212
University of Pennsylvania       email:  am...@pc...
415 S. University Avenue         office: 215-898-1205
Philadelphia, PA  19104-6017     fax:    215-746-6697