From: Aaron J. M. <am...@pc...> - 2005-01-27 02:07:29
On Jan 26, 2005, at 3:22 PM, davila wrote:

> Could you share any further info/details on your CDAT initiative ?...
> sure, tables should support (help to differentiate) gene trees from
> species trees, (phylogenomics ?) etc.

I'll attempt to summarize CDAT (Character Data and Trees) for the group:

The basic premise of the CDAT schema is that you have a database of sequences (duh) and features on these sequences (using a very simple feature & feature_location schema, with the possibility of subfeatures). Certain subsets of these sequences (colloquially named "Families") may be involved in one or more multiple sequence alignments. Each multiple sequence alignment is considered the result of a computational "experiment" (i.e. you might generate many different MSAs with a variety of approaches), and thus you may have many MSAs for a given Family (and a given sequence may participate in multiple Families).

Phylogenetic trees are then another layer of "experimental data" obtained from an MSA, and a single MSA may generate many trees (think bootstrapping, in addition to alternative topology-estimation techniques).

CDAT spends much of its "schema space" dealing with various phylogenetic aspects of the multiple sequence alignment itself. The MSA is represented as a state matrix, where columns of the matrix represent positions in the MSA, rows of the matrix represent separate OTUs, and a given cell holds either an individual sequence character or in fact any character/feature "state". Interestingly, this arrangement allows ancestral sequence/character/feature states (in a probabilistic framework) to also be stored and queried.

CDAT does not deal with species trees as an experimental datatype; rather, the NCBI taxonomy tree is assumed to be fixed. However, this could easily be extended using the same logic as above.
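The layering described above (sequences, Families, many MSAs per Family, many trees per MSA, one row per state-matrix cell) might be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module: every table name, column name, and method label below is hypothetical and is not taken from the actual CDAT schema, and features, subfeatures, and probabilistic ancestral states are left out.

```python
import sqlite3

# Hypothetical, heavily simplified tables; NOT the real CDAT schema.
SCHEMA = """
CREATE TABLE sequence (sequence_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE family   (family_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- A sequence may participate in multiple Families.
CREATE TABLE family_member (
    family_id   INTEGER NOT NULL REFERENCES family(family_id),
    sequence_id INTEGER NOT NULL REFERENCES sequence(sequence_id),
    PRIMARY KEY (family_id, sequence_id)
);
-- Each MSA is one computational "experiment" on a Family.
CREATE TABLE msa (
    msa_id    INTEGER PRIMARY KEY,
    family_id INTEGER NOT NULL REFERENCES family(family_id),
    method    TEXT NOT NULL
);
-- Each tree is a further "experiment" derived from one MSA.
CREATE TABLE tree (
    tree_id INTEGER PRIMARY KEY,
    msa_id  INTEGER NOT NULL REFERENCES msa(msa_id),
    method  TEXT NOT NULL
);
-- One row per cell of the state matrix (row = OTU, column = position).
CREATE TABLE state_cell (
    msa_id   INTEGER NOT NULL REFERENCES msa(msa_id),
    otu      INTEGER NOT NULL,
    position INTEGER NOT NULL,
    state    TEXT NOT NULL,
    PRIMARY KEY (msa_id, otu, position)
);
"""


def build_demo(n_otus=4, n_positions=6):
    """Create an in-memory database with one Family, two MSAs and three trees."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO family (family_id, name) VALUES (1, 'demo_family')")
    # Two alternative alignments of the same Family.
    conn.executemany(
        "INSERT INTO msa (msa_id, family_id, method) VALUES (?, 1, ?)",
        [(1, "aligner_A"), (2, "aligner_B")],
    )
    # Several trees estimated from the first alignment.
    conn.executemany(
        "INSERT INTO tree (msa_id, method) VALUES (1, ?)",
        [("bootstrap_%d" % i,) for i in range(3)],
    )
    # Fill the state matrix of the first alignment with placeholder gap states.
    cells = [(1, otu, pos, "-") for otu in range(n_otus) for pos in range(n_positions)]
    conn.executemany(
        "INSERT INTO state_cell (msa_id, otu, position, state) VALUES (?, ?, ?, ?)",
        cells,
    )
    conn.commit()
    return conn


def count_rows(conn, table):
    """Return the number of rows in one of the demo tables."""
    return conn.execute("SELECT COUNT(*) FROM %s" % table).fetchone()[0]
```

Calling build_demo() and then count_rows(conn, "state_cell") returns n_otus * n_positions, since the fully normalized matrix stores one row per cell.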
In short, this schema flavor allows one to perform (what I think are) very interesting analyses, including (but of course not limited to):

* site-specific analysis of positive/negative selection in the context of sequence features (domains, PROSITE motifs, introns, secondary structure predictions, etc.).

* ancestral feature state estimation (e.g. intron gain/loss events)

* measuring relationships between disparate sequence features (e.g. introns, domains, active sites, transcriptional binding sites, etc.), with the possibility of filtering these relative to various evolutionary properties (orthology/paralogy, mutational rates, taxonomic slices, etc.)

I must admit, there is a significant drawback to this schema: while powerful, the extreme normalization of character states yields thousands of rows for a given MSA (e.g. an MSA composed of 20 OTUs of 500 character states yields 10,000 rows of state cells; any ancestral reconstruction will generate another 10,000 rows of data). For tens of thousands of Families, this limits the utility of the database to only a very few (i.e. 1) experimental setups (but, conversely, for just a few thousand Families, it allows many alternative experiments to be considered). So I tend to have separate instantiations for each experimental study I perform.

-Aaron

--
Aaron J. Mackey, Ph.D.
Dept. of Biology, Goddard 212
University of Pennsylvania       email:  am...@pc...
415 S. University Avenue         office: 215-898-1205
Philadelphia, PA 19104-6017      fax:    215-746-6697