CDAO, The Comparative Data Analysis Ontology
Join the cdao-discuss mailing list
Download the most recent version of CDAO
CDAO stands for "Comparative Data Analysis Ontology", a formalization of concepts and relations relevant to evolutionary comparative analysis, such as phylogenetic trees, OTUs (operational taxonomic units) and compared characters (including molecular characters as well as other types). CDAO is being developed by scientists in biology, evolution, and computer science. In general, ontologies are designed to support formal or automated reasoning. Our aim in developing CDAO is to provide the language support for representing, and reasoning over, phylogenetic data and metadata. When CDAO is fully developed and supported by software, it will be possible for one researcher to represent information with conceptual richness, and for another researcher (or a computer) to access that same conceptual richness.
CDAO is an open-source initiative: the source code is available here (links above and below). The CDAO project is also open: we have a mailing list that is open to new subscribers; those who show that they can make useful contributions to CDAO are invited to join the list of developers. Please contact us (using the information on this page) if you have questions or comments about CDAO. We are especially interested to hear from projects that might use CDAO. Current projects that are using or experimenting with CDAO include NeXML, Phenoscape and MIAPA.
High-throughput genome sequencing and assembly techniques, together with new information resources such as structural proteomics, interactomics, transcriptome data from microarray analyses, or light microscopy images of living cells, have led to a rapid increase in the amount of biological data available. As a result, there now exists a vast array of heterogeneous data resources distributed over different Internet sites that cover genomic, cellular, structure, phenotype and other types of biologically relevant information. However, the accumulation of large-scale data is only an indispensable preliminary to the understanding of the principles and fundamental mechanisms of life. A critical stage in this understanding will be the comparative analysis of diverse sequences and the understanding of the evolutionary processes (duplication, loss, recombination…) involved, since they determine the sequence, the structure and the function of macromolecules and define, at the highest level, the biological complexity of organisms [1-3].
A primary goal is therefore to make inferences or gather clues, often based on a comparative approach in which patterns of similarities and differences reveal clues to function, e.g., the locations of genes and regulatory sites in the human genome can be inferred by comparing it with the chimp and mouse genomes [4, 5]. In this context, evolutionary theory provides a powerful framework for performing comparative analyses, for propagating information between different systems and for predicting or inferring new knowledge [6]. Unlike heuristic machine-learning approaches borrowed from computer science, evolutionary analysis treats the items to be compared as homologs that have evolved along paths of common descent (in a tree or sometimes, a network) according to dynamics that reflect evolutionary genetics. Ideally, this framework converts questions about interpreting similarities and differences into theoretically well-posed questions about evolutionary transitions in the states of characters along the branches of a tree. Evolutionary comparative analysis has a long and colorful history, rooted in the early efforts of "numerical taxonomists" and "cladists" (working on the problems of organismal classification) to replace personal judgment with rigorous principles (reviewed in [7]). Today, evolutionary-based inference systems are playing an increasingly important role in most areas of high-throughput genomics, from genome structural and functional annotations to studies of promoters (phylogenetic footprinting) [8], interactomes (based on the presence and degree of conservation of interacting proteins) [9], and in comparisons of transcriptomes or proteomes (phylogenetic proximity and co-regulation/co-expression) [10].
While powerful tools exist for some applications of evolutionary analysis, they remain under-utilized because the lack of an appropriate informatics infrastructure makes evolutionary approaches relatively inaccessible and difficult to use. The traditional approach to software for evolutionary studies relies on independent task-specific services and applications, using different input and output formats, often idiosyncratic, and frequently not designed to inter-operate. When an individual user is curating and analyzing a small data set, it may suffice to store the data in a text file and to maintain a lab notebook describing the analysis. When dozens of individuals at different institutions are working together to annotate a genome, or to assemble the tree of life, it becomes crucial to employ standards, automation, traceability, validation, and so on.
In the fall of 2006, the National Evolutionary Synthesis Center (NESCent), sponsored by the National Science Foundation, funded a working group in Evolutionary Informatics (EvoInfo: https://www.nescent.org/wg_evoinfo/), with a mandate to address the infrastructural needs of evolutionary biologists, emphasizing technologies to address the particular needs that emerge when large, diverse, or dispersed sets of data are to be analyzed. The main focus of the EvoInfo working group is to promote interoperability amongst evolutionary analysis software. An operational definition of interoperability is that interoperability exists when objectives can be achieved automatically, using human labor or "mind work" only where absolutely essential (e.g. for tasks that cannot or should not be automated, like fighting fires or answering the phone). The EvoInfo working group is therefore developing community cohesion on issues of standards and interoperability, and facilitating (directly and indirectly) development of interoperable software and data standards.
Software interoperability is currently hindered by syntactic differences in the file formats used by the different applications and by semantic differences, such as naming conventions and terminology. To resolve such discrepancies, formal, structured vocabularies or ontologies are now being introduced in many domains, to constrain the use and interpretation of the terminology employed. These structured depictions or models of known and accepted facts are being built today to make a number of applications more capable of handling complex and disparate information. Ontologies provide an ideal means of representing the fundamental concepts in a domain and the relationships that exist between them and are used for example in artificial intelligence, the semantic web, and software engineering as a form of knowledge representation about the world or some part of it.
The complexity of the molecular biology domain makes the modeling, handling and exchange of data very difficult, and in recent years the utility of ontologies has been clearly demonstrated for the organization and management of biological knowledge [11]. They ensure compatibility between different data resources and software applications and can also be used as a query model for information management systems that include automated inference and reasoning. The best-known biological ontology is the Gene Ontology (GO) [12], which has become the de facto standard for describing the principal attributes (the molecular function, biological process, and cellular component) of knowledge about gene products. GO is part of an umbrella project, called Open Biomedical Ontologies (http://obo.sourceforge.net/), whose goal is to provide a set of compatible ontologies, which can be used in combination in order to integrate individual data resources into a coherent whole. The ontologies grouped together at the OBO web site cover a wide range of biomedical fields, such as specific organism anatomies, phenotype characters, taxonomic classifications or transcriptomic and proteomic experimental protocols and data. A number of ontologies have also been developed that address particular aspects of molecular sequences, such as gene structure (SO) [13], RNA (RnaO) [14], proteins (PRO) [15] or multiple alignments (MAO) [16]. But a general-purpose ontology for molecular or non-molecular evolution and for comparative data analyses based on evolutionary concepts does not yet exist.
CDAO (Comparative Data Analysis Ontology) provides extensive conceptual coverage, guaranteeing the ability to express concepts encountered in most evolutionary analyses and appearing in the most widely used data formats and application interfaces.
Representing evolutionary concepts, an overview
Technically speaking, the Comparative Data Analysis Ontology (CDAO) is intended to provide a framework for understanding data in the context of evolutionary-comparative analysis. This comparative approach is commonly used in bioinformatics and other areas of biology to draw inferences from a comparison of differently evolved versions of something, such as a protein. The entities to be compared, typically called 'OTUs' (Operational Taxonomic Units), may represent biological species, or entities drawn from higher or lower in a biological hierarchy, anywhere from molecules to communities. The features to be compared among OTUs are rendered in an entity-attribute-value model sometimes referred to as the 'character-state data model'. For a given character, such as 'beak length' (or 'position 20 of alpha hemoglobin'), each OTU has a state, such as 'short' (or 'Alanine'). The differences between states are understood to emerge by a historical process of evolutionary transitions in state, represented by a model (or rules) of transitions along with a phylogenetic tree. CDAO provides the framework for representing OTUs, trees, transformations, and characters. The representation of characters may depend on imported ontologies, e.g., character-states for amino acid characters are based on an imported ontology of amino acids.
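The character-state data model described above can be pictured as a small entity-attribute-value table. The following Python sketch is only an illustration under our own assumptions: the class names `CharacterStateDatum` and `CharacterStateMatrix`, the method names, and the OTU labels `finch_A` and `finch_B` are hypothetical and are not part of CDAO or of any CDAO software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterStateDatum:
    """One cell of the matrix: the state of one character in one OTU."""
    otu: str
    character: str
    state: str


class CharacterStateMatrix:
    """Hypothetical container mapping (OTU, character) pairs to a datum."""

    def __init__(self):
        self._cells = {}  # (otu, character) -> CharacterStateDatum

    def add(self, otu, character, state):
        self._cells[(otu, character)] = CharacterStateDatum(otu, character, state)

    def state_of(self, otu, character):
        datum = self._cells.get((otu, character))
        return datum.state if datum is not None else None

    def differing_characters(self, otu_a, otu_b):
        """Characters scored for both OTUs whose states differ."""
        characters = {character for _, character in self._cells}
        differing = []
        for character in sorted(characters):
            state_a = self.state_of(otu_a, character)
            state_b = self.state_of(otu_b, character)
            if state_a is not None and state_b is not None and state_a != state_b:
                differing.append(character)
        return differing


# Illustrative data only (the OTU names are made up).
matrix = CharacterStateMatrix()
matrix.add("finch_A", "beak length", "short")
matrix.add("finch_B", "beak length", "long")
matrix.add("finch_A", "position 20 of alpha hemoglobin", "Alanine")
matrix.add("finch_B", "position 20 of alpha hemoglobin", "Alanine")

print(matrix.differing_characters("finch_A", "finch_B"))  # ['beak length']
```

In an evolutionary analysis, the characters reported as differing are the ones whose states would be explained by transitions along the branches of a tree.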
Description of some core concepts
- Character State Data Matrix: Taxonomic Unit, Character, Character State Datum, State, Coordinate
- Topology: Network, Rooted Tree, Edge, Node
- Transformation: Branch Transformation, Ancestor State, Derived State
Description of the main CDAO concepts and relations useful for annotating evolutionary analyses and phylogenetic trees. Obtained from Prosdocimi et al., 2009 (see below for the complete reference).
Ontologies in Molecular Biology and the relevance of CDAO
Biomedical ontologies have been used extensively in recent years to describe biological knowledge by defining the main concepts that form a standard vocabulary for a specific area of interest. Given the success of this approach in describing the massive amount of data from modern biology in a controlled fashion, as shown mainly by the wide use of the Gene Ontology in genomics, an umbrella resource for biomedical ontologies was recently developed. The OBO Foundry contains dozens of ontologies and provides terms for a variety of biological phenomena, such as gene functions, anatomical parts of animals and plants, disease phenotypes, taxonomy of organisms, cell development and so on.
However, considering the unquestionable relevance of evolutionary theory in biology, and remembering Theodosius Dobzhansky's classic quote, "Nothing in biology makes sense except in the light of evolution", it is surprising to find that there are no specific formal ontologies describing terms for the evolution of molecules and organisms. Evolutionary biology provides a picture of the history of events that allows researchers to better understand the ultimate causes of given phenotypes, and it also permits them to make bona fide functional inferences, helping in the understanding of both the structure and the function of biological information. The absence of an evolutionary ontology may be explained by the fact that traditional approaches to software design for evolutionary studies rely on independent task-specific services and applications, using different input and output formats, often idiosyncratic and frequently not designed to inter-operate. The current effort is part of the Evolutionary Informatics Working Group at NESCent, in which we try to understand these problems and provide solutions such as the development of the present ontology.
For this reason, an ontology founded on evolutionary terms should provide the basis for a standardized and organized data representation, allowing better automatic processing of data and facilitating information retrieval and the propagation, among represented instances, of similar properties derived from common descent. CDAO (Comparative Data Analysis Ontology), presented here, provides extensive conceptual coverage, expressing concepts encountered in most evolutionary analyses and appearing in the most widely used data formats and application interfaces.
Description of some evolutionary concepts in CDAO
An example of CDAO annotations for a rooted tree. Once instantiated, concepts can be searched and new information can be propagated in trees.
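To make the idea of searching and propagating annotations in a rooted tree more concrete, here is a minimal Python sketch. It is not CDAO's API: the `Node` class, the `find` and `propagate` helpers, the node labels, and the "function" annotation are all hypothetical. Copying an annotation from an ancestor to its descendants is a deliberate simplification of what a real reasoner would do.

```python
class Node:
    """A node in a rooted tree; the edge to its parent is implicit."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        self.annotations = {}
        if parent is not None:
            parent.children.append(self)

    def descendants(self):
        """Yield every node below this one, depth first."""
        for child in self.children:
            yield child
            yield from child.descendants()


def find(root, key, value):
    """Search the whole tree for nodes carrying a given annotation."""
    return [n for n in [root, *root.descendants()] if n.annotations.get(key) == value]


def propagate(node, key):
    """Copy an annotation from a node to descendants that lack it."""
    value = node.annotations[key]
    updated = []
    for descendant in node.descendants():
        if key not in descendant.annotations:
            descendant.annotations[key] = value
            updated.append(descendant.label)
    return updated


# A small rooted tree: ((OTU_A, OTU_B) ancestor_1, OTU_C) root
root = Node("root")
ancestor_1 = Node("ancestor_1", root)
otu_a = Node("OTU_A", ancestor_1)
otu_b = Node("OTU_B", ancestor_1)
otu_c = Node("OTU_C", root)

# Annotate an internal node, then propagate to its clade (illustrative value).
ancestor_1.annotations["function"] = "oxygen transport"
updated = propagate(ancestor_1, "function")
hits = [n.label for n in find(root, "function", "oxygen transport")]

print(updated)  # ['OTU_A', 'OTU_B']
print(hits)     # ['ancestor_1', 'OTU_A', 'OTU_B']
```

OTU_C lies outside the annotated clade, so it receives no annotation and is not returned by the search.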
High-level classes and relations
The current version of CDAO is a kind of free-floating ontology not linked into an "upper" ontology. The reason for this is that, when we first began to develop CDAO, we did not understand where the concepts and relations would fit in terms of available upper ontologies. We're still not entirely sure. The wiki page on UpperOntology is an attempt to introduce some of the relevant issues.
Glossary of evolutionary terms
A glossary of significant evolutionary concepts was developed by a number of authors from the NESCent Evolutionary Informatics group. CDAO inherits this conceptual formalization in its ontology terms.
A number of standard cases of evolutionary analysis were developed, and they have been used to evaluate the ontology.
References
- [1] Wolfe KH, Li WH: Molecular evolution meets the genomics revolution. Nat Genet 2003, 33 Suppl:255-265.
- [2] Doolittle RF: Evolutionary aspects of whole-genome biology. Curr Opin Struct Biol 2005, 15(3):248-253.
- [3] Koonin EV, Wolf YI: Evolutionary systems biology: links between gene evolution and function. Curr Opin Biotechnol 2006, 17(5):481-487.
- [4] Clark AG, Glanowski S, Nielsen R, Thomas PD, Kejariwal A, Todd MA, Tanenbaum DM, Civello D, Lu F, Murphy B et al: Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 2003, 302(5652):1960-1963.
- [5] Prabhakar S, Noonan JP, Paabo S, Rubin EM: Accelerated evolution of conserved noncoding sequences in humans. Science 2006, 314(5800):786.
- [6] Eisen JA: Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 1998, 8(3):163-167.
- [7] Holder M, Lewis PO: Phylogeny estimation: traditional and Bayesian approaches. Nat Rev Genet 2003, 4(4):275-284.
- [8] Zhang Z, Gerstein M: Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements. J Biol 2003, 2(2):11.
- [9] Skrabanek L, Saini HK, Bader GD, Enright AJ: Computational prediction of protein-protein interactions. Mol Biotechnol 2008, 38(1):1-17.
- [10] Snel B, van Noort V, Huynen MA: Gene co-regulation is highly conserved in the evolution of eukaryotes and prokaryotes. Nucleic Acids Res 2004, 32(16):4725-4731.
- [11] Bard JB, Rhee SY: Ontologies in biology: design, applications and future challenges. Nat Rev Genet 2004, 5(3):213-222.
- [12] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
- [13] Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M: The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol 2005, 6(5):R44.
- [14] Leontis NB, Altman RB, Berman HM, Brenner SE, Brown JW, Engelke DR, Harvey SC, Holbrook SR, Jossinet F, Lewis SE et al: The RNA Ontology Consortium: an open invitation to the RNA community. RNA 2006, 12(4):533-541.
- [15] Natale DA, Arighi CN, Barker WC, Blake J, Chang TC, Hu Z, Liu H, Smith B, Wu CH: Framework for a protein ontology. BMC Bioinformatics 2007, 8 Suppl 9:S1.
- [16] Thompson JD, Holbrook SR, Katoh K, Koehl P, Moras D, Westhof E, Poch O: MAO: a Multiple Alignment Ontology for nucleic acid and protein sequences. Nucleic Acids Res 2005, 33(13):4164-4171.
Design and implementation
OWL - From Wikipedia: The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies, and is endorsed by the World Wide Web Consortium. This family of languages is based on two (largely, but not entirely, compatible) semantics: OWL DL and OWL Lite semantics are based on Description Logics, which have attractive and well-understood computational properties, while OWL Full uses a novel semantic model intended to provide compatibility with RDF Schema. OWL ontologies are most commonly serialized using RDF/XML syntax. OWL is considered one of the fundamental technologies underpinning the Semantic Web, and has attracted both academic and commercial interest.
The data described by an OWL ontology is interpreted as a set of "individuals" and a set of "property assertions" which relate these individuals to each other. An OWL ontology consists of a set of axioms which place constraints on sets of individuals (called "classes") and the types of relationships permitted between them. These axioms provide semantics by allowing systems to infer additional information based on the data explicitly provided. For example, an ontology describing families might include axioms stating that a "hasMother" property is only present between two individuals when "hasParent" is also present, and that individuals of class "HasTypeOBlood" are never related via "hasParent" to members of the "HasTypeABBlood" class. If it is stated that the individual Harriet is related via "hasMother" to the individual Sue, and that Harriet is a member of the "HasTypeOBlood" class, then it can be inferred that Sue is not a member of "HasTypeABBlood". A full introduction to the expressive power of the OWL language(s) is provided in the W3C's OWL Guide (more information is available on Wikipedia).
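The family example can be traced by hand with a few lines of Python. This is a toy sketch of the two inference steps only, not an OWL reasoner and not how OWL tools are implemented. The data structures `SUBPROPERTY_OF` and `EXCLUSIONS` and both function names are our own assumptions, introduced for illustration.

```python
# Property assertions stated explicitly: (subject, property, object).
assertions = {("Harriet", "hasMother", "Sue")}

# Class membership stated explicitly.
class_members = {"HasTypeOBlood": {"Harriet"}, "HasTypeABBlood": set()}

# Axiom 1: wherever hasMother holds, hasParent also holds.
SUBPROPERTY_OF = {"hasMother": "hasParent"}

# Axiom 2: a member of the first class is never related, via the given
# property, to a member of the second class.
EXCLUSIONS = [("HasTypeOBlood", "hasParent", "HasTypeABBlood")]


def close_under_subproperties(triples):
    """Add every assertion implied by the sub-property axioms."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for subject, prop, obj in list(inferred):
            parent_prop = SUBPROPERTY_OF.get(prop)
            if parent_prop and (subject, parent_prop, obj) not in inferred:
                inferred.add((subject, parent_prop, obj))
                changed = True
    return inferred


def excluded_memberships(triples, members):
    """Return (individual, class) pairs the individual cannot belong to."""
    excluded = set()
    for subject, prop, obj in triples:
        for subject_class, excl_prop, object_class in EXCLUSIONS:
            if prop == excl_prop and subject in members.get(subject_class, set()):
                excluded.add((obj, object_class))
    return excluded


all_triples = close_under_subproperties(assertions)
not_member = excluded_memberships(all_triples, class_members)

print(sorted(all_triples))
# [('Harriet', 'hasMother', 'Sue'), ('Harriet', 'hasParent', 'Sue')]
print(sorted(not_member))
# [('Sue', 'HasTypeABBlood')]
```

The second result is the inference from the text: Sue cannot be a member of "HasTypeABBlood", even though nothing about Sue's blood type was stated directly.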
Although Protégé is still in an initial alpha version, we have chosen it as the most appropriate software to use. The main considerations that led us to use Protégé were:
- It is user-friendly and easy to learn;
- It has a large user base and active developers; many of our concerns are being discussed by others on the p4-feedback and protege-owl discussion lists;
- Bugs have been fixed and new versions have been published;
- The alternatives are out of date or more difficult to use.
Links and publications of interest
- Ontologies General
- Ontology software
- Protégé : the free, open source ontology editor and knowledge-base framework chosen by us to develop CDAO.
- Swoop 
- Doddle 
- Altova SemanticWorks : A commercial product: good visualization of ontologies (including properties).
- Jena : an ontology builder API, not a stand-alone editor
- Alphaworks 
- Sofa 
- SWeDe 
- CmapTools 
- Publications of interest
- Smith B, et al. Relations in biomedical ontologies. Genome Biol 2005, 6:R46
- Eisen JA. Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis. Genome Res 1998, 8(3):163-167
- Bard JB and Rhee SY. Ontologies in biology: design, applications and future challenges. Nat Rev Genet 2004, 5:213-222
- Thompson JD et al. MAO: a Multiple Alignment Ontology for nucleic acid and protein sequences. Nucleic Acids Res 2005, 33(13):4164-4171
- Hladish T et al. Bio::NEXUS: a Perl API for the NEXUS format for comparative biological data. BMC Bioinformatics 2007, 8:191
CDAO Download Options
- Main release: http://purl.obolibrary.org/obo/cdao.owl
- Mapping from original identifier scheme to OBO-style identifiers
CDAO published information
- Prosdocimi F, Chisham B, Pontelli E, Thompson JD, Stoltzfus A. Initial implementation of a comparative data analysis ontology. Evol Bioinform Online. 2009 Jul 3;5:47-66. PubMed PMID: 19812726; PubMed Central PMCID: PMC2747124. 
Book chapters published
- Prosdocimi F, Chisham B, Pontelli E, Stoltzfus A, Thompson J. Knowledge standardization in evolutionary biology: the Comparative Data Analysis Ontology (CDAO). In: Pontarotti P (ed.), Evolutionary Biology: Concept, Modeling and Application. Berlin: Springer, 2009, p. 195-214.
- Prosdocimi F, Chisham B, Pontelli E, Thompson JD, Stoltzfus A. Framework for a comparative data analysis ontology (C-DAO). 12th EBM - Evolutionary Biology Meeting at Marseilles. Marseilles, France. September, 2008.
- Chisham B, Prosdocimi F, Pontelli E, Thompson JD, Stoltzfus A. Framework for a comparative data analysis ontology. Evolution 2008. Minneapolis, Minnesota, USA. June, 2008.
- Prosdocimi F, Chisham B, Pontelli E, Thompson JD, Stoltzfus A. CDAO: an evolutionary framework for comparative data analysis. DILS 2008 - Data Integration in the Life Sciences 2008. Paris (Evry), France. June, 2008.
- Prosdocimi F, Chisham B, Pontelli E, Thompson JD, Stoltzfus A. CDAO: a semantic web-ontology language to represent evolutionary biology data and analysis. X-meeting 2009 (the Brazilian conference on bioinformatics). Angra dos Reis (Rio de Janeiro). November, 2009.
The development of an ontology is work that must be done by a scientific research community. This community needs to approve the concepts provided, understand their meanings and scope, and be comfortable using them in its own research. If you would like to collaborate in the elaboration of CDAO by suggesting new concepts, new relations between classes or anything else, do not hesitate to contact us by e-mail. Moreover, if you have interesting ideas, suggestions and/or criticisms about representing evolutionary biology terms using a formal vocabulary, please let us know. Our e-mail addresses are listed below.
We want to know you
If you have been using, testing and/or working with CDAO for any specific application, please send us an e-mail reporting your experience. Suggestions, criticisms and greetings are all welcome. Your voice will be very important in helping and guiding us toward future versions of CDAO. Moreover, we may go on to produce new specific applications and sub-ontologies for cases in which CDAO has proven to be more useful and has been adopted by researchers in a given field. Help us to help you.
- Enrico Pontelli: epontell [at] cs.nmsu.edu
- Arlin Stoltzfus: arlin [at] umd.edu
- Julie Thompson: julie [at] igbmc.fr
- Brandon Chisham: bchisham [at] cs.nmsu.edu
- Francisco Prosdocimi: fpros [at] igbmc.fr
Thank you, merci, grazie, obrigado!
Our multi-cultural working group thanks you for visiting this website and for your interest in our work. Thank you! NESCent and the French ANR (EvolHHuPro: BLAN07-1_198915) are gratefully acknowledged for financial support. We are also indebted to the NESCent Evolutionary Informatics working group for sharing expert knowledge and contributing to the concept glossary. Finally, we thank in advance all the future users of our ontology terms, hoping that our work will be well received by a significant number of researchers working in evolutionary biology. Thanks!