[Obo-discuss] Submission of the ProPreO ontology
From: will york <will@cc...> - 2006-06-05 17:55
Dear OBO collaborators,

We would like to submit our proteomics process ontology (ProPreO) for inclusion in the OBO ontologies. A description of the ontology and how to access it is given below. Thank you for your consideration.

Will York - Complex Carbohydrate Research Center, University of Georgia
Amit Sheth - Large Scale Distributed Information Systems Laboratory, University of Georgia
Satya Sahoo - Large Scale Distributed Information Systems Laboratory, University of Georgia

_ProPreO: comprehensive Proteomics data and process provenance Ontology_

The proteomics discipline, and glycoproteomics in particular, is focused on two core objectives:

a) Identification of a biomolecule - /what is it?/
b) Quantification of the identified biomolecule - /how much of it is there?/

High-throughput experimental protocols for proteomics are rapidly maturing and generating vast amounts of data. As in genomics, the limiting step will be the computational and analytical tools available to process this data and generate useful information.

Many queries in proteomics involve comparison of the proteome in different organisms, tissues, or cells in different developmental or disease states. However, proteomics experimental protocols are characterized by heterogeneity in the source of the sample (biological organism), the process used to generate data (separation techniques or mass spectrometric instruments), the parameters used in the process (instrumental parameters or separation method parameters), and the data formats used to store the data. Hence, proteomics research requires not only finding specific data sets obtained from the relevant biological sources, but also ensuring that the data sets are comparable. For example, differences in sample preparation, data acquisition, or data processing can invalidate a comparison.
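
The comparability concern above can be illustrated with a minimal sketch. This is not ProPreO itself, and none of the names below are ProPreO terms: the `Provenance` record, its field names, and the example values are hypothetical, chosen only to mirror the sources of heterogeneity listed above. Treating two datasets as comparable only when every recorded provenance field matches exactly is a deliberate simplification for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record; field names are illustrative only."""
    organism: str
    separation_technique: str
    instrument: str
    processing_software: str


def provenance_differences(a: Provenance, b: Provenance) -> dict:
    """Return {field_name: (value_a, value_b)} for each field that differs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


def comparable(a: Provenance, b: Provenance) -> bool:
    """Simplifying assumption: comparable only if no provenance field differs."""
    return not provenance_differences(a, b)


# Two hypothetical runs that differ only in the mass spectrometric instrument.
run_1 = Provenance("Homo sapiens", "reverse-phase HPLC", "ion trap", "tool A")
run_2 = Provenance("Homo sapiens", "reverse-phase HPLC",
                   "quadrupole time-of-flight", "tool A")

# Report which provenance fields would block a direct comparison.
for name, (left, right) in provenance_differences(run_1, run_2).items():
    print(f"{name}: {left!r} vs {right!r}")
```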

Provenance, the information regarding the ancestry of a dataset and the description of how the data was created, transformed, and processed, forms the foundation that allows multiple proteomics datasets to be compared in a relevant manner. ProPreO, a proteomics process ontology, models not only /data provenance/ but also /process provenance/ to enable consistent and coherent comparison and analysis of proteomics datasets. Hence, ProPreO is one of the first ontologies focused on capturing comprehensive provenance information in proteomics and related fields.

As part of the bioinformatics research in the National Center for Research Resources (NCRR) Integrated Technology Resource for Biomedical Glycomics, the ProPreO ontology is one of the two ontologies we have developed. The other, related ontology is GlycO, a domain ontology that models the structure and function of glycans. These two ontologies enable the creation of a computational framework for the annotation, retrieval, and analysis of high-throughput experimental proteomics and glycoproteomics data, in order to facilitate the discovery of the biological knowledge that the data embodies.

We adhered to four major criteria during the development of ProPreO:

a) *Logical rigor*: We are using ProPreO for annotation of experimental proteomics data. Using this annotated experimental data, information management applications will be able not only to store, retrieve, and integrate multiple datasets but also to infer implicit knowledge that will provide insight to proteomics researchers for hypothesis formulation and validation. Hence, to allow computational tools to use ProPreO for reasoning purposes, we ensured the absence of incorrectly determined classes, incorrect or inappropriate naming schemes, and ill-defined relationships between concepts in the ProPreO schema.
The ProPreO schema includes 390 rigorously defined classes, 32 generic relations, and 172 specific restrictions on the generic relations to correctly describe each concept and its relation to other concepts.

b) *Amenability to existing bio-medical ontologies*: It is now well understood and accepted that the life sciences domain requires multiple ontologies to manage the inherent complexities of the domain. Hence, in a scenario involving multiple but related ontologies, it is critical that these ontologies can be used in an integrated manner by semantic applications. We have followed the Basic Formal Ontology (BFO) (/Smith B. et al./ 2002) approach in class and relationship creation in ProPreO. The three top-level classes of ProPreO are "data" (datasets and parameter data), (experimental) "instrument", and (experimental) "task". Additionally, we created the relations in ProPreO by defining generic and easily understandable relations at the top-level classes. Using various restrictions, we defined how the generic relations apply to each class, thereby accurately and efficiently modeling the characteristics of each concept and its relations with other concepts. Currently, we are working on issues related to the integration, mapping, and alignment of ProPreO with ontologies listed in the Open Biomedical Ontologies (OBO) repository.

c) *Use of OWL-DL language*: The Web Ontology Language (OWL) has three flavors, namely OWL-Lite, OWL-DL, and OWL-Full. Because we intended ProPreO to be used by computational applications while being as accurate as possible in expressing the inherent complexity of the proteomics experimental domain, we chose OWL-DL as the language for ProPreO. OWL-DL enables us to be expressive while ensuring acceptable computational properties.

d) *Populated ontology*: We believe that an ontology schema is of limited use without real-world knowledge.
We have populated ProPreO with instances corresponding to the concepts modeled in the ontology schema. ProPreO has 3.1 million instances and 18.6 million triples. This population of ProPreO with millions of real-world instances enables us to build computational tools that integrate large volumes of high-throughput experimental data within an overarching semantic framework and reason over it for knowledge discovery.

These four criteria have enabled ProPreO to provide the formal semantic foundation for modeling and incorporating comprehensive provenance information in wide-ranging, high-throughput proteomics research.

ProPreO (version: 0.5) is available for download in the following variants at http://lsdis.cs.uga.edu/projects/glycomics/propreo/:

a) *ProPreO schema*: The schema of the ontology, featuring its 390 classes and attendant relations. This is an /*.owl/ file, which is best viewed using the Protégé ontology development environment (http://protege.stanford.edu/). The ProPreO schema file is relatively small and hence may be used to gain an understanding of the structure of the ontology and its applicability to various scenarios.

b) *ProPreO populated ontology*: This file includes the 3.1 million instances and hence is a relatively large /*.owl/ file. ProPreO is currently populated with instances related to human tryptic peptides, their parent proteins, and related enzyme entities. The populated ProPreO ontology may be used as a foundation for developing various semantic applications that leverage its instances and its comprehensive provenance framework.

For citation and further details on ProPreO: Satya S. Sahoo, Christopher Thomas, Amit Sheth, William S. York, and Samir Tartir, "Knowledge Modeling and its application in Life Sciences: A Tale of two ontologies" <http://lsdis.cs.uga.edu/library/download/p1088-sahoo.pdf>, the 15th World Wide Web (WWW, 2006) conference <http://www2006.org/programme/item.php?id=1088>, Edinburgh, UK, May 2006.