Update of /cvsroot/geneontology/go-dev/sql/d2rq-mappings/doc
In directory sc8-pr-cvs2.sourceforge.net:/tmp/cvs-serv27154/d2rq-mappings/doc
Modified Files:
go-d2rq.txt
Log Message:
added new views
added docs to DDL
changed d2rq mappings
Index: go-d2rq.txt
===================================================================
RCS file: /cvsroot/geneontology/go-dev/sql/d2rq-mappings/doc/go-d2rq.txt,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** go-d2rq.txt 3 Nov 2006 21:41:59 -0000 1.3
--- go-d2rq.txt 6 Nov 2006 20:09:20 -0000 1.4
***************
*** 59,67 ****
}
-
-
The r2dq mapping is available here:
!
==Background==
--- 59,65 ----
}
The r2dq mapping is available here:
! http://geneontology.cvs.sourceforge.net/geneontology/go-dev/sql/d2rq-mappings/
==Background==
***************
*** 77,82 ****
; http://www.godatabase.org
We start off with a brief introduction to some in-depth analysis of
! the nature of the propositions embodied in GO annotation records.
===Ontological analysis of GO annotation===
--- 75,83 ----
; http://www.godatabase.org
+ We also assume familiarity with RDF, N3 notation and SPARQL.
+
We start off with a brief introduction to some in-depth analysis of
! the nature of the propositions embodied in GO annotation records; this
! will form our guide for encoding the annotations as RDF.
===Ontological analysis of GO annotation===
***************
*** 87,92 ****
instances of "cell nucleus" in any portion of my body.
! (at this stage we are careful not to confuse the types and instances
! in reality with their representation in computers)
Gene Product records also represent ''types'' - for example, the term
--- 88,93 ----
instances of "cell nucleus" in any portion of my body.
! (we must be careful not to confuse the types and instances in reality
! with their representation in computers)
Gene Product records also represent ''types'' - for example, the term
***************
*** 117,127 ****
type p53 protein participate_in SOME DNA_repair process. Obviously
there are some instances of the p53 protein which do no such
! thing. For a deeper analysis of this problem, see [Ref: Hill/Smith;
forthcoming]
! In addition, many GO annotation records contain gene IDs rather than
! gene product IDs. There is an implicit extra level of indirection
! here: we do not mean that gene X is located to the ER; we mean the
! gene product encoded by X is located to the ER
==RDF Representation==
--- 118,128 ----
type p53 protein participate_in SOME DNA_repair process. Obviously
there are some instances of the p53 protein which do no such
! thing. For a deeper analysis of this issue, see [Ref: Hill/Smith;
forthcoming]
! In addition, many GO annotation records contain ''gene'' IDs rather
! than ''gene product'' IDs. There is an implicit extra level of
! indirection here: we do not mean that gene X is localised to the ER; we
! mean the gene product encoded by X is localised to the ER.
==RDF Representation==
***************
*** 135,139 ****
The namespaces are not final - they are just for illustration. rdf/owl
! properties use come from:
PREFIX vocab: <http://localhost:2020/resource/vocab/>
--- 136,140 ----
The namespaces are not final - they are just for illustration. rdf/owl
! properties used come from:
PREFIX vocab: <http://localhost:2020/resource/vocab/>
***************
*** 145,154 ****
for illustrative purposes.
===Representing the links between genes and GO types===
In choosing how to represent the propositions in GO annotation files
! as RDF there is a tension between the 'ontologically correct'
representation, and the representation that is most expedient in terms
! of current RDF technology.
For example, whilst it is undeniable that genes and gene products are
--- 146,157 ----
for illustrative purposes.
+ (TODO: finalise ns)
+
===Representing the links between genes and GO types===
In choosing how to represent the propositions in GO annotation files
! using RDF there is a tension between the 'ontologically correct'
representation, and the representation that is most expedient in terms
! of current semantic web technology.
For example, whilst it is undeniable that genes and gene products are
***************
*** 169,175 ****
ro:participates_in go:DNA_repair .
In addition, using a class-level representation will require some kind
! of owl entailment in order to get appropriate query results in some
! cases.
There is a partial assumption with the semantic web that all "data"
--- 172,179 ----
ro:participates_in go:DNA_repair .
+ A simpler encoding will be more efficient and less error-prone.
+
In addition, using a class-level representation will require some kind
! of owl entailment in order to get appropriate query results.
There is a partial assumption with the semantic web that all "data"
***************
*** 183,188 ****
accuracy and simply treat genes and gene products as
instances/individuals. This is of course the natural solution to those
! versed in object-oriented and frame-based modeling - see also BioPAX
! and this paper:
Experience Using OWL DL for the Exchange of Biological Pathway
--- 187,192 ----
accuracy and simply treat genes and gene products as
instances/individuals. This is of course the natural solution to those
! versed in object-oriented modeling or traditional database
! record-oriented modeling - see also BioPAX and this paper:
Experience Using OWL DL for the Exchange of Biological Pathway
***************
*** 190,193 ****
--- 194,204 ----
http://www.mindswap.org/2005/OWLWorkshop/sub37.pdf
+ And
+
+ Alan Ruttenberg, Jonathan Rees and Jeremy Zucker. What BioPAX
+ communicates and how to extend OWL to help it
+ (OWLED 2006)
+
+
However, if we abandon ontological principles it is difficult to see
how interoperation will be possible on the semantic web; without
***************
*** 201,204 ****
--- 212,222 ----
service is declared final/stable.
+ One compromise may be the use of so-called "representative
+ instances". These are instances with class-like properties. Details
+ still need to be worked out - for example, should "human p53 protein"
+ be treated as the name of a representative instance, or should
+ annotation records refer to an anonymous representative instance of a
+ class named "human p53 protein"?
+
===Genes and Gene Products===
***************
*** 206,213 ****
appropriate Sequence Ontology class.
To find all 'instances' of SO:protein, you can execute this query:
SELECT * WHERE {
! ?gp rdf:type ?type
}
--- 224,235 ----
appropriate Sequence Ontology class.
+ E.g.
+
+ human:p53 a so:protein ;
+
To find all 'instances' of SO:protein, you can execute this query:
SELECT * WHERE {
! ?gp rdf:type so:protein
}
***************
*** 230,234 ****
be used.
! ===Gene synonyms===
We use vocab:synonym (a placeholder) as predicate
--- 252,256 ----
be used.
! ===Gene/Product synonyms===
We use vocab:synonym (a placeholder) as predicate
***************
*** 241,244 ****
--- 263,297 ----
}
+ We could use rdfs:label, but we prefer to distinguish the preferred
+ name from alternate names.
+
+ We may consider the OboInOWL synonym model here; or SKOS.
+
+ ===Species===
+
+ We use a vocab:in_organism predicate (will be replaced when
+ appropriate RO relation becomes available); the object is an owl:Class
+ of URI NCBITax:nnnn
+
+ The actual taxonomy nodes or graph is not accessible from this sparql
+ endpoint; the idea is that this would be queried/aggregated from
+ elsewhere.
+
+ NCBI Taxonomy in OWL is available from
+
+ http://www.fruitfly.org/~cjm/obo-download
+
+ ===Sequence===
+
+ protein sequence via vocab:has_sequence
+
+ (the test db on which the demo is running lacks sequence data I believe)
+
+ ===DBXrefs===
+
+ uses rdfs:seeAlso predicate; you can query by UniProt, MOD ID, ....
+
+ note the go db is not v complete in this respect
+
===Associations===
***************
*** 248,253 ****
Given the preamble concerning the difficulties in ontologically
pinning down the relationship between a gene and a GO type, we opt for
! an expedient solution of using an undefined vocab:has_role relationship
! between the gene/product instance and the GO class.
Note that having a triple between an instance and a class contravenes
--- 301,310 ----
Given the preamble concerning the difficulties in ontologically
pinning down the relationship between a gene and a GO type, we opt for
! an expedient solution of using a vague, undefined vocab:has_role
! predicate between the gene/product instance and the GO class.
!
! E.g.:
!
! human:p53 vocab:has_role GO:1234567 ;
Note that having a triple between an instance and a class contravenes
***************
*** 272,276 ****
TODO
! allValuesFrom complementOf?
===Annotation, evidence and providence===
--- 329,335 ----
TODO
! represent as has_role link to a class (OWL-AS):
!
! restriction(allValuesFrom complementOf(GO_ID))
===Annotation, evidence and providence===
***************
*** 289,301 ****
The basic idea is as follows. We have a clean separation between
! entities in the bio-domain (p53, DNA repair) and entities in the realm
! of human investigations, experiments and documents.
Annotation is a process which has some kind of agent (human or
! computational), consumes some kind of data source (in GO annotation
! this is typically a record of scientific experimentation such as a
! journal paper) filtered by evidence (here using the OBO evidence code
! terminology, but could in theory use ontologies like OBI) and produces
! propositions.
Propositions are most naturally modeled as RDF statements. The subject
--- 348,360 ----
The basic idea is as follows. We have a clean separation between
! entities in the bio-domain (eg p53, DNA repair) and entities in the
! realm of human investigations, experiments and documents.
Annotation is a process which has some kind of agent (human or
! computational), takes as input some kind of data source (in GO
! annotation this is typically a record of scientific experimentation
! such as a journal paper) filtered by evidence (here using the OBO
! evidence code terminology, but could in theory use ontologies like
! OBI) and produces ''propositions''.
Propositions are most naturally modeled as RDF statements. The subject
***************
*** 306,313 ****
(see associations, above).
SELECT * WHERE {
?gp rdf:type ?type .
?gp rdfs:label ?name .
! FILTER regex(?name, "ab") .
?gp vocab:in_organism ?sp .
?gp vocab:has_role ?role .
--- 365,376 ----
(see associations, above).
+ TODO: show example in n3 here
+
+ Can be retrieved like this:
+
SELECT * WHERE {
?gp rdf:type ?type .
?gp rdfs:label ?name .
! FILTER regex(?name, "abc") .
?gp vocab:in_organism ?sp .
?gp vocab:has_role ?role .
***************
*** 321,333 ****
}
===Terms, the ontology and reasoning===
! The GO database represents GO types and their relationships, as well
! as annotations. The mapping does *not* include the terms and their
! links to eachother.
Of course, this dramatically reduces the utility of the SPARQL
! endpoint when used in isolation - annotation queries should
incorporate the deductive closure of rules that are corollaries of the
definitions in the OBO Relation ontology; otherwise queries for for
--- 384,416 ----
}
+ TODO: currently rdf:statements are conflated with the annotation; in
+ future these will be distinct entities, with instances of the
+ annotationProcess class having a 'posits' predicate linking to the
+ rdf:Statement
+
+ TODO: use bnodes rather than internal database IDs
+
+ ====Provenance====
+
+ vocab:has_source, between annotation and URI representing publication
+
+ TODO: publication instance is untyped; in future this will come from
+ some ontology of documents
+
+ ====Evidence====
+
+ vocab:has_evidence
+
+ TODO: use bnodes rather than internal database IDs
===Terms, the ontology and reasoning===
! The GO database also contains representations of GO types and their
! relationships, as well as the annotations. This RDF mapping does *not*
! include the terms and their links to each other (this would be
! available elsewhere - see OboInOWL)
Of course, this dramatically reduces the utility of the SPARQL
! endpoint when used in isolation - annotation queries should always
incorporate the deductive closure of rules that are corollaries of the
definitions in the OBO Relation ontology; otherwise queries for for
***************
*** 336,351 ****
However, the idea is that the annotation endpoint will be used in
conjunction with other services, including a service for querying GO
! in OWL, including the deductive closure (which will require OWL
! entailment to traverse the transitive part_of links).
In theory, an intelligent query mediator can combine these services
and perform this query optimally; in practice these services as they
! exist for the semantic web are not particularly efficient. It's early
! days and hopefully this situation will improve; SPARQL is a
! declarative language so in theory many optimisations are possible,
! even enough to get queries as fast as a dedicates relational database
! such as the GO Database (http://www.godatabase.org - admittedly not as
! fast as it could be due to some software engineering inefficiencies on
! my part).
We are currently experimenting with DARQ as a query mediator, and with
--- 419,435 ----
However, the idea is that the annotation endpoint will be used in
conjunction with other services, including a service for querying GO
! in OWL, and performing the deductive closure (which will require going
! beyond RDFS to some fragment of OWL entailment to traverse the
! transitive part_of links).
In theory, an intelligent query mediator can combine these services
and perform this query optimally; in practice these services as they
! exist for the semantic web today are not particularly efficient (TODO:
! evaluate DARQ in more detail). It's early days and hopefully this
! situation will improve; SPARQL is a declarative language so in theory
! many optimisations are possible, even enough to get queries as fast as
! a dedicated relational database such as the GO Database
! (http://www.godatabase.org - admittedly not as fast as it could be due
! to some software engineering inefficiencies on my part).
We are currently experimenting with DARQ as a query mediator, and with
***************
*** 355,358 ****
--- 439,480 ----
SPARQL? Is SPARQL even a good language for OWL?]
+ Another approach is to also wrap the ontology tables in the GO
+ database - this includes a table for pre-computing the transitive
+ closure of relationships; this could be used to effectively fake the
+ small fragment of owl entailment required whilst keeping the basic
+ d2rq query engine at simple rdf entailment. (I think we can do this in
+ d2rq by 'overloading' rdf:type...)
+
+ e.g. (TODO)
+
+ SELECT * WHERE {
+ ?gp vocab:has_role ?class .
+ ?class vocab:transitive_reflexive_part_of ?query_class .
+ ?query_class rdfs:label "endoplasmic reticulum"
+ }
+
+ this finds gps that are localised to subtypes or parts of the ER (may
+ seem counter-intuitive; reflexive part_of means it also traverses is_a
+ DAG). E.g. ER lumen, rough ER.
+
+ note this is a simpler scheme than OWL; the following query expresses
+ the same thing with owl semantics:
+
+ SELECT * WHERE {
+ ?gp vocab:has_role ?role_inst .
+ ?role_inst rdf:type ?class .
+ ?class owl:restriction ?r .
+ ?r owl:someValuesFrom ?query_class .
+ ?r owl:onProperty ro:part_of .
+ ?query_class rdfs:label "endoplasmic reticulum"
+ }
+
+ This may be harder to implement in d2rq.
+
+ Note the definition of ro:part_of is reflexive. There may be some
+ clash with owl semantics here, as properties are not reflexive in
+ owl.
+
+
====Term names====
***************
*** 376,382 ****
http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page
! and
!
! http://www.fruitfly.org/~cjm/obo-download
==Conclusions and future work==
--- 498,506 ----
http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page
! Note the service may later discontinue providing ontology detail such
! as term names; or it may encompass the entire ontology (currently d2rq
! does not have any entailment rules so there wouldn't be much point; it
! can in theory be used with jena performing the entailment but this is
! reputedly v slow).
==Conclusions and future work==
***************
*** 404,405 ****
--- 528,533 ----
some kind of throttle capability on the endpoint to alleviate/stop
killer queries.
+
+ The first application using this will probably be overlaying GO
+ annotations over gene glyphs in the new AJAX-GBrowse
+ (http://genome.biowiki.org).