[sbml-annot] SBML Level 3 Package Proposal: Annotation


Hi,
Here are my comments on the SBML Level 3 Package proposal on annotation:
http://precedings.nature.com/documents/5610/version/1

A. syntax and semantics of containers and collections
Containers (bag, seq, alt) specify *groups* whose members may be ordered
(seq) or unordered (bag, alt), and which may contain duplicates (bag, seq)
or only unique members (alt).
Collections (list) specify *groups* that can only contain the specified
members.

There are several criticisms of using these constructs for the SBML
annotations.
0 - the relation is from the species to a group which has particular
members. I don't believe this is really desirable because it both dilutes
and confuses the semantics of the relation.
1 - these constructs are seldom (if at all) used in the linked data
community. Their semantics are more amenable to being used to list items in
forms or surveys, where you actually want to order list items (e.g. HTML
ordered/unordered lists), or restrict the value choices (e.g. radio
buttons).
2 - these constructs require one to write a different kind of SPARQL query
to get at the values. Instead of asking for the value of a subject,
predicate or subject-predicate expression (e.g. :a :p ?y), one now has to
ask for the member of a collection (e.g. using Jena :a :p [ rdfs:member ?y ],
or, with no application-specific shortcut, something significantly uglier).
3 - these constructs are not supported in OWL (they are elements of the
syntax of the language, but not part of what modelers use)
4 - I don't think the cited examples are valid in the context of the stated
intent.
a. Figure 6 shows two annotations linked through the "is" property to a
glucose species. Given the intended semantics of "is" and that ChEBI and
KEGG are two different resources, it seems to me that what is actually meant
is that this species corresponds to (represents) the physical entities
denoted by the ChEBI and KEGG identifiers, and that these should not be
disjoint types. I really see two distinct statements (in Turtle format):
 :meta_glc bqbiol:is <urn:miriam:obo.chebi:CHEBI%3417234> .
 :meta_glc bqbiol:is <urn:miriam:kegg.compound:C00234> .

b. Figure 7 shows the annotation of a calcium-calmodulin complex, in which
the intent is to state that the complex is composed of calcium and that the
complex is composed of calmodulin. The mereological nature of this statement
does not require one to state either a conjunction (which would imply that
the value is a member of both types) or a disjunction (which would imply
that the value is a member of one or the other type); rather, it should be
treated as a set of separate statements:

 :cacam bqbiol:hasPart <urn:miriam:uniprot:P62158> .
 :cacam bqbiol:hasPart <urn:miriam:kegg.compound:C00076> .

c. Figure 8 shows how bag elements can be separated, but I question the need
to have a bag involved at any level here.

d. negative statements - the syntax for negative object assertions is
provided by OWL2.
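
To make point 2 concrete, here is a rough sketch in plain Python (no RDF
library) that models triples as tuples. The identifiers are placeholders,
and rdfs:member stands in for the rdf:_1, rdf:_2, ... membership properties;
none of this is taken from the proposal itself. It only illustrates that the
bag form needs a two-step lookup where the flat form needs one.

```python
# Plain-Python sketch: triples are (subject, predicate, object) tuples.
# All identifiers are illustrative placeholders, not from the proposal.

def objects(graph, subject, predicate):
    """Return the objects of all triples matching (subject, predicate, ?o)."""
    return [o for (s, p, o) in graph if s == subject and p == predicate]

# Style 1: one statement per annotation (pattern  :a :p ?y).
flat = [
    (":glc", "bqbiol:is", "chebi:EXAMPLE"),
    (":glc", "bqbiol:is", "kegg:EXAMPLE"),
]

# Style 2: the species points at a bag node that holds the members
# (pattern  :a :p ?bag . ?bag rdfs:member ?y).
bagged = [
    (":glc", "bqbiol:is", "_:bag1"),
    ("_:bag1", "rdf:type", "rdf:Bag"),
    ("_:bag1", "rdfs:member", "chebi:EXAMPLE"),
    ("_:bag1", "rdfs:member", "kegg:EXAMPLE"),
]

def bag_members(graph, subject, predicate):
    """Two-step lookup: find the container node(s), then their members."""
    members = []
    for node in objects(graph, subject, predicate):
        members.extend(objects(graph, node, "rdfs:member"))
    return members

direct = objects(flat, ":glc", "bqbiol:is")          # one hop
via_bag = bag_members(bagged, ":glc", "bqbiol:is")   # two hops
naive = objects(bagged, ":glc", "bqbiol:is")         # only finds the bag node
```

A client that issues the simple one-hop query against the bag form gets back
the container node rather than the annotations, so every consumer has to know
which of the two shapes it is dealing with.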

B. predicates and qualifiers

1. "To satisfy RDF, predicates should be nouns,"
Knowledge representation languages such as RDF/OWL are agnostic when it
comes to the choice of the characters in a symbol, save those that are
reserved as elements of the language. The naming of entities is entirely up
to the modeler and has no bearing on how a tool interprets the data. Thus,
the choice of nouns or verbs is entirely within the control of the modeler.
IMHO, verb predicates convey the nature of the relation more accurately, and
improve the quality of tools that want to work with (controlled) natural
language expressions. Ultimately this is a style choice.

2. The new predicates listed in the appendix are not described, and hence I
cannot offer an opinion on their merit or on whether they are intrinsically
different relations. However, judging by the names alone, I doubt very much
that this is a direction you would actually benefit from.

C. Provenance
1. Statements about attributes
While I find the use of XPath expressions to identify the parts of the XML
element that one wants to refer to both necessary and interesting, I fear
that the name may not be sufficiently unique, and that in a large RDF graph
of such annotations they would get jumbled up.

2. Statements about statements
The problem with the reification plan is that internal identifiers (rdf:ID)
are never guaranteed to remain the same once assigned, and as such cannot be
considered linkable data. Thus annotations will necessarily have to be
created and maintained in that same file, and this will preclude others from
commenting on annotations. Another solution is to consider
OWL2's annotation object, which is strongly typed to reify statements, and
may be assigned a stable URI. Additional semantics of annotation

3. Options: In both 1 and 2 above, I would suggest the use of MIRIAM
identifiers for both models and their components. I would also suggest
investigating the provenance ontology
[http://trdf.sourceforge.net/provenance/ns.html]

4. n-ary relations (referred to as non-binary relations)
n-ary relations are problematic for a large number of reasons, ranging from
restrictions on the number and type of relations to the decidability of
reasoning. For these and other reasons, OWL2 maintains only binary object
relations (and hence the created annotations would be incompatible with an
OWL knowledge base), which then forces one to adopt more principled/modular
patterns in the design of expressive and reason-able ontologies. Thus, I
would recommend thinking about the nature of the entities and the relations
that hold between them.

The use case is "Hexokinase 2 is modified by phosphoserine in position 158".
First, the use case is badly worded, and I think it refers to either:
1 - there exists a variant of hexokinase 2 which contains a phosphorylated
serine at position 158 (because surely hexokinase 2 and modified hexokinase
2 are two different kinds), or
2 - there exists a process which phosphorylates hexokinase 2's serine at
position 158 (and hence there is a regular hexokinase + phosphate as input
and a phosphorylated hexokinase as output).

In order to express (1), we might want to state (in Turtle syntax):
 :hexokinase-2-PS158
     rdf:type :protein ;
     :is-variant-of :hexokinase-2 ;
     :has-proper-part [
         rdf:type :phosphoserine ;
         :has-attribute [
             rdf:type :sequence-position ;
             :has-value "158"^^xsd:int ] ] .

In this way, we maintain the use of binary relations, and we now have new
types which have relations that are appropriate to them. Thus modelling can
be better controlled, and the evolution of new types with additional
restrictions follows the same design pattern. This pattern follows what we
are doing with the Semanticscience Integrated Ontology (SIO) -
http://code.google.com/p/semanticscience/wiki/SIO
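
As a rough illustration of how a consumer would walk this pattern, the sketch
below restates the Turtle for case (1) as plain Python tuples (no RDF library)
and follows only binary relations to recover the modified position. The
blank-node labels "_:b1" and "_:b2" and the helper names are my own
assumptions for illustration, not part of SIO or the proposal.

```python
# Triples mirroring the Turtle for case (1); "_:b1" and "_:b2" are
# hypothetical labels for the two anonymous (blank) nodes.
graph = [
    (":hexokinase-2-PS158", "rdf:type", ":protein"),
    (":hexokinase-2-PS158", ":is-variant-of", ":hexokinase-2"),
    (":hexokinase-2-PS158", ":has-proper-part", "_:b1"),
    ("_:b1", "rdf:type", ":phosphoserine"),
    ("_:b1", ":has-attribute", "_:b2"),
    ("_:b2", "rdf:type", ":sequence-position"),
    ("_:b2", ":has-value", 158),
]

def objects(graph, subject, predicate):
    """Return the objects of all triples matching (subject, predicate, ?o)."""
    return [o for (s, p, o) in graph if s == subject and p == predicate]

def has_type(graph, node, type_):
    """True if the node carries the given rdf:type."""
    return type_ in objects(graph, node, "rdf:type")

def modified_positions(graph, protein, residue_type):
    """Follow has-proper-part, then has-attribute, then has-value."""
    positions = []
    for part in objects(graph, protein, ":has-proper-part"):
        if not has_type(graph, part, residue_type):
            continue
        for attr in objects(graph, part, ":has-attribute"):
            if has_type(graph, attr, ":sequence-position"):
                positions.extend(objects(graph, attr, ":has-value"))
    return positions

positions = modified_positions(graph, ":hexokinase-2-PS158", ":phosphoserine")
```

Every statement stays a three-part triple; the position is attached to the
phosphoserine part rather than squeezed into a single multi-argument relation.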

I hope the above is helpful in refining the proposal so that it reflects
more powerful design patterns for the RDF/OWL Semantic Web languages.

m.

-- 
Michel Dumontier
Associate Professor of Bioinformatics
Carleton University
http://dumontierlab.com
