[Digir-protocol] Multiple schemas


As Anton has mentioned via e-mail, we currently expect that the full data
model for collection data will be some version of the ABCD schema, which is
heavily nested, with multiple optional sections (including strict choices)
and with repeating subelements.  These characteristics do not seem easy to
fit into the current DiGIR substitutable element search model.

In Brazil we therefore acknowledged that there does not necessarily have to
be a single common schema for both query and results.  The suggestion from
the meeting was that the query schema should be in effect Darwin Core 3, and
that the result schema should be the ABCD model.

In fact the current DiGIR model uses the federation schema for three
logically separable purposes: as a foundation from which to construct the
query, as a logical schema to which physical database schemas are mapped,
and as the record structure for results.  

Personally I remain somewhat uneasy with having different schemas for the
different purposes since I would like it to be possible for software to
validate all of our data mappings and I am not sure that we could easily do
it between two or more arbitrary schemas.  In effect the implementor of each
provider software application would be responsible for interpreting the
relationships between the different schemas.  In practice my qualms are
probably just a matter of philosophy but they remain.  

It also seems to me that separating the schemas ignores one of the major
benefits which may follow from the DiGIR protocol (although I acknowledge
that it is outside the concern of DiGIR as a query and transfer protocol).
That benefit would come from having standard methods for a software tool
(such as the PHP Provider) to automate the mapping between a query, a
physical data model and a result structure.  If all of these relate to the
same schema, the process may be tricky but it is logically just a matter of
programming.  If multiple schemas are used for the different aspects, at the
very least extra definitions would be required to define how these
mappings are to be performed.  This seems unnecessarily complex.  (I believe
that this issue is less important for BioCASE, since there the protocol will
be used simply for transfer and there is therefore no need for DiGIR to be
able to define the other mappings internal to the provider.)
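
To make the point concrete, here is a minimal sketch in Python (not how
the PHP Provider actually works).  All names are hypothetical: the concept
names, the physical column names, the ABCD-style path and the functions
build_select and unmapped are invented for illustration.  With a single
schema, one concept-to-column map serves both the query filter and the
result fields.  With two schemas, an extra crosswalk is needed, and the
most that software can check automatically is whether it is complete.

```python
from typing import Dict, List

# Hypothetical single federation schema: concept name -> physical column.
PHYSICAL_MAP: Dict[str, str] = {
    "ScientificName": "sci_name",
    "Collector": "collector",
    "Country": "country",
}


def build_select(table: str, filter_concept: str,
                 result_concepts: List[str],
                 physical_map: Dict[str, str]) -> str:
    """Build a parameterised SELECT from schema concepts alone.

    The same map resolves the filter and the result columns, which is
    only possible because query and result share one schema.
    """
    columns = ", ".join(f"{physical_map[c]} AS {c}" for c in result_concepts)
    return (f"SELECT {columns} FROM {table} "
            f"WHERE {physical_map[filter_concept]} = ?")


def unmapped(query_concepts: List[str], crosswalk: Dict[str, str]) -> List[str]:
    """With separate query and result schemas, list the query concepts
    that have no entry in the (hypothetical) crosswalk between them."""
    return [c for c in query_concepts if c not in crosswalk]


sql = build_select("specimens", "Country",
                   ["ScientificName", "Collector"], PHYSICAL_MAP)

# Assumed crosswalk from a query schema to an ABCD-like result path.
crosswalk = {"ScientificName": "Unit/Identification/Name"}
missing = unmapped(["ScientificName", "Country"], crosswalk)
```

In the single-schema case nothing beyond PHYSICAL_MAP has to be defined.
In the two-schema case somebody has to write and maintain the crosswalk,
which is the extra definition referred to above.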

My hope (based on the combination of the DiGIR meetings and the ABCD
sessions) is that any future version of Darwin Core will in fact model
documents which are valid under the ABCD Schema (a small subset of the total
set of fields, probably corresponding to what most people will in fact be
able to populate within the schema).  I think that this should be possible
and will in fact allow us in practice to maintain a single schema without
any philosophical issues about how to perform correct mappings between the
schemas.  

We could define certain characteristics of schema elements which make them
unavailable as query elements (e.g. some combination of depth in document,
optionality and cardinality).  The problem is that even Darwin Core will
include repeating subelements (e.g. for collector or higher taxon).  This
will mean that we need to understand how to model these within DiGIR
queries.
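
As a rough illustration of such a rule, the following Python sketch marks
an element as queryable only if it is shallow enough and does not repeat.
Everything here is an assumption made for the example: the SchemaElement
type, the element paths, the depth threshold and the is_queryable
function are hypothetical and are not part of DiGIR, Darwin Core or ABCD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemaElement:
    path: str                  # slash-separated path from the record root
    min_occurs: int            # 0 means the element is optional
    max_occurs: Optional[int]  # None means unbounded (repeating)

    @property
    def depth(self) -> int:
        return self.path.count("/") + 1

    @property
    def repeating(self) -> bool:
        return self.max_occurs is None or self.max_occurs > 1


def is_queryable(el: SchemaElement, max_depth: int = 3,
                 allow_optional: bool = True) -> bool:
    """Hypothetical rule combining depth, optionality and cardinality."""
    if el.depth > max_depth:
        return False
    if el.repeating:
        return False
    if el.min_occurs == 0 and not allow_optional:
        return False
    return True


name = SchemaElement("Record/ScientificName", 1, 1)
collector = SchemaElement("Record/Collectors/Collector", 0, None)
latitude = SchemaElement("Record/Unit/Gathering/Site/Coordinates/Latitude", 0, 1)

queryable = [e.path for e in (name, collector, latitude) if is_queryable(e)]
```

Under this rule the repeating collector element is excluded from queries,
which shows the difficulty: a rule that simply rules out repeating
elements would also rule out fields that people will want to search on.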

The bottom line here is that I think we were able to agree in Brazil in
part because BioCASE needed to be able to separate the query and result
schemas, whereas I assumed that we would ultimately be able to combine both
as different subsets of the same schema.  We ignored larger questions about
what the DiGIR protocol may or may not be able to achieve because the
specific case which interests us first can probably work without answering
them.  Speaking both from my role in GBIF and as a software architect, I
would like to see DiGIR become a common transport and query protocol for use
in any subject domain in which we want to exchange data.  For that reason I
would like to resolve these more abstract issues before we go too far.

Donald

---------------------------------------------------------------
Donald Hobern
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Zoological Museum - University of Copenhagen
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Fax: +45-35321480 E-mail: dh...@gb...
---------------------------------------------------------------