The BioCASE project (www.biocase.org) aims at the implementation of a data access network for European biological collections. This will be both a network of National Nodes offering summary information on collections and a network of unit-level (specimen or observation) information providers linked together in a common access system. The latter effort also serves as a contribution to global networking of collection resources, so we want to create a system that can interlink with other such systems, leaving the collection holder with various options for participation in such networks.
For data definitions and as the base of the XML document returned from the information provider, BioCASE will use the CODATA/TDWG/BioCASE (CTB) schema currently under development (http://www.bgbm.org/tdwg/codata/Schema/). This includes a mapping to the Darwin Core data elements currently used by DiGIR and several American networks. [Hopefully, we will all agree at some time to use an ABCD schema as a standard.]
BioCASE plans to implement a subset of the DiGIR Protocol to ensure that software components will be interchangeable and that participants in BioCASE can be linked to global initiatives (such as GBIF) with a minimum of incompatibilities in the future.
When studying the DiGIR documents and software, we came across one important caveat: the integration of protocol-specific features in the federation schema used by DiGIR.
The current DiGIR.xsd protocol definition requires the use of abstract types for integrity checking between operators and operands. Consequently, elements to be queried are defined at the data schema's root level and are referenced from the positions within the schema where they are used. From our point of view, this approach has three disadvantages:
(1) Data schema and protocol schema are not independent. Using the data schema in a different context (e.g. exporting collection data into files) requires a modified data schema.
(2) Since abstract elements can only be substituted by root-level elements, potential query elements have to be defined globally. This means that different sets of query elements for different applications lead to different data schema definitions.
(3) This also means that the ABCD group's decision to use common type definitions for structurally identical elements wherever possible cannot be maintained.
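To illustrate point (2), here is a minimal sketch in Python. The schema fragment and all element names in it (searchable, ScientificName, Collector, Notes, Unit) are our own hypothetical examples and are not taken from DiGIR.xsd or the CTB schema. It only shows the general XML Schema rule that members of a substitution group must be declared globally, so a locally declared element such as Notes can never become a query element.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical data schema fragment; all names are illustrative assumptions.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="searchable" abstract="true"/>
  <xs:element name="ScientificName" type="xs:string" substitutionGroup="searchable"/>
  <xs:element name="Collector" type="xs:string" substitutionGroup="searchable"/>
  <xs:element name="Unit">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="ScientificName"/>
        <xs:element ref="Collector"/>
        <xs:element name="Notes" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def queryable_elements(schema_text, head):
    """Return names of global elements substitutable for the abstract head."""
    root = ET.fromstring(schema_text)
    # Only direct children of xs:schema are global declarations.
    return [
        el.get("name")
        for el in root.findall(f"{{{XS}}}element")
        if el.get("substitutionGroup") == head
    ]


if __name__ == "__main__":
    print(queryable_elements(SCHEMA, "searchable"))
```

In this fragment, making Notes queryable would mean moving its declaration to the schema root, which is exactly the coupling between protocol and data schema described above.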
To overcome this problem without losing the excellent integrity-checking capabilities of the DiGIR protocol, we suggest decoupling the data schema from its protocol-specific properties.
Introducing a schema query.xsd containing the root-level elements formerly residing within the data schema could be a solution. With this, the DiGIR schema would refer to different namespaces for queries and responses. An appinfo annotation could be added to each element in query.xsd, stating the path to its original data-bearing element or attribute within the data schema. An additional parameter in the DiGIR protocol could specify the schema to be used for the response. This would enable data providers to support several output schemas (e.g. Darwin Core and CTB) independently of the request schema.
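As a rough sketch of how such annotations might be read, the following Python fragment extracts the path stored with each query element. The namespace URI, the element names, and the paths are hypothetical placeholders of ours; the actual layout of query.xsd is still to be decided.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical query.xsd; the target namespace, names, and paths are assumptions.
QUERY_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/query">
  <xs:element name="ScientificName" type="xs:string">
    <xs:annotation>
      <xs:appinfo>/DataSet/Unit/Identification/ScientificName</xs:appinfo>
    </xs:annotation>
  </xs:element>
  <xs:element name="Collector" type="xs:string">
    <xs:annotation>
      <xs:appinfo>/DataSet/Unit/Gathering/Collector</xs:appinfo>
    </xs:annotation>
  </xs:element>
</xs:schema>"""


def query_paths(xsd_text):
    """Map each global query element to the data schema path in its appinfo."""
    root = ET.fromstring(xsd_text)
    paths = {}
    for el in root.findall(f"{{{XS}}}element"):
        info = el.find(f"{{{XS}}}annotation/{{{XS}}}appinfo")
        if info is not None and info.text:
            paths[el.get("name")] = info.text.strip()
    return paths


if __name__ == "__main__":
    for name, path in query_paths(QUERY_XSD).items():
        print(name, "->", path)
```

Because the paths live only in query.xsd, the data schema itself would need no protocol-specific additions.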
The wrapper software used to implement the protocol for data providers would use two configuration files. The first would map elements defined in the query.xsd schema to database attributes. The second would map database attributes to paths within the (original) data-defining schema, thus enabling the wrapper to return XML documents that comply with the data schema within the response wrapper. Since the CTB schema will provide a large space of potential attributes on the provider's side, we are currently considering developing tools to support configuration file generation and to offer typical predefined configurations for specific subject domains (e.g. herbarium collections or mapping records).
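The two-step mapping could look roughly like the Python sketch below. It is only an illustration under our own assumptions: the two dictionaries stand in for the two configuration files, and the column names (sci_name, collector), the element names, and the paths are hypothetical. No wrapper software with this interface exists yet.

```python
import xml.etree.ElementTree as ET

# Stand-in for configuration file 1: query.xsd element -> database attribute.
QUERY_TO_DB = {"ScientificName": "sci_name", "Collector": "collector"}

# Stand-in for configuration file 2: database attribute -> data schema path.
# Every path is assumed to start at the hypothetical root element "Unit".
DB_TO_PATH = {
    "sci_name": "Unit/Identification/ScientificName",
    "collector": "Unit/Gathering/Collector",
}


def build_where(filters):
    """Translate {query element: value} into a parameterised WHERE clause."""
    clauses, params = [], []
    for element, value in filters.items():
        clauses.append(f"{QUERY_TO_DB[element]} = ?")
        params.append(value)
    return " AND ".join(clauses), params


def row_to_xml(row):
    """Place each mapped database value at its path below a Unit element."""
    root = ET.Element("Unit")
    for column, value in row.items():
        path = DB_TO_PATH.get(column)
        if path is None:
            continue  # unmapped attributes are not exposed
        node = root
        for step in path.split("/")[1:]:  # skip the root step "Unit"
            child = node.find(step)
            if child is None:
                child = ET.SubElement(node, step)
            node = child
        node.text = str(value)
    return root


if __name__ == "__main__":
    print(build_where({"ScientificName": "Bellis perennis"}))
    row = {"sci_name": "Bellis perennis", "collector": "Smith", "internal_id": 7}
    print(ET.tostring(row_to_xml(row), encoding="unicode"))
```

Keeping the two mappings separate means a provider could add a second output schema by supplying another version of the second file only, without touching the query mapping.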
For pragmatic reasons, BioCASE will probably implement its portal prototype as an API rather than a web service. At least initially, the user interface, query distribution, and (private) registry will reside on the same machine. We will take this simplified approach because of unresolved IPR issues concerning access to provider databases in Europe and because of time pressure for the setup of an initial unit-level network. Using the UDDI registry mechanism at a later stage is of course possible, but for the time being we will rely on the BioCASE CORM (core metadatabase) as the registry.
Looking forward to your comments
Anton Güntsch, Markus Döring, and Walter Berendsohn