RE: XMLDB as backend? (was RE: [Digir-dev] summary of proposals)


> -----Original Message-----
> From: Guentsch, Anton [mailto:a.g...@bg...]
> Sent: Thursday, October 31, 2002 10:48
> To: Vieglais, David A
> Cc: DiG...@li...
> Subject: RE: XMLDB as backend? (was RE: [Digir-dev] summary of proposals)
> 
> Hi Dave,
> 
> this is going in a completely different direction, I think. We did not
> propose to use XPath as the query language, nor do we expect the providers
> to be XML data sources, XML-wrapped relational databases, or XML files
> exported from a database. The idea was to stay with the DiGIR query
> language and use a well-known syntax (XPath) to point to data-bearing
> elements/attributes within an XML document that conforms to the access
> schema.
> 
> The idea of exporting entire collection databases into xml files to be
> queried with xpath was discussed in the DIGIR meeting during TDWG and
> rejected. If I remember it correctly, the arguments were
> 
> a) You need to reexport your data frequently as they change.

Certainly true, but most of our data providers do this anyway to separate
the working database from the public one.  

> b) You have to export your data into an xml file for every schema you want
> to support.

Certainly not true.  The XML database would be accessed through an API like
XML:DB, which is essentially equivalent to ODBC or JDBC.  Transformation of
the response records would still be performed by the wrapper software
(DiGIR) as they are being streamed to the client.
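
To make that concrete, here is a minimal Python sketch of the wrapper
reshaping records one at a time as they are streamed out. It is an
illustration only, not DiGIR code: the transform_stream function, the
SPECIES_ONLY mapping, and the sample record layout are all assumptions, and
the standard library's ElementTree stands in for whatever the XML database
API would actually return.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator

# Hypothetical mapping: output element name -> path inside a stored record.
SPECIES_ONLY: Dict[str, str] = {"Species": "unit/identification/name/species"}


def transform_stream(records: Iterable[ET.Element],
                     mapping: Dict[str, str]) -> Iterator[str]:
    """Yield one serialized output record per stored record, lazily."""
    for record in records:
        out = ET.Element("record")
        for name, path in mapping.items():
            node = record.find(path)  # first match for the path, or None
            if node is not None:
                ET.SubElement(out, name).text = node.text
        yield ET.tostring(out, encoding="unicode")


# Assumed sample data standing in for the backend's result set.
STORED = ET.fromstring(
    "<records><record><unit>"
    "<identification><name><species>Abies alba</species></name></identification>"
    "<collector>Someone</collector>"
    "</unit></record></records>"
)

if __name__ == "__main__":
    for chunk in transform_stream(STORED.iter("record"), SPECIES_ONLY):
        print(chunk)
```

Because the function is a generator, each record is transformed only when the
client side asks for the next one, so one stored data set can feed any number
of output schemas without being re-exported.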

> c) The technology (databases storing xml natively, xml wrappers for
> relational databases) was considered immature and therefore risky in
> consideration of our tight time frame.
> 

This is something that is developing quite quickly.  Both Oracle and
Microsoft have stable implementations of XPath interfaces to their databases
(the latter for SQL Server 2000).  I discarded (open source) XML databases
as a viable option a few months ago because they were definitely unstable
and experimental.  Now it seems that, although a bit buggy, they do work
quite effectively.

> I am not quite sure whether these arguments really hold water but I just
> wanted to let you know the Campinas conclusions.
> 

Ok.  I don't think that using an XMLDB as the backend is going off in a
completely different direction though.  The solution you suggested for
specifying the path to a concept in a query would certainly work (and be
easy to implement in the current version of the provider software).  The
only differences from the DiGIR protocol point of view would be the ability
to specify a path to a concept in the filter and, perhaps, the option of
specifying the filter entirely as an XPath statement.  In the first case,
the DiGIR provider would transform the DiGIR filter to an XPath statement to
apply against the backend, and in the latter case it would simply pass the
query straight through to the backend.

The rest of the specification could, I think, remain essentially unchanged.
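
For the first case, a rough Python sketch of the filter-to-XPath translation
might look like the following. The queryTarget element and its location
attribute follow Anton's example; everything else (the function names, the
"record" wrapper element, and the handling of and/or/not) is assumed for
illustration and is not the DiGIR provider's actual code.

```python
import xml.etree.ElementTree as ET


def xpath_literal(value: str) -> str:
    """Quote a string for XPath 1.0, which has no escape mechanism."""
    if '"' not in value:
        return f'"{value}"'
    if "'" not in value:
        return f"'{value}'"
    raise ValueError("value contains both kinds of quote character")


def predicate(node: ET.Element) -> str:
    """Translate one filter element into an XPath predicate expression."""
    if node.tag == "equals":
        target = node.find("queryTarget")
        if target is None:
            raise ValueError("equals needs a queryTarget child")
        # Paths are taken relative to the record element (an assumption).
        path = target.get("location", "").lstrip("/")
        return f"{path}={xpath_literal(target.text or '')}"
    if node.tag in ("and", "or"):
        parts = [predicate(child) for child in node]
        if not parts:
            raise ValueError(f"{node.tag} needs at least one operand")
        return "(" + f" {node.tag} ".join(parts) + ")"
    if node.tag == "not":
        return f"not({predicate(node[0])})"
    raise ValueError(f"unsupported operator: {node.tag}")


def filter_to_xpath(filter_xml: str, record_element: str = "record") -> str:
    """Build the XPath statement to hand to the backend."""
    return f"//{record_element}[{predicate(ET.fromstring(filter_xml))}]"


if __name__ == "__main__":
    print(filter_to_xpath(
        '<equals><queryTarget location="/unit/identification/name/species">'
        'MyString</queryTarget></equals>'
    ))
    # prints: //record[unit/identification/name/species="MyString"]
```

In the second case no translation is needed at all: the provider would hand
the XPath string it received to the backend unchanged.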

Cheers,
  Dave V.

> Best
> Anton
> 
> > -----Original Message-----
> > From: Vieglais, David A [mailto:vie...@ma...]
> > Sent: Thursday, October 31, 2002 16:16
> > To: DiG...@li...
> > Subject: XMLDB as backend? (was RE: [Digir-dev] summary of proposals)
> >
> >
> > > -----Original Message-----
> > > From: Guentsch, Anton [mailto:a.g...@bg...]
> > > Sent: Thursday, October 31, 2002 05:16
> > > To: DiG...@li...
> > > Subject: RE: [Digir-dev] summary of proposals
> > >
> > > I support Markus' solution of (mis?)using XPath as a means to query
> > > elements in a hierarchical schema. The structural changes within the
> > > DiGIR protocol would be minimal. A DiGIR expression
> > >
> > > <equals>
> > > 	<darwin:Species>MyString</darwin:Species>
> > > </equals>
> > >
> > > would for example turn into something like
> > >
> > > <equals>
> > > 	<queryTarget
> > > location="/unit/identification/name/species">MyString</queryTarget>
> > > </equals>
> > >
> > > With this, a structured access schema can be used, and in many cases
> > > this will be the same as the result schema. It is simply up to a
> > > provider, or an agreement within a network, which schemas are supported.
> > > On the other hand, we are losing some inherent XML integrity when
> > > constructing queries. This is because the element
> > > (/unit/identification/name/species) is no longer bound to its valid
> > > operators. But I think we could live with that, and a set of valid
> > > operators for a given element could probably be derived from the XML
> > > base type used in the access schema (?).
> > >
> > > Anton
> >
> >
> > Specifying a path in the query is not really the same thing as using
> > XPath for a query.  XPath is quite a bit richer than that; the query for
> > the example above would be more like:
> >
> > //record[unit/identification/name/species="MyString"]
> >
> > which locates the <record> elements that contain
> > unit/identification/name/species elements whose content exactly equals
> > the string "MyString".
> >
> > There are a number of functions available in XPath (for substring and so
> > forth) plus the concept of axes, which basically allows references
> > relative to a node (child and parent type operations, for example).
> >
> > Full use of XPath would be a really nice solution, allowing very complete
> > and complex queries to be performed on complex schemas such as that
> > proposed by the BioCASE project.  However, integrating full XPath into
> > DiGIR would be a fairly significant task: a parser would need to be
> > developed, borrowed, or adapted, and a model for mapping XPath onto
> > relational databases would have to be developed.  These are not trivial
> > tasks and, as far as I'm aware, there are no really good solutions for
> > the latter unless a specific database schema is used.
> >
> > Unless we used an XML database as the backend data store for DiGIR.
> > After being confronted with the horror (:-) of the ABCD schema, I think
> > using an XML database as the backend is the only realistic solution if we
> > want something up and running fairly quickly.  To use this approach, the
> > data from the existing collection database is extracted into a couple of
> > large XML files (or several thousand smaller ones, depending on the
> > design of the database engine), with each record conforming to a
> > consistent schema such as the ABCD schema.
> >
> > Then querying the database is simply a matter of passing the XPath query
> > on to the backend, then streaming the matching records back to the
> > client.  Very easy.  With a little effort you could even morph the
> > outgoing records to match a particular schema or subset.
> >
> > I find this solution so intriguing that I started messing with a couple
> > of XML database engines yesterday.  Two in particular are worth
> > mentioning: eXist (http://exist-db.org/) and Xindice
> > (http://xml.apache.org/xindice/).  Both are open source, and both are in
> > a reasonably stable state of development.  More importantly, they both
> > support the emerging XML:DB (http://www.xmldb.org/) standard API for
> > interfacing to XML databases.  In some preliminary tests with eXist
> > yesterday I loaded 16,000 records into the database (as a single XML
> > file, about 9 MB).  The engine indexed the file in about a minute.
> > Queries against the engine were very fast, with no query taking longer
> > than 30 ms.
> >
> > If this approach is adopted for DiGIR then the protocol would change
> > quite a bit: the query structure would be much simpler (just an XPath
> > string) and would be maintained by the W3C (less work for us).  The
> > backend would be interfacing to another (emerging) standard based on the
> > XML:DB API.  The complexity of the data being stored would not matter:
> > just define the schema for a common (community) record format and push
> > the records into the database.  Overall, a lot of pluses in functionality
> > and standards conformance.
> >
> > Regards,
> >   Dave V.
> >
> >
> > >
> > > > -----Original Message-----
> > > > From: Döring, Markus
> > > > Sent: Wednesday, October 30, 2002 16:16
> > > > To: DiG...@li...
> > > > Subject: [Digir-dev] summary of proposals
> > > >
> > > >
> > > > As far as I understood there are several subjects we have to
> > > > think about:
> > > >
> > > >
> > > > 1) Logical operators supported by DiGIR.
> > > >  I also think we should adopt just the three basic logical
> > > > operators AND, OR, and NOT instead of all the different binary
> > > > operators possible. Is there any reason to implement only
> > > > binary operators (possibly due to some XML restrictions
> > > > with substitutable elements)?
> > > >
> > > >
> > > >
> > > > 2) Query mechanism
> > > >  So far I see two (three) major choices:
> > > >  a) Sticking to substitutable elements, which means we can
> > > > only query global elements.
> > > > This results in different query/response schemas, that all
> > > > have to be mapped to a DB by a provider. A minimum of two
> > > > schemas (one for the query, one for the response) has to be
> > > > configured (mapped to the DB). The concepts used in both
> > > > schemas do not have to be the same, so that the linkage of
> > > > the query and the response relies upon the provider's schema
> > > > configurations.
> > > >
> > > >  b) Using XPath or XQuery.
> > > > This eliminates the need to separate the query and
> > > > response schemas, as these methods can be used with highly
> > > > structured XML documents. Any number of supported schemas
> > > > (concepts) can be configured by a provider, but for ease
> > > > of use a single one (ABCD) would be enough for the system to work.
> > > > As introducing XPath or XQuery would mean a severe change to
> > > > the existing protocol, speaking for the BioCASE project this
> > > > probably means a long wait until a stable DiGIR protocol would
> > > > be available to be implemented. Considering the greater
> > > > complexity of XQuery and its unstable state, my favorite
> > > > solution would be to use XPath to point to a
> > > > "queried" element and use the DiGIR protocol to implement the
> > > > comparison and logical operators as it does now.
> > > > Thus it would be still possible to submit a query based on
> > > > DarwinCore, but retrieve the information as an ABCD document.
> > > > But one would also be able to use the ABCD schema without
> > > > modifications for the query and the response.
> > > >
> > > > So the choice between XPath and substitutable
> > > > elements also influences the need for separate
> > > > query/response schemas.
> > > >
> > > >
> > > > 3) License
> > > > From what I learned in the discussion so far, I would prefer
> > > > a BSD style license. Even if companies like Microsoft have
> > > > used code without contributing to the community, there are
> > > > other examples that do contribute (Apple, for example, is
> > > > supporting open source in a large way now, because their new
> > > > OS is built on BSD Unix).
> > > >
> > > >
> > > > 4) Requesting subsets of the response schema.
> > > > If a provider only supports complex schemas like ABCD for
> > > > responses, it would be nice and a lot faster to have a
> > > > mechanism to return only a subset of the whole schema. If
> > > > someone is interested in a high number of records, but only
> > > > needs the taxon name (e.g. for statistics), it would be pretty
> > > > slow to return all the information available for all
> > > > requested records. I haven't thought about this for long, but
> > > > maybe we could optionally pass an XSL stylesheet (or a URL to
> > > > it) with the request, so that the resulting large XML
> > > > document can be transformed on the server into a much
> > > > smaller document. It would still be time and resource
> > > > consuming to produce the large files in the first place and
> > > > then transform them into smaller ones, but at least we would
> > > > save a lot of network traffic.
> > > >
> > > >
> > > >
> > > > Markus
> > > >
> > > > --
> > > >  Markus Döring
> > > >  Botanic Garden and Botanical Museum Berlin Dahlem,
> > > >  Dept. of Biodiversity Informatics
> > > >  Königin-Luise-Str. 6-8, D-14191 Berlin
> > > >  Phone: +49 30 83850-284
> > > >  Email: m.d...@bg...
> > > >  URL: http://www.bgbm.org/
> > > >
> > > >
> > > >
> > > > -------------------------------------------------------
> > > > This sf.net email is sponsored by:ThinkGeek
> > > > Welcome to geek heaven.
> > > > http://thinkgeek.com/sf
> > > > _______________________________________________
> > > > DiGIR-developers mailing list
> > > > DiG...@li...
> > > > https://lists.sourceforge.net/lists/listinfo/digir-developers
> > > >
> > >
> > >
> >
> >
> >