[Ringing-lib-discussion] XML Format

Hello there.

The recent posting to change-ringers about methods.ringing.org has
prompted me to join the fray regarding the use of XML in computational
campanology. My personal bias is toward the use of Java (and my Java
ringing class library) and XSLT processing of methods data. A quick
email to Martin Bright came up with the suggestion that I get things
going here. So here goes...

Using the Xerces Java parser (version 2.6.2) and the current method.xsd
schema with full validation turned on, I get the following schema
errors:

[Error] method.xsd:78:36: InvalidRegex: Pattern value
'(([-xX]|[A-HJ-NP-WYZa-hj-np-wyz0-9]+)\.?)*' is not a valid regular
expression. The reported error was: ''-' is an invalid character range.
Write '\-'.'.
[Error] method.xsd:150:46: src-resolve: Cannot resolve the name
'xlink:type' to a(n) 'attribute declaration' component.
[Error] method.xsd:150:46: s4s-elt-invalid-content.1: The content of
'linkedPerformanceType' is invalid.  Element 'attribute' is invalid,
misplaced, or occurs too often.

These can be resolved by modifying the following elements:
  <simpleType name="pnType">
    <restriction base="xsd:string">
      <pattern value="(([\-xX]|[A-HJ-NP-WYZa-hj-np-wyz0-9]+)\.?)*"/>
                         ^add this escape character
    </restriction>
  </simpleType>

  <complexType name="linkedPerformanceType">
    <complexContent>
      <extension base="m:performanceType">
        <attribute ref="xlink:show" default="none"/>
                              ^replace path with a known xlink reference
      </extension>
    </complexContent>
  </complexType>

This second change is a bit of a fudge just to get rid of the schema
errors. It's not clear what the xlink:type attribute is trying to
achieve, and finding a standard schema for xlink was not easy, so I
would question its usefulness overall. That's a debate for later,
though, as I have other more pressing comments.
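
As a rough sanity check of the escaped pnType pattern, here is a minimal
Java sketch. It does not load the real method.xsd: the schema string is a
cut-down stand-in containing only pnType, and the <pn> element, the class
name and the sample place notations are all made up for illustration. It
uses the JAXP validation API bundled with the JDK rather than a standalone
Xerces 2.6.2, so it only shows that the escaped form compiles and behaves
as intended, not that the unescaped form is rejected everywhere.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class PnPatternCheck {
    // Cut-down stand-in for method.xsd: pnType plus a hypothetical <pn> element.
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "  <xsd:simpleType name='pnType'>"
        + "    <xsd:restriction base='xsd:string'>"
        + "      <xsd:pattern value='(([\\-xX]|[A-HJ-NP-WYZa-hj-np-wyz0-9]+)\\.?)*'/>"
        + "    </xsd:restriction>"
        + "  </xsd:simpleType>"
        + "  <xsd:element name='pn' type='pnType'/>"
        + "</xsd:schema>";

    // Returns true if the given place notation validates against pnType.
    static boolean isValidPn(String pn) throws SAXException, IOException {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // A schema error (such as an invalid regex) would surface here.
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(
                new StreamSource(new StringReader("<pn>" + pn + "</pn>")));
            return true;
        } catch (SAXException e) {
            // The default error handling throws on any validation error.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValidPn("x38x14x58x16"));
        System.out.println(isValidPn("x38?14"));
    }
}
```

The sample inputs are plain ASCII, so they can be dropped into the
<pn> element without any XML escaping; real input would need escaping.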

Running a query returns a document with the following structure:

<methods xmlns="http://methods.ringing.org/NS/method"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:db="http://methods.ringing.org/NS/database"
         version="0.1" db:page="0" db:pagesize="100" db:rows="7">
  <method xmlns:a="http://methods.ringing.org/NS/method"
          xmlns:default="http://methods.ringing.org/NS/method"
          xmlns:sql="http://boojum.org.uk/NS/XMLServer" id="m15469">
    <name>Yorkshire</name>
...
    <meta>
      <db:timestamp>2005-05-12T13:24:54</db:timestamp>
    </meta>
  </method>
....

Wow, so many namespaces! On further analysis, those on the <method> tag
are redundant, as the document will parse without them. However, I
believe the use of the id attribute is incorrect.
The supporting text for the schema states:
"This contains a unique ID for the method within the database."
My understanding of XML IDs (which pre-date XML Schema) is supported by
the following quote from the XML Schema definition: "the scope of an ID
is fixed to be the whole document".

My first point here is that if you wish to have a unique database
reference, it should not be of type ID but simply a number or whatever,
as its scope goes beyond the document generated as the result of a
query.
I have used IDs (and corresponding IDREFs) in those cases where a
relationship needs to be expressed between two elements in an XML
document that doesn't fall into the standard hierarchical model provided
by the document (e.g. networks of nodes and the links between them).
In a ringing context I could see them being used in a document that
contains method definitions and touches. Several touches could "point"
to the same method definition (via an ID) using an IDREF.
In such a case (as in a network definition) the actual value of the ID
is irrelevant and can be generated on the fly for that document
instance. The only constraint is that all IDs are unique within that
document. For me the rule of thumb is: don't use an ID if you don't have
a corresponding IDREF to refer to it.
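
To make the touches-pointing-at-methods idea concrete, here is a small
Java sketch. The <collection> and <touch> elements, the "method"
reference attribute, the ids m1/m2 and the second method name are all
invented for illustration; nothing like them is in the current schema.
The IDs are resolved with a plain map rather than through a DTD or
schema, which is enough to show that their values only need to be unique
within the one document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IdRefDemo {
    // Hypothetical document: two method definitions and three touches.
    static final String DOC =
          "<collection>"
        + "<method id='m1'><name>Yorkshire</name></method>"
        + "<method id='m2'><name>Cambridge</name></method>"
        + "<touch method='m1'/><touch method='m1'/><touch method='m2'/>"
        + "</collection>";

    // Resolves each touch's reference to the name of the method it points at.
    static List<String> touchMethodNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        // Index methods by id, enforcing uniqueness within this document only.
        Map<String, String> namesById = new HashMap<>();
        NodeList methods = doc.getElementsByTagName("method");
        for (int i = 0; i < methods.getLength(); i++) {
            Element m = (Element) methods.item(i);
            String id = m.getAttribute("id");
            String name = m.getElementsByTagName("name").item(0).getTextContent();
            if (namesById.put(id, name) != null) {
                throw new IllegalStateException("duplicate ID: " + id);
            }
        }

        // Follow each reference; a dangling one is an error.
        List<String> result = new ArrayList<>();
        NodeList touches = doc.getElementsByTagName("touch");
        for (int i = 0; i < touches.getLength(); i++) {
            String ref = ((Element) touches.item(i)).getAttribute("method");
            String name = namesById.get(ref);
            if (name == null) {
                throw new IllegalStateException("dangling reference: " + ref);
            }
            result.add(name);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(touchMethodNames(DOC));
    }
}
```

Renaming m1 and m2 to any other unique strings changes nothing in the
result, which is the sense in which the ID values are irrelevant.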

The second point is one of separation of concerns: the id attribute you
propose relates to a reference in your database; it is not a fundamental
attribute of a method.
I therefore believe it should be replaced by a more general mechanism
allowing a method definition to include an annotation for a
database-specific id, for example within the <meta> tag.

Having suggested this, I don't particularly like the <meta> tag being
part of a method definition either. It is noise; it doesn't contain any
information I would want to query a method database for. In a similar
way, the <methods> tag as it currently stands also has "noisy"
attributes with the db: namespace prefix. I think a mechanism for
allowing a list of methods is required, but it is more general and
should not be encumbered with attributes for one specific query
mechanism. This would allow other sites to provide lists of methods in a
standard way.

If you feel that query information is necessary, then it should be
provided by a separate wrapper tag:

<db:results xmlns:db="http://methods.ringing.org/NS/database"
            db:page="0" db:pagesize="100" db:rows="7">
   <methods xmlns="urn:cccbr-org-uk:methods-1.0">
      <method>
      ...
      </method>
   </methods>
</db:results>
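
One consequence of the wrapper approach, sketched in Java: a consumer
that only cares about methods can pick out the <method> elements by
namespace and never look at the query wrapper at all. The class and
method names below are illustrative only, and the two sample documents
are made up; the namespace URIs are the ones used in the example above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WrapperDemo {
    static final String METHODS_NS = "urn:cccbr-org-uk:methods-1.0";

    // The same list of methods, with and without the query wrapper.
    static final String BARE =
          "<methods xmlns='" + METHODS_NS + "'>"
        + "<method><name>Yorkshire</name></method>"
        + "<method><name>Cambridge</name></method>"
        + "</methods>";
    static final String WRAPPED =
          "<db:results xmlns:db='http://methods.ringing.org/NS/database'"
        + " db:page='0' db:pagesize='100' db:rows='2'>"
        + BARE
        + "</db:results>";

    // Counts <method> elements in the methods namespace, ignoring any wrapper.
    static int countMethods(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for namespace-based lookup
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagNameNS(METHODS_NS, "method").getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countMethods(BARE));
        System.out.println(countMethods(WRAPPED));
    }
}
```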

I would also propose that the version attribute of the <methods> tag
should be eliminated and version identification be incorporated into the
namespace URI. This would make it very much easier to build a system
that can parse either version 1 or version 2 documents using standard
entity resolver parsing techniques.

Over the years my preference has evolved towards the use of URNs as
namespace URIs rather than URLs, as with the latter parsers have been
known to hit the named site when trying to resolve references. Using
URNs makes it clear that they are just a formal naming convention.
The <db:results> example above brings some of these suggestions
together.
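
As a sketch of what versioning through the namespace URI buys you, the
Java below inspects the namespace of the root element and dispatches on
it. The "methods-2.0" URN is purely hypothetical (only the 1.0 form
appears in this message), and the class and method names are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class VersionDispatch {
    static final String NS_V1 = "urn:cccbr-org-uk:methods-1.0";
    static final String NS_V2 = "urn:cccbr-org-uk:methods-2.0"; // hypothetical

    // Returns "1", "2" or "unknown" based solely on the root namespace URI.
    static String detectVersion(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // otherwise getNamespaceURI() is null
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        String ns = doc.getDocumentElement().getNamespaceURI();
        if (NS_V1.equals(ns)) {
            return "1";
        }
        if (NS_V2.equals(ns)) {
            return "2";
        }
        return "unknown";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detectVersion("<methods xmlns='" + NS_V1 + "'/>"));
        System.out.println(detectVersion("<methods xmlns='" + NS_V2 + "'/>"));
    }
}
```

Because a URN is never dereferenced, nothing in this sketch touches the
network; the URI is compared as an opaque string.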

The schema currently allows "nillable" fields. E.g.
"<firsthand xsi:nil="true"/> indicates that the method has never been
rung to a peal on handbells."
I have never needed to use this feature, as the three common states for
a field are easily covered by the following much simpler syntax:
    <tag>value</tag>      - a value is present for "tag"
    <tag></tag> or <tag/> - an empty value exists for "tag" (e.g. a
                            zero-length string)
    no tag element        - "tag" does not exist (i.e. is nil)
This keeps the generated document much simpler for humans to read and
for programs to parse: there is no need to declare the xsi namespace
(which clutters the document); absent fields are just that, absent,
keeping the document readable; and when parsing something like
<firsthand xsi:nil="true"/> I would have to write extra code in a (SAX)
parser to spot the xsi:nil attribute as a special case and process the
tag in a completely different manner to what I would do if it read
<firsthand date="2005-05-19"/>.
For me, this last case is the most compelling reason for not using
nillable fields.
Also, I have just done some experiments with XPath expressions in an
XSLT stylesheet, where I selected all <firsthand> nodes for processing
[e.g. method/performances/firsthand] and found the results included
those with xsi:nil="true" as well as those without.
This means extra coding would have to be added to stylesheets if nil
fields were to be ignored, and it would have to be done for each
nillable field. I think the simple absence of a tag indicating a nil
value is much more intuitive and simpler to process.
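
The XPath behaviour described above can be reproduced outside a
stylesheet with the JDK's javax.xml.xpath API. This is only a sketch: the
sample document is invented, it deliberately leaves out the default
methods namespace so that unprefixed element names work in XPath 1.0, and
the predicate matches xsi:nil through local-name() and namespace-uri()
to avoid having to register a prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NilXPathDemo {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    // Plain selection: picks up nil and non-nil <firsthand> elements alike.
    static final String ALL = "/methods/method/performances/firsthand";

    // The extra predicate needed to skip elements marked xsi:nil="true".
    static final String NOT_NIL = ALL
        + "[not(@*[local-name()='nil' and namespace-uri()='" + XSI + "'] = 'true')]";

    // Invented sample: one nil <firsthand> and one carrying a date.
    static final String DOC =
          "<methods xmlns:xsi='" + XSI + "'>"
        + "<method><performances><firsthand xsi:nil='true'/></performances></method>"
        + "<method><performances><firsthand date='2005-05-19'/></performances></method>"
        + "</methods>";

    // Counts the nodes selected by the given XPath expression.
    static int count(String xml, String expr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for namespace-uri() to work
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double n = (Double) xpath.evaluate(
            "count(" + expr + ")", doc, XPathConstants.NUMBER);
        return n.intValue();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count(DOC, ALL));
        System.out.println(count(DOC, NOT_NIL));
    }
}
```

If nil were instead expressed by simply omitting <firsthand>, the plain
expression would select only the element with a date and the predicate
would be unnecessary.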

Here endeth the first lesson. I hope I haven't preached too much. I fear
another sermon is likely to follow regarding some aspects of method
definition, but I wanted to get the discussion going on these aspects
first.

Regards,
Gary Howard