Re: [Ringing-lib-discussion] XML Format

Martin Bright wrote:

> > [Error] method.xsd:78:36: InvalidRegex: Pattern value
> > '(([-xX]|[A-HJ-NP-WYZa-hj-np-wyz0-9]+)\.?)*' is not a valid regular
> > expression. The reported error was: ''-' is an invalid character range.
> > Write '\-'.'.
>
> Many regex parsers would allow this.  The relevant standard (Appendix
> F.1 of XML Schema Part 2) is contradictory:
>
> >       * The [, ], - and \ characters are not valid character ranges;

I'm inclined to assume that the inclusion of '-' in this
list is a mistake.  Given the third bullet point, it simply
doesn't make sense.  I've emailed the relevant W3C mailing
list: let's see if they can clarify things.

> >       * The ^ character is only valid at the beginning of a ·positive
> >         character group· if it is part of a ·negative character group·
> >       * The - character is a valid character range only at the
> >         beginning or end of a ·positive character group·.
>
> I suppose there's no harm in putting the backslash in to make it work.

Agreed.  I've now done this.
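
For anyone who wants to check the escaped form outside a schema validator, here is a rough sketch in Python.  It is only an approximation: Python's re syntax is not the XML Schema regex dialect, and the sample strings are my own guesses at typical values rather than anything taken from method.xsd.  XML Schema patterns are implicitly anchored, so fullmatch() is the nearest equivalent.

```python
import re

# The pattern from the error message, with the '-' escaped as the
# validator asked.  Python accepts both the escaped and unescaped forms.
PATTERN = re.compile(r"(([\-xX]|[A-HJ-NP-WYZa-hj-np-wyz0-9]+)\.?)*")


def matches_pattern(text):
    """Return True if the whole string matches, as an XSD pattern facet would require."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to exercise the pattern.
    for sample in ["x16x16x16x12", "-16-16", "3.1.5", "x1I"]:
        print(sample, matches_pattern(sample))
```

The escaped hyphen changes nothing about which strings are accepted; it only placates parsers that read the standard strictly.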

> > [Error] method.xsd:150:46: src-resolve: Cannot resolve the name
> > 'xlink:type' to a(n) 'attribute declaration' component.
> > [Error] method.xsd:150:46: s4s-elt-invalid-content.1: The content of
> > 'linkedPerformanceType' is invalid.  Element 'attribute' is invalid,
> > misplaced, or occurs too often.
>
> I think maybe the fault here lies with whatever schema you're using for
> XLink -- that should be a valid attribute declaration.

I've put a suitable XLink schema here:

  http://www.ex-parrot.com/~richard/schemas/xlink.xsd

Can you try using that one and tell us whether you still
have problems?

> > Finding a standard schema for xlink was not easy either
> > so I would question its usefulness overall but that's a
> > debate for later as I have other more pressing comments.

There's a good reason for this.  XLink does not (currently)
have a normative schema, but I don't see why this should be
an issue -- it's easy enough to provide one.

Having said that, I'm not one of XLink's greatest fans, and
if you have an alternative suggestion, I'd be interested to
hear it.

> > Wow, so many namespaces! On further analysis, those on the <method> tag
> > are redundant
>
> Yes, I know.  It's an issue with DBIx::XMLServer.  It's quite a long way
> down the list of priorities to fix, though, because the extra
> declarations make no semantic difference to the document.

Sorting this is complicated.  Whilst libxml2 has a
clean_namespaces() method, this only removes duplicate
namespace declarations, not unused ones.  As there are no
duplicate namespaces, this doesn't help.  (Finding out which
namespace declarations are unused is a difficult problem.
How do you tell, in general, whether a text node or an attribute
value contains a QName or something else sensitive to
namespace bindings, such as an XPath expression?)
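
To make the difficulty concrete, here is a small sketch using only Python's standard library (not libxml2 or DBIx::XMLServer).  It lists the prefixes that are never used in an element or attribute name.  The function name, the namespace URIs and the sample document are all made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def apparently_unused_prefixes(xml_text):
    """Return prefixes declared but never used in an element or attribute name.

    This is deliberately naive: it cannot see prefixes used inside
    attribute values or text (QNames, XPath expressions, ...).
    """
    declared = []  # (prefix, uri) pairs, in document order
    used = set()   # namespace URIs seen in element/attribute names
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            declared.append(payload)  # payload is (prefix, uri)
        else:
            # ElementTree spells qualified names as "{uri}local".
            for name in [payload.tag, *payload.attrib]:
                if name.startswith("{"):
                    used.add(name[1:].split("}", 1)[0])
    return {prefix for prefix, uri in declared if uri not in used}


if __name__ == "__main__":
    # Hypothetical document: 'xsd' is needed by the type attribute's
    # value, yet it never appears in an element or attribute name.
    doc = (
        '<m:method xmlns:m="urn:example:methods"'
        ' xmlns:xsd="http://www.w3.org/2001/XMLSchema"'
        ' xmlns:sql="urn:example:sql" type="xsd:string">'
        "<m:name>Little Bob</m:name></m:method>"
    )
    print(apparently_unused_prefixes(doc))
```

On this sample both sql and xsd are reported, although stripping the xsd declaration would break the value of the type attribute.  That is exactly the case a generic cleaner cannot detect without knowing the schema.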

That said, we should aim to sort this out eventually.

[Martin: The sql namespace can probably be suppressed with
an exclude-result-prefixes attribute in xmlout.xsl.]

> > The second point is one of separation of concerns: the id attribute you
> > propose relates to a reference in your database; it is not a fundamental
> > attribute of a method.
>
> No, but historically it's been common for almost everything in any XML
> document to have an ID attribute which contains some arbitrary ID.  This
> is an exception to the general rule that irrelevant data should be put
> in a different namespace.

Martin's comment has just reminded me of the W3C's xml:id
specification.

  http://www.w3.org/TR/xml-id/

The suggestion is that future schemas should put IDs in the
xml namespace.  At the moment I think doing this will break
more than it fixes, but we may wish to revisit this decision
in the future.  (In particular, the ability of parsers to
correctly access fragment identifiers -- i.e. #mXXXX on a
URL -- might break.)

> > Having suggested this, I don't particularly like the <meta> tag being
> > part of a method definition either. It is noise; it doesn't contain any
> > information I would want to query a method database for.

Martin has already responded to this, so I'm not going to,
except to say that all our search script currently puts in
the <meta> elt is a database timestamp.  This is something
for which I can easily imagine wanting to query the
database.  If I have a local (perhaps off-line) database, I
might want to regularly sync this to the server database.
Downloading just those methods changed since my previous
snapshot was created is an obvious way of doing this.

And if you don't like having this in the output, you can
always set the 'fields' parameter to specify which fields
you want.

  http://methods.ringing.org/query.html#fields

(Thinking about it, we might want to add a way of saying all
fields except those in a given list.)

> Instead of the <meta> element, we would have liked to allow arbitrary
> elements from other namespaces as direct children of the <method>
> element.  But it seems that XML Schema won't allow you to specify this -
> something to do with being a regular language.  RAS or DFM can no doubt
> expand.

Yes.  Basically we had a choice.  Either we could describe
the <method> element as an <xsd:sequence>, in which case we
would have been allowed to put elements from other
namespaces there.  Similarly we could have put the
performances and classification data directly there.  (They
can't be in the current schema as they can occur multiple
times.)  The cost of this flexibility is that we would have
had to specify an order for all of these elements, and the
XML would only have been valid if these elements were in the
right order.

The reason for this, as Martin says, is that an
<xsd:sequence> is effectively taken as a grammar for a
regular language, and keeping track of the number of times
elements have occurred rapidly becomes extremely difficult.
(The schema required grows combinatorially with the number
of elements.)
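
A rough way to see the growth: if a content model built from sequences and choices has to allow each child exactly once in any order, the obvious way to write it is one alternative per ordering.  The Python sketch below simply enumerates those alternatives.  The child names are hypothetical, and "exactly once" is a simplification of the real requirement (optional children make matters worse).

```python
from itertools import permutations


def any_order_alternatives(names):
    """List every ordering of the children; each one is a separate
    sequence that the content model would have to spell out."""
    return [" ".join(order) for order in permutations(names)]


if __name__ == "__main__":
    # Hypothetical child element names, for illustration only.
    children = ["name", "stage", "notation", "meta", "refs", "performances"]
    for n in range(1, len(children) + 1):
        count = len(any_order_alternatives(children[:n]))
        print(n, "children:", count, "alternatives")
```

The count is n factorial, so half a dozen children already need several hundred alternatives.  An <xsd:all> group expresses the same constraint in one line.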

The alternative, which is what we've decided to do, is to
describe it with an <xsd:all> element which allows child
elements to occur at most once.  It also does not allow
elements from other namespaces.  We get around these
restrictions by having container elements, such as the
<meta>, <refs>, <performances> and <classification>
elements.  I think we felt that this was preferable to having
an arbitrary order in which the child elements had to occur.

> > I would also propose that the version attribute of the <methods> tag
> > should be eliminated and version identification be incorporated into
> > the namespace URI. This would make it very much easier to build a
> > system that can parse either version 1 or version 2 documents using
> > standard entity resolver parsing techniques.

Namespaces aren't referenced via entity references so I
don't see how this is relevant.  (Or are you suggesting a
DTD that adds an implicit namespace declaration on the root
entity?  If so, I think this would be a very bad idea.)

I assume you're referring to the XML Catalog-like techniques
that can be used to select a schema for a namespace.  This
helps, but so long as you keep the XML Schema documents
backwards compatible (which should be easy as the
conceptual schemas need to be backwards compatible), this is
a non-issue.  Always using the most recent schema for the
namespace might result in lots of unnecessary schema for
unknown elements, but should otherwise be fine.

Finally, it should be remembered that in many real-world
applications, you don't actually use the schema -- it's
simply a piece of documentation on what is allowed in the
XML.  The parser presumably already knows this.

> There's a big difference between not telling you
> the date of the first peal and telling you that it definitely hasn't
> been pealed.

Of course, we're not actually saying it definitely hasn't
been pealed, we're saying that our database has no knowledge
of it being pealed.  This, of course, is not an absolute
statement, but is dependent on when the database was last
updated (and our ability to update and query the database
without cocking up, but let's ignore that).  Given that, it
*might* make more sense to use an attribute in the db
namespace, though given xsi:nil exists for exactly this sort
of thing, it's sensible to use it.

(We might want to add a global timestamp to signify when our
database was synced against its upstream data source
(currently the MC website).)

Thinking further, we *are* inconsistent in our use of
xsi:nil, as method names are handled differently from
anything else.  For a method name, there are three things
that I might want to convey in the XML:

 - the method is named, but has the null name (i.e. it is
   Little Bob);

 - according to the database, the method is unnamed; or

 - it is unspecified whether the method is named.

Currently, we use xsi:nil for the first case, ignore the
second case (we have no unnamed methods in the database at
the moment), and omit the element in the third case.

Really we should be able to distinguish all three cases.
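
As a sketch of what distinguishing all three might look like to a consumer, here is one possible encoding, read with Python's standard library.  This is NOT what the current schema does; it is only an assumption for illustration: an empty <name/> for the null name, xsi:nil="true" for an unnamed method, and no <name> element at all for unspecified.  The element names are left un-namespaced to keep the example short.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def name_status(method_xml):
    """Classify the <name> child of a <method> under the assumed encoding."""
    method = ET.fromstring(method_xml)
    name = method.find("name")
    if name is None:
        return "unspecified"          # element omitted entirely
    if name.get("{%s}nil" % XSI) in ("true", "1"):
        return "unnamed"              # explicitly nil
    return "named:" + (name.text or "")  # may be the empty (null) name


if __name__ == "__main__":
    # Hypothetical documents, one per case.
    print(name_status("<method><name/></method>"))
    print(name_status(
        '<method xmlns:xsi="%s"><name xsi:nil="true"/></method>' % XSI))
    print(name_status("<method/>"))
```

Whether xsi:nil should mean "unnamed" or "null name" is exactly the sort of thing we would need to agree on; the sketch only shows that three distinguishable states are available.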

RAS