[Atox-user] Possible future Atox format (long)

One of the technical things I'm pondering at the moment is the
automatic placement of "fill" tokens when converting an Atox format to
a (sort of) context-free grammar. (As I've mentioned, I've been able
to implement the fill behavior itself, using the "follow-set"
formalism of LL(1) parsers.) The pause in development caused by this
pondering has also let me reconsider the Atox format. As in the past,
I keep coming back to two different approaches:

  1. The current grammar-like sort of format

  2. An XSLT-like template format

The second type appeals to me because it would probably be quite easy
to use, but I think it might be hard to implement if the underlying
mechanism is some sort of standard parser (which is what I'm planning
now). It could be that XSLT-like templates could be used as a grammar
extension mechanism -- who knows.

The current kind of format is basically a relative of the kind of
grammars you see in formal language theory -- and in XML schema
languages... The latter is an interesting point. I've previously been
considering making the Atox format an extension of the DTD format or
the W3C XML Schema language, but I've found them to be quite complex
(and I'm sure there were several other reasons why I decided against
this).

However: I just discovered *Relax NG* (http://www.relaxng.org). For
those who don't know it, it's another XML schema language, developed
by *OASIS* (http://www.oasis-open.org). It seems to have pretty solid
backing, with well-known people behind the format itself (such as
James Clark, who has even written an Emacs mode that can perform Relax
NG validation of a document natively). The reason I'm excited by it is
covered by its first two key features (as listed on relaxng.org): It's
simple and it's easy to learn. Also, it "can partner with a separate
datatyping language".

So: I'm thinking of returning to my idea of basing the Atox format on
a schema language, namely Relax NG. It is even possible that Relax NG
itself (or a subset thereof) can be used, with a custom separate
datatyping language for specifying text patterns (document structure
is fully covered by Relax NG).

A nifty thing about Relax NG is that it also has an alternative
compact non-XML syntax, which is another thing I've contemplated (and
implemented, in some of the prototypes) for Atox.

As a quick illustration, here is an example I've snipped from the two
tutorials on the Relax NG site:

  <addressBook>
    <card>
      <name>John Smith</name>
      <email>js...@ex...</email>
    </card>
    <card>
      <name>Fred Bloggs</name>
      <email>fb...@ex...</email>
    </card>
  </addressBook>

This would normally be the input for a Relax NG processor -- and the
output of Atox.

Here is a Relax NG pattern for this, using the XML syntax:

  <element name="addressBook"
           xmlns="http://relaxng.org/ns/structure/1.0">
    <zeroOrMore>
      <element name="card">
        <element name="name">
          <text/>
        </element>
        <element name="email">
          <text/>
        </element>
      </element>
    </zeroOrMore>
  </element>

In a simple approach, all that would be needed is the addition of some
form of pattern (regular expression) to each of the <text/> tags.
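
To make the "simple approach" a bit more concrete, here is a minimal
Python sketch of what attaching a regular expression to each text
pattern could buy us. Everything in it is an assumption made for
illustration: the plain-text input layout (alternating name and email
lines), the two regexps, the sample addresses and the function name
are all made up, and this is not how Atox currently works.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-element patterns; in the idea above these would be
# attached to the <text/> patterns of the "name" and "email" elements.
CARD_FIELDS = [
    ("name", re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")),
    ("email", re.compile(r"\S+@\S+")),
]


def parse_address_book(text):
    """Turn alternating name/email lines into an addressBook element."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % len(CARD_FIELDS) != 0:
        raise ValueError("incomplete card at end of input")
    book = ET.Element("addressBook")
    for start in range(0, len(lines), len(CARD_FIELDS)):
        card = ET.SubElement(book, "card")
        # zip() stops after one line per field, i.e. one card's worth.
        for (tag, pattern), line in zip(CARD_FIELDS, lines[start:]):
            if not pattern.fullmatch(line):
                raise ValueError(f"{line!r} does not match <{tag}>")
            ET.SubElement(card, tag).text = line
    return book


# Made-up sample input (placeholder addresses).
SAMPLE = """\
John Smith
john@example.org
Fred Bloggs
fred@example.org
"""

book = parse_address_book(SAMPLE)
print(ET.tostring(book, encoding="unicode"))
```

The structure (which element goes where) is hard-coded here; in the
scheme discussed in this message it would come from the Relax NG
pattern, and only the regexps would be the Atox-specific part.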

Now here is the same pattern using the compact syntax:

  element addressBook {
    element card {
      element name { text },
      element email { text }
    }*
  }
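
For what it's worth, a pattern like this maps quite directly onto a
small data structure. The following Python sketch models only the
handful of constructs used in the example (element, text, a sequence
of children, and zero-or-more) and checks a document against it. The
class names and the greedy matching strategy are my own assumptions
for illustration; real Relax NG semantics are richer than this, and
the sample address is a placeholder.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Text:
    """Matches text-only content (no child elements)."""


@dataclass
class Element:
    name: str
    content: object  # Text, or a list of Element / ZeroOrMore patterns


@dataclass
class ZeroOrMore:
    pattern: Element


def matches(pattern, node):
    """Check an ElementTree node against an Element pattern (greedily)."""
    if node.tag != pattern.name:
        return False
    if isinstance(pattern.content, Text):
        return len(node) == 0
    children = list(node)
    pos = 0
    for item in pattern.content:
        if isinstance(item, ZeroOrMore):
            # Consume as many matching children as possible.
            while pos < len(children) and matches(item.pattern, children[pos]):
                pos += 1
        elif pos < len(children) and matches(item, children[pos]):
            pos += 1
        else:
            return False
    # Every child must have been accounted for.
    return pos == len(children)


# The address book pattern from the compact syntax example.
ADDRESS_BOOK = Element("addressBook", [
    ZeroOrMore(Element("card", [
        Element("name", Text()),
        Element("email", Text()),
    ])),
])

good = ET.fromstring(
    "<addressBook><card><name>John Smith</name>"
    "<email>john@example.org</email></card></addressBook>"
)
bad = ET.fromstring(
    "<addressBook><card><name>John Smith</name></card></addressBook>"
)
print(matches(ADDRESS_BOOK, good))  # True
print(matches(ADDRESS_BOOK, bad))   # False
```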

I've really wanted to do this sort of thing for Atox, but I thought
that going with a standard like XML for the format language was a Good
Idea(tm). But since Relax NG is such a solid standard anyway, and it
has an alternative XML syntax, I think this is an excellent candidate.

To get beyond the simplest block-level structures, we need some form
of patterns (most likely regular expressions). How can these be
integrated with Relax NG?

To quote Sect. 5 from the tutorial on the compact format:

  RELAX NG allows patterns to reference externally-defined datatypes.
  RELAX NG implementations may differ in what datatypes they support.
  You can only use datatypes that are supported by the implementation
  you plan to use. The most commonly used datatypes are those defined
  by [W3C XML Schema Datatypes].

So: Atox could be a Relax NG implementation that supports
application-specific datatypes.

Again, quoting:

  A pattern consisting of a name qualified with a prefix matches a
  string that represents a value of a named datatype. The prefix
  identifies the library of datatypes being used and the rest of the
  name specifies the name of the datatype in that library. The prefix
  xsd identifies the datatype library defined by [W3C XML Schema
  Datatypes].

Further, datatypes may have parameters. An example using the standard
W3C XML Schema datatype 'string' with minLength and maxLength:

  element email {
    xsd:string { minLength = "6" maxLength = "127" }
  }

This sort of construct makes it easy to conjure up a specific
regexp-matching type for Atox (in fact, there are patterns in the W3C
XML Schema datatypes -- I don't think they're entirely applicable
here, but I'll have to look into it):

  element email {
    ax:string { pat = "[a-z]+@([a-z]+\.)+[a-z]+" }
  }

Or something like that. To define the datatype prefix 'ax' one would
add something like

  datatypes ax = "http://hetland.org/atox"

to the beginning of the file.
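
On the implementation side, a datatype library keyed by that URI
could be very small. Here is a rough Python sketch, under the
assumption that each datatype is a factory taking its parameters and
returning a checker function. The registry layout, the function names
and the test strings are hypothetical; the regexp is only an
illustrative address pattern, and requiring the whole string to match
is a design choice I'm assuming, not something Relax NG dictates.

```python
import re

AX_NAMESPACE = "http://hetland.org/atox"  # URI from the declaration above


def ax_string(pat):
    """Return a checker for a hypothetical ax:string { pat = "..." } type."""
    compiled = re.compile(pat)

    def matches(value):
        # The whole string must match, not just a prefix.
        return compiled.fullmatch(value) is not None

    return matches


# Hypothetical registry: library URI -> datatype name -> factory.
DATATYPE_LIBRARIES = {AX_NAMESPACE: {"string": ax_string}}


def make_checker(library_uri, type_name, **params):
    """Look up a datatype by library URI and name, applying its parameters."""
    return DATATYPE_LIBRARIES[library_uri][type_name](**params)


is_email = make_checker(AX_NAMESPACE, "string",
                        pat=r"[a-z]+@([a-z]+\.)+[a-z]+")
print(is_email("magnus@example.org"))  # True
print(is_email("not an address"))      # False
```

The prefix ('ax') never shows up in the code; it is only a local
alias for the URI in the format file, which is what the registry is
keyed on.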

I'm not 100% sure about the details of the datatype format, but since
it's only a small part of the language, it should be possible to give
it quite a lot of thought and still come to a conclusion. (As for
parsing the compact format, there is a full grammar for it at the
Relax NG web site, so that's not a problem.)

This will, of course, mean full incompatibility with previous
versions, but that might have happened anyway with the transition to
flex/bison (or something similar). By using something thoroughly
standardized, there is some hope that future backward
incompatibilities may be avoided.

So -- if anyone has actually read this far <wink>: Input would be
appreciated.

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park