From: Magnus L. H. <ma...@he...> - 2004-06-07 23:16:32

One of the technical things I'm pondering at the moment is the automatic placement of "fill" tokens when converting an Atox format to a (sort of) context-free grammar (as I've mentioned, I've been able to implement the fill behavior in itself, using the "follow-set" formalism of LL(1) parsers). The pause in development caused by this pondering has also let me reconsider the Atox format. I've considered several options; now, as in the past, it basically comes down to two different approaches:

 1. The current grammar-like sort of format
 2. An XSLT-like template format

The second type appeals to me because it would probably be quite easy to use, but I think it might be hard to implement if the underlying mechanism is some sort of standard parser (which is what I'm planning now). It could be that XSLT-like templates could be used as a grammar extension mechanism -- who knows.

The current kind of format is basically a relative of the kind of grammars you see in formal language theory -- and in XML schema languages...

The latter is an interesting point. I've previously been considering making the Atox format an extension of the DTD format or the W3C XML Schema language, but I've found them to be quite complex (and I'm sure there were several other reasons why I decided against this).

However: I just discovered *Relax NG* (http://www.relaxng.org). For those who don't know it, it's another XML schema language, developed by *OASIS* (http://www.oasis-open.org). It seems to have pretty solid backing, with well-known people behind the format itself (such as James Clark, who has even written an emacs mode that can perform Relax NG validation of a document natively).

The reason I'm excited by it is covered by its first two key features (as listed on relaxng.org): It's simple and it's easy to learn. Also, it "can partner with a separate datatyping language".

So: I'm thinking of returning to my idea of basing the Atox format on a schema language, namely Relax NG.

It is even possible that Relax NG itself (or a subset thereof) can be used, with a custom separate datatyping language for specifying text patterns (document structure is fully covered by Relax NG). A nifty thing about Relax NG is that it also has an alternative compact non-XML syntax, which is another thing I've contemplated (and implemented, in some of the prototypes) for Atox.

As a quick illustration, here is an example I've snipped from the two tutorials on the Relax NG site:

    <addressBook>
      <card>
        <name>John Smith</name>
        <email>js...@ex...</email>
      </card>
      <card>
        <name>Fred Bloggs</name>
        <email>fb...@ex...</email>
      </card>
    </addressBook>

This would normally be the input for a Relax NG processor -- and the output of Atox. Here is a Relax NG pattern for this, using the XML syntax:

    <element name="addressBook" xmlns="http://relaxng.org/ns/structure/1.0">
      <zeroOrMore>
        <element name="card">
          <element name="name">
            <text/>
          </element>
          <element name="email">
            <text/>
          </element>
        </element>
      </zeroOrMore>
    </element>

In a simple approach, all that would be needed is the addition of some form of pattern (regular expression) to each of the <text/> tags. Now here is the same pattern using the compact syntax:

    element addressBook {
      element card {
        element name { text },
        element email { text }
      }*
    }

I've really wanted to do this sort of thing for Atox, but I thought that going with a standard like XML for the format language was a Good Idea(tm). But since Relax NG is such a solid standard anyway, and it has an alternative XML syntax, I think this is an excellent candidate.

To get beyond the simplest block-level structures, we need some form of patterns (most likely regular expressions). How can these be integrated with Relax NG? To quote Sect. 5 from the tutorial on the compact format:

    RELAX NG allows patterns to reference externally-defined datatypes. RELAX NG implementations may differ in what datatypes they support.
    You can only use datatypes that are supported by the implementation you plan to use. The most commonly used datatypes are those defined by [W3C XML Schema Datatypes].

So: Atox could be a Relax NG implementation that supports application-specific datatypes. Again, quoting:

    A pattern consisting of a name qualified with a prefix matches a string that represents a value of a named datatype. The prefix identifies the library of datatypes being used and the rest of the name specifies the name of the datatype in that library. The prefix xsd identifies the datatype library defined by [W3C XML Schema Datatypes].

Further, datatypes may have parameters. An example using the standard W3C XML Schema datatype 'string' with minLength and maxLength:

    element email {
      xsd:string { minLength = "6" maxLength = "127" }
    }

This sort of construct makes it easy to conjure up a specific regexp-matching type for Atox (in fact, there are patterns in the W3C XML Schema datatypes -- I don't think they're entirely applicable here, but I'll have to look into it):

    element email {
      ax:string { pat = "[a-z]@([a-z]\.)+[a-z]+" }
    }

Or something like that. To define the datatype prefix 'ax' one would add something like

    datatypes ax = "http://hetland.org/atox"

to the beginning of the file.

I'm not 100% sure about the details of the datatype format, but since it's only a small part of the language, it should be possible to give it quite a lot of thought and still come to a conclusion. (As for parsing the compact format, there is a full grammar for it at the Relax NG web site, so that's not a problem.)

This will, of course, mean full incompatibility with previous versions, but that might have happened anyway with the transition to flex/bison (or something similar). By using something thoroughly standardized, there is some hope that future backward incompatibilities may be avoided.

So -- if anyone has actually read this far <wink>: Input would be appreciated.

-- 
Magnus Lie Hetland     "Wake up!"  - Rage Against The Machine
http://hetland.org     "Shut up!"  - Linkin Park
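
As a rough illustration of the ax:string idea from the message above, here is a minimal Python sketch of what a regexp-matching datatype might look like. Everything in it is an assumption made for the sake of the example: `ax_string` and `check_cards` are hypothetical names, not part of Atox or of any Relax NG implementation, the `pat` parameter is taken to mean "the whole text must match this regular expression", and the sample addresses are made up. The pattern itself is copied verbatim from the post, so it only accepts single-letter local parts and single-letter domain labels.

```python
import re
import xml.etree.ElementTree as ET

# The pattern from the post, verbatim. It is quite restrictive: one letter
# before the "@", then one-letter labels followed by dots, then letters.
EMAIL_PAT = r"[a-z]@([a-z]\.)+[a-z]+"


def ax_string(pat):
    """Return a checker for a hypothetical ax:string datatype.

    Assumption: 'pat' has to match the whole text, not just a prefix.
    """
    compiled = re.compile(pat)

    def check(text):
        return compiled.fullmatch(text) is not None

    return check


def check_cards(xml_source, datatypes):
    """Check the text of each card's children against per-element checkers.

    'datatypes' maps an element name to a checker; elements without an
    entry are treated like plain 'text' and are always accepted.
    Returns a list of (element name, text, accepted) tuples.
    """
    root = ET.fromstring(xml_source)
    results = []
    for card in root.findall("card"):
        for child in card:
            text = child.text or ""
            check = datatypes.get(child.tag, lambda _text: True)
            results.append((child.tag, text, check(text)))
    return results


# Made-up sample data in the shape of the address book example.
SAMPLE = """
<addressBook>
  <card><name>A</name><email>a@b.com</email></card>
  <card><name>B</name><email>not an address</email></card>
</addressBook>
"""

if __name__ == "__main__":
    for tag, text, ok in check_cards(SAMPLE, {"email": ax_string(EMAIL_PAT)}):
        print(tag, repr(text), "ok" if ok else "rejected")
```

The sketch only checks already-structured XML; it says nothing about how Atox would use such patterns to find structure in plain text, which is the harder part discussed in the post.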