[exprla-devel] Re: [xpl] SOME IDEAS
Status: Pre-Alpha
Brought to you by:
xpl2
|
From: reid_spencer <ras...@re...> - 2002-01-31 07:20:00
|
--- In xpl-dev@y..., "Garreth Galligan" <garreth.galligan@o...> wrote:
Well I agree with pretty much everything Jonathan has said,
particularly 12.1 and 12.2 given the group's current lack of low-
level engineering experience (myself included). May as well approach
something we have a realisitc chance of finishing.
> AN XPL DOCUMENT TYPE IS A PROGRAMMING LANGUAGE,
> AND ITS DTD IS ITS GRAMMAR. DOCUMENTS OF ONE
> XPL TYPE ARE PROGRAMS IN ONE XPL LANGUAGE.
Very good :) Now the X in XPL is defined. I think this is the nearest
we have got to a 'mission statement' of definition for XPL. Anyone?
A first possible step would be to compile a list of reference URLs
related to some of the issues raised here. That way we can all taken
a dip in a common knowledge-base.
----- Original Message -----
From: Jonathan Burns
To: xpl@e...
Sent: Saturday, April 29, 2000 2:43 AM
Subject: Re: [xpl] SOME IDEAS
Michael Lauzon wrote:
> I have made some XPL tags, which when I have time I'll
> upload them as a text file to our groups file site on Onelist.
> We can go from there,though this is a good start...also remember
> XPL is supposed to be a compiled language, so if anyone knows
> how to program in Java (while also using the XML parser [for
> Java] from James Clark; http://www.jclark.com/) would be
> helpful. :)
Michael, are there working documents, any notes or old
email, on how XPL was conceived? Are we going to be the
ones who define the original charter? I ask because I'll
be injecting some weird stuff below, and I don't want to
be trying to hijack the group against its real intent.
(Later...) I've scanned your proposed tags. They're enough
to prototype an algorithmic language (like C, Pascal, Basic);
right now that's what we need.
On with the show...
THOUGHTS ON PARSING XPL(s)
Thanks to Robert Hopkins, for prototyping some XML-ized
Basic-like source code (dubbed "XBasic" here). I think
it throws a lot of major issues into relief...
1. First and fundamentally, XBasic is meant to be parsed.
There will be an XBasic Grammar, in the form of a DTD,
and an XBasic Document Parser, say in the form of a Java
module.
The Grammar determines whether a block of text is truly
XBasic. This lets us guarantee that only correct XBasic
ever gets past the Parser.
2. There is an intimate relationship between XML document types
and programming languages: both are defined by grammars.
This is a very strong condition, and it provides enormous power.
All the flexibity of programming languages, and of XML document
types, is due to it.
3. The XBasic example implies:
AN XPL DOCUMENT TYPE IS A PROGRAMMING LANGUAGE,
AND ITS DTD IS ITS GRAMMAR. DOCUMENTS OF ONE
XPL TYPE ARE PROGRAMS IN ONE XPL LANGUAGE.
To me, this is the heart of XPL.
(So now I've clubbed you over the head with that ...)
4. In computer science tradition, XPL document types or
DTDs are called "regular languages". There is a huge body
of published results on "regular languages and automata".
The main users of these results are compiler programmers,
and so every text in compiler design begins with an
introduction to regular languages, and how to define them
in Backus-Naur Form (BNF). BNF has been slightly expanded,
and integrated with lexical scanning, resulting in
Extended BNF (EBNF), in which DTDs are written.
5. What is an "automaton"? It is a finite state machine
which parses input according to a grammar. More obviously,
it is a lookup table, in which a parser looks up
{present machine state, incoming token} -> {next machine state}
And the point of this is, that the particular entries
in the lookup table are exactly what the grammar requires;
so that a correct sequence of tokens (e.g. XBasic elements)
causes the machine to pile up more and more recognized
elements, until it arrives at a state called, say,
"<XBasic-Parser-State:XPL-Document>"
at which it has recognized the "total document element",
so to speak; whereas an incorrect sequence of tokens
will lead it to a state called
"<XBasic-Parser-State:Error-condition> (error-number)"
for one error number or another.
<footnote-mark> * </footnote-mark>
6. What are these "tokens"? Well, they are integers.
In a BNF parser, the elements of the grammar(usually
beginning with characters, punctuation marks and strings)
form a numbered list. Each element is known internally
by its token, which is its number in the list.
When the automaton is in one state and waiting for input,
the input comes as a token, which determines which particular
element is incoming.
(The content of the element is available at this time,
but that is irrelevant to which element type is being
tested. And it's the type that determines grammatical
correctness.)
7. The key result in the theory of regular languages is,
FOR EVERY REGULAR LANGUAGE, THERE IS
AN AUTOMATON WHICH PARSES IT.
Better still, there are somewhat complex algorithms
which take in a grammar, and produce the very automaton
(i.e. element list and lookup table) which parses it.
XML, and all markup languages, are regular AND SIMPLE.
In truth, they are a compiler-writer's dream, because
the tags can guide the parsing process. This eliminates
so much ambiguity and second-guessing, such as occurs
with non-tagged languages like C. With most programming
languages, a sequence of a few minor elements will not
be enough to specify what larger element is being
recognized. Nested tags eliminate this problem. Parsing
XPLs can therefore be fast. AND EASILY PROGRAMMED.
8. So fine for theory. Is there a free software tool
which does the job for us mortals? There is.
The tool is Yacc (or if we're in a GNU C environment,
Bison). Yacc takes a BNF grammar, and generates a parser
for it, in the form of a C binding, an element list,
and a quite large lookup table. The latter two form the
automaton for the grammar.
(The C binding is mostly for handling the content of elements,
rather than their type.)
Yacc output looks hideous. But once we have compiled it,
with a standard C compiler, and linked the result into our
favourite C program, then our C program has the ability
to parse the grammar. With the speed of a greased weasel.
Because table lookup is as simple as can be. This is why
just about all compilers are built around Yacc parsers.
9. At this point, we have a rough analogy:
In C, the process goes like this:
Yacc: <BNF grammar> XBasic.y -> <lookup table etc> y.tab.c
gcc: <C source> y.tab.c -> <C lib> y.tab.o
gcc: <C lib> y.tab.o,main.o -> Executable, including XBasic
In Java, using say James Clark's XP parser, it goes:
XP: <EBNF grammar> XBasic.dtd -> <Java> com.jclark.xml.parse
The application's Java code is interfaced as callbacks to
the com.jclark.xml.parse, so that the parser can activate
the code as it recognizes elements. (At least I think that's it.)
10. Now let me emphasize that we have to go through one
of these processes, just to get a parser which recognizes
well-formed XBasic when it sees it. And whichever way we go,
the situation is already fairly messy. That is, I know how
messy the Yacc route can get; I'm not experienced in Java,
so the XP route may be less messy than it looks.
The Java in [XML Complete], which is called from Microsoft's
MSXML parser, looks obscure to me - but then, callback code
always does. For what it's worth, James Clark's expat, a C
toolkit for adding XML parsing to C programs, is being built
into Netscape 5 and Perl - so clearly he knows what he's
doing, and the open-source communities must agree.
11. If you feel you're getting out of your depth here,
don't worry. The essence of parsing is this two-step
process:
(1) Parser-Generator : Grammar -> Parser
(2) Parser: Correct XBasic -> "This is a true XBasic document."
or
Parser: Anything Else -> "This is not XBasic, and <here>
is the first element which is
not as expected."
I'm saying that there is already a body of theory about this,
and a free C-based parser-generator technology. Any old-school
compiler hacker would reach for the Yacc manual without a
second thought. However, (1) Java is cross-platform, and
downloadable via HTML, and (2) EBNF may not be easy to
convert to good old BNF. So Java parser-generators may
be the way to go, and also Java parsers.
Comparing Yacc with the Java-based XP:
11.1 Yacc would probably produce an XPL parser faster;
but since loading and running an XPL application
requires the parser-generator to be produced ONCE,
and XPL programs to be parsed MANY TIMES, this is
not a compelling advantage by itself.
11.2 However, the XPL parser which Yacc produces would
certainly be faster: as in, There goes that greased
weasel. Producing a parser in the form of an automaton
is equivalent to compiling the parser. This is an
advantage which should eventually be addressed.
(I'd be interested in the exercise of writing a
lookup function for the Java Virtual Machine (JVM).
Half as fast as that there weasel, and cross-platform,
and dynamically downloadable like Java.)
11.3 Yacc may be overkill for markup languages, where
the elements are introduced by tags.
11.4 Yacc would create an XPL parser in the form of
a pair of C data structures, i.e. a list of
tokens (elements) and an array of integers
(lookup table). A little programming work would
be required to convert this into Java (but we
would only have to do that work once EVER).
The one thing we should NOT do, is hand-code a Java
program specifically for parsing XPL programs. Because
every time we changed the XPL specification during
development, we would have to modify the parser, and
debug and test it. This would slow research and
development badly, and unnecessarily.
So it seems that
parser-generator: xpl.dtd -> xpl.parser
is the only way to go.
12. Interestingly to me, some of the most active members
of the group are pretty new to programming, just as I am
pretty new to Java and XML. This means:
12.1 We should NOT take any long detours through C programming.
12.2 We SHOULD get as much leverage as possible from existing
Java efforts, such as James Clark's.
So, regardless of any advantages I think Yacc has, it should
not sway the group from its path in this experimental phase.
In other words, we should be following up Michael's and
Robert's initiative: to specify a straightforward
algorithmic language like Basic, and implement the parser
on James Clark's XP.
13. HOWEVER, I think there's a lot more to the total
situation than that. All the above is only the beginning
of what the XBasic example says to me.
<smiley> Whew. <\smiley>
Just about enough for now, but one last thing.
I think I see a niche for myself here, filling out the FAQ.
Can we settle on a mail-title prefix, say [xpl.faq], for
people to send in likely-to-be-frequently-asked questions?
I'll make it my business to collate, edit and recirculate
all answers, or else make up my own.
ASK ANYTHING, people. The more obvious, the better.
Aly's "What, in a very small nutshell, is compilation?"
is PERFECT for the Web community which we will want to
entertain, enlighten, and draw into our orbit. I'm going
to work hard on this one, it's a foundation stone.
More anon
Jonno
<footnote> "*",
<arfur-daley>
Yer gotta 'ave at least one sentence, that's full of
semicolons, nested clauses, parentheses and
indentations and occupies the paragraph, Terence. Otherwise the
punters will lose respect.
</arfur-daley>
</footnote>
----------------------------------------------------------------------
--------
pcworldani.gif
----------------------------------------------------------------------
--------
To unsubscribe from this group, send an email to:
xpl-unsubscribe@o...
--- End forwarded message ---
|