[exprla-devel] Re: [xpl] SOME IDEAS
Status: Pre-Alpha
Brought to you by:
xpl2
From: reid_spencer <ras...@re...> - 2002-01-31 07:20:00
|
--- In xpl-dev@y..., "Garreth Galligan" <garreth.galligan@o...> wrote: Well I agree with pretty much everything Jonathan has said, particularly 12.1 and 12.2 given the group's current lack of low- level engineering experience (myself included). May as well approach something we have a realisitc chance of finishing. > AN XPL DOCUMENT TYPE IS A PROGRAMMING LANGUAGE, > AND ITS DTD IS ITS GRAMMAR. DOCUMENTS OF ONE > XPL TYPE ARE PROGRAMS IN ONE XPL LANGUAGE. Very good :) Now the X in XPL is defined. I think this is the nearest we have got to a 'mission statement' of definition for XPL. Anyone? A first possible step would be to compile a list of reference URLs related to some of the issues raised here. That way we can all taken a dip in a common knowledge-base. ----- Original Message ----- From: Jonathan Burns To: xpl@e... Sent: Saturday, April 29, 2000 2:43 AM Subject: Re: [xpl] SOME IDEAS Michael Lauzon wrote: > I have made some XPL tags, which when I have time I'll > upload them as a text file to our groups file site on Onelist. > We can go from there,though this is a good start...also remember > XPL is supposed to be a compiled language, so if anyone knows > how to program in Java (while also using the XML parser [for > Java] from James Clark; http://www.jclark.com/) would be > helpful. :) Michael, are there working documents, any notes or old email, on how XPL was conceived? Are we going to be the ones who define the original charter? I ask because I'll be injecting some weird stuff below, and I don't want to be trying to hijack the group against its real intent. (Later...) I've scanned your proposed tags. They're enough to prototype an algorithmic language (like C, Pascal, Basic); right now that's what we need. On with the show... THOUGHTS ON PARSING XPL(s) Thanks to Robert Hopkins, for prototyping some XML-ized Basic-like source code (dubbed "XBasic" here). I think it throws a lot of major issues into relief... 1. First and fundamentally, XBasic is meant to be parsed. There will be an XBasic Grammar, in the form of a DTD, and an XBasic Document Parser, say in the form of a Java module. The Grammar determines whether a block of text is truly XBasic. This lets us guarantee that only correct XBasic ever gets past the Parser. 2. There is an intimate relationship between XML document types and programming languages: both are defined by grammars. This is a very strong condition, and it provides enormous power. All the flexibity of programming languages, and of XML document types, is due to it. 3. The XBasic example implies: AN XPL DOCUMENT TYPE IS A PROGRAMMING LANGUAGE, AND ITS DTD IS ITS GRAMMAR. DOCUMENTS OF ONE XPL TYPE ARE PROGRAMS IN ONE XPL LANGUAGE. To me, this is the heart of XPL. (So now I've clubbed you over the head with that ...) 4. In computer science tradition, XPL document types or DTDs are called "regular languages". There is a huge body of published results on "regular languages and automata". The main users of these results are compiler programmers, and so every text in compiler design begins with an introduction to regular languages, and how to define them in Backus-Naur Form (BNF). BNF has been slightly expanded, and integrated with lexical scanning, resulting in Extended BNF (EBNF), in which DTDs are written. 5. What is an "automaton"? It is a finite state machine which parses input according to a grammar. More obviously, it is a lookup table, in which a parser looks up {present machine state, incoming token} -> {next machine state} And the point of this is, that the particular entries in the lookup table are exactly what the grammar requires; so that a correct sequence of tokens (e.g. XBasic elements) causes the machine to pile up more and more recognized elements, until it arrives at a state called, say, "<XBasic-Parser-State:XPL-Document>" at which it has recognized the "total document element", so to speak; whereas an incorrect sequence of tokens will lead it to a state called "<XBasic-Parser-State:Error-condition> (error-number)" for one error number or another. <footnote-mark> * </footnote-mark> 6. What are these "tokens"? Well, they are integers. In a BNF parser, the elements of the grammar(usually beginning with characters, punctuation marks and strings) form a numbered list. Each element is known internally by its token, which is its number in the list. When the automaton is in one state and waiting for input, the input comes as a token, which determines which particular element is incoming. (The content of the element is available at this time, but that is irrelevant to which element type is being tested. And it's the type that determines grammatical correctness.) 7. The key result in the theory of regular languages is, FOR EVERY REGULAR LANGUAGE, THERE IS AN AUTOMATON WHICH PARSES IT. Better still, there are somewhat complex algorithms which take in a grammar, and produce the very automaton (i.e. element list and lookup table) which parses it. XML, and all markup languages, are regular AND SIMPLE. In truth, they are a compiler-writer's dream, because the tags can guide the parsing process. This eliminates so much ambiguity and second-guessing, such as occurs with non-tagged languages like C. With most programming languages, a sequence of a few minor elements will not be enough to specify what larger element is being recognized. Nested tags eliminate this problem. Parsing XPLs can therefore be fast. AND EASILY PROGRAMMED. 8. So fine for theory. Is there a free software tool which does the job for us mortals? There is. The tool is Yacc (or if we're in a GNU C environment, Bison). Yacc takes a BNF grammar, and generates a parser for it, in the form of a C binding, an element list, and a quite large lookup table. The latter two form the automaton for the grammar. (The C binding is mostly for handling the content of elements, rather than their type.) Yacc output looks hideous. But once we have compiled it, with a standard C compiler, and linked the result into our favourite C program, then our C program has the ability to parse the grammar. With the speed of a greased weasel. Because table lookup is as simple as can be. This is why just about all compilers are built around Yacc parsers. 9. At this point, we have a rough analogy: In C, the process goes like this: Yacc: <BNF grammar> XBasic.y -> <lookup table etc> y.tab.c gcc: <C source> y.tab.c -> <C lib> y.tab.o gcc: <C lib> y.tab.o,main.o -> Executable, including XBasic In Java, using say James Clark's XP parser, it goes: XP: <EBNF grammar> XBasic.dtd -> <Java> com.jclark.xml.parse The application's Java code is interfaced as callbacks to the com.jclark.xml.parse, so that the parser can activate the code as it recognizes elements. (At least I think that's it.) 10. Now let me emphasize that we have to go through one of these processes, just to get a parser which recognizes well-formed XBasic when it sees it. And whichever way we go, the situation is already fairly messy. That is, I know how messy the Yacc route can get; I'm not experienced in Java, so the XP route may be less messy than it looks. The Java in [XML Complete], which is called from Microsoft's MSXML parser, looks obscure to me - but then, callback code always does. For what it's worth, James Clark's expat, a C toolkit for adding XML parsing to C programs, is being built into Netscape 5 and Perl - so clearly he knows what he's doing, and the open-source communities must agree. 11. If you feel you're getting out of your depth here, don't worry. The essence of parsing is this two-step process: (1) Parser-Generator : Grammar -> Parser (2) Parser: Correct XBasic -> "This is a true XBasic document." or Parser: Anything Else -> "This is not XBasic, and <here> is the first element which is not as expected." I'm saying that there is already a body of theory about this, and a free C-based parser-generator technology. Any old-school compiler hacker would reach for the Yacc manual without a second thought. However, (1) Java is cross-platform, and downloadable via HTML, and (2) EBNF may not be easy to convert to good old BNF. So Java parser-generators may be the way to go, and also Java parsers. Comparing Yacc with the Java-based XP: 11.1 Yacc would probably produce an XPL parser faster; but since loading and running an XPL application requires the parser-generator to be produced ONCE, and XPL programs to be parsed MANY TIMES, this is not a compelling advantage by itself. 11.2 However, the XPL parser which Yacc produces would certainly be faster: as in, There goes that greased weasel. Producing a parser in the form of an automaton is equivalent to compiling the parser. This is an advantage which should eventually be addressed. (I'd be interested in the exercise of writing a lookup function for the Java Virtual Machine (JVM). Half as fast as that there weasel, and cross-platform, and dynamically downloadable like Java.) 11.3 Yacc may be overkill for markup languages, where the elements are introduced by tags. 11.4 Yacc would create an XPL parser in the form of a pair of C data structures, i.e. a list of tokens (elements) and an array of integers (lookup table). A little programming work would be required to convert this into Java (but we would only have to do that work once EVER). The one thing we should NOT do, is hand-code a Java program specifically for parsing XPL programs. Because every time we changed the XPL specification during development, we would have to modify the parser, and debug and test it. This would slow research and development badly, and unnecessarily. So it seems that parser-generator: xpl.dtd -> xpl.parser is the only way to go. 12. Interestingly to me, some of the most active members of the group are pretty new to programming, just as I am pretty new to Java and XML. This means: 12.1 We should NOT take any long detours through C programming. 12.2 We SHOULD get as much leverage as possible from existing Java efforts, such as James Clark's. So, regardless of any advantages I think Yacc has, it should not sway the group from its path in this experimental phase. In other words, we should be following up Michael's and Robert's initiative: to specify a straightforward algorithmic language like Basic, and implement the parser on James Clark's XP. 13. HOWEVER, I think there's a lot more to the total situation than that. All the above is only the beginning of what the XBasic example says to me. <smiley> Whew. <\smiley> Just about enough for now, but one last thing. I think I see a niche for myself here, filling out the FAQ. Can we settle on a mail-title prefix, say [xpl.faq], for people to send in likely-to-be-frequently-asked questions? I'll make it my business to collate, edit and recirculate all answers, or else make up my own. ASK ANYTHING, people. The more obvious, the better. Aly's "What, in a very small nutshell, is compilation?" is PERFECT for the Web community which we will want to entertain, enlighten, and draw into our orbit. I'm going to work hard on this one, it's a foundation stone. More anon Jonno <footnote> "*", <arfur-daley> Yer gotta 'ave at least one sentence, that's full of semicolons, nested clauses, parentheses and indentations and occupies the paragraph, Terence. Otherwise the punters will lose respect. </arfur-daley> </footnote> ---------------------------------------------------------------------- -------- pcworldani.gif ---------------------------------------------------------------------- -------- To unsubscribe from this group, send an email to: xpl-unsubscribe@o... --- End forwarded message --- |