[exprla-devel] Re: [xpl] SOME IDEAS

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

--- In xpl-dev@y..., "Garreth Galligan" <garreth.galligan@o...> wrote:

Well I agree with pretty much everything Jonathan has said, 
particularly 12.1 and 12.2 given the group's current lack of low-
level engineering experience (myself included). May as well approach 
something we have a realisitc chance of finishing.

>    AN XPL  DOCUMENT TYPE IS A PROGRAMMING LANGUAGE, 
>    AND ITS DTD IS  ITS GRAMMAR. DOCUMENTS OF ONE 
>    XPL TYPE ARE PROGRAMS IN  ONE  XPL LANGUAGE.

Very good :) Now the X in XPL is defined. I think this is the nearest 
we have got to a 'mission statement' of definition for XPL. Anyone?

A first possible step would be to compile a list of reference URLs 
related to some of the issues raised here. That way we can all taken 
a dip in a common knowledge-base.

  ----- Original Message ----- 
  From: Jonathan Burns 
  To: xpl@e... 
  Sent: Saturday, April 29, 2000 2:43 AM
  Subject: Re: [xpl] SOME IDEAS

  Michael Lauzon wrote: 

  >  I have made some XPL tags, which when I have time I'll 
  > upload them as a text file to our groups file site on Onelist. 
  > We can go from there,though this is a good start...also remember 
  > XPL is supposed to be a  compiled language, so if anyone knows 
  > how to program in Java (while  also using the XML parser [for 
  > Java] from James Clark; http://www.jclark.com/) would be 
  > helpful.  :) 

  Michael, are there working documents, any notes or old 
  email, on how XPL was conceived? Are we going to be the 
  ones who define the original charter? I ask because I'll 
  be injecting some weird stuff below, and I don't want to 
  be trying to hijack the group against its real intent. 
  (Later...) I've scanned your proposed tags. They're enough 
  to prototype an algorithmic language (like C, Pascal, Basic); 
  right now that's what we need. 

  On with the show... 

  THOUGHTS ON PARSING XPL(s) 

  Thanks to Robert Hopkins, for prototyping some XML-ized 
  Basic-like source code (dubbed "XBasic" here). I think 
  it throws a lot of major issues into relief... 

  1. First and fundamentally, XBasic is meant to be parsed. 
  There will be an XBasic Grammar, in the form of a DTD, 
  and an XBasic Document Parser,  say in the form of a  Java 
  module. 

  The Grammar determines  whether a block of text is truly 
  XBasic. This lets us guarantee that only correct XBasic 
  ever gets past  the Parser. 

  2. There is an intimate relationship  between XML document types 
  and programming languages:  both are defined by grammars. 

  This is a very strong  condition, and it provides enormous power. 
  All the flexibity of programming languages, and of XML document 
  types, is due to it. 

  3. The XBasic example implies: 

      AN XPL  DOCUMENT TYPE IS A PROGRAMMING LANGUAGE, 
      AND ITS DTD IS  ITS GRAMMAR. DOCUMENTS OF ONE 
      XPL TYPE ARE PROGRAMS IN  ONE  XPL LANGUAGE. 

  To me, this is the heart of XPL. 

  (So now I've clubbed you over the head with that ...) 

  4. In computer science tradition, XPL document types or 
  DTDs are called "regular languages". There is a huge body 
  of published results on "regular languages and automata". 
  The main users of these results are compiler  programmers, 
  and so every text in compiler design begins  with an 
  introduction to regular languages, and how to define them 
  in Backus-Naur Form (BNF). BNF has been  slightly expanded, 
  and integrated with lexical scanning, resulting in 
  Extended BNF (EBNF), in which DTDs are written. 

  5. What is an "automaton"? It is a finite state machine 
  which parses input according to a grammar. More obviously, 
  it is a lookup table, in which a parser looks up 

  {present machine state, incoming token} -> {next machine state} 

  And the point of this is, that the particular entries 
  in the lookup table are exactly what the grammar requires; 
  so that a correct sequence of tokens (e.g. XBasic elements) 
  causes the machine to pile up more and more recognized 
  elements, until it arrives at a state called, say, 

      "<XBasic-Parser-State:XPL-Document>" 

  at which it has recognized the "total document element", 
  so to speak; whereas an incorrect sequence of tokens 
  will lead it to a state called 

      "<XBasic-Parser-State:Error-condition> (error-number)" 

  for one error number or another. 

  <footnote-mark> * </footnote-mark> 

  6. What are these "tokens"? Well, they  are integers. 

  In a BNF parser, the elements of the grammar(usually 
  beginning with characters, punctuation marks and strings) 
  form a numbered list. Each element is known internally 
  by its token, which is its number in the list. 

  When the automaton is in one state and waiting for input, 
  the input comes as a token, which determines which particular 
  element is incoming. 

  (The content of the element is available  at this time, 
  but that is irrelevant to which element type is being 
  tested. And it's the type that determines grammatical 
  correctness.) 

  7. The key result in the theory of regular languages is, 

         FOR EVERY REGULAR LANGUAGE, THERE IS 
         AN AUTOMATON WHICH PARSES IT. 

  Better still, there are somewhat complex algorithms 
  which take in a grammar, and produce the very  automaton 
  (i.e. element list and lookup table) which parses it. 

  XML, and all markup languages, are regular AND SIMPLE. 

  In truth, they are a compiler-writer's dream, because 
  the tags can guide the parsing process. This eliminates 
  so much ambiguity and second-guessing, such as occurs 
  with non-tagged languages like C. With most programming 
  languages, a sequence of a few minor elements will not 
  be enough to specify what larger element is being 
  recognized. Nested tags eliminate this problem. Parsing 
  XPLs can therefore be fast. AND EASILY PROGRAMMED. 

  8. So fine for theory. Is there a free software tool 
  which does the job for us mortals? There is. 

  The tool is Yacc (or if we're in a GNU C environment, 
  Bison). Yacc takes a BNF grammar, and generates a parser 
  for it, in the form of a C binding, an element list, 
  and a quite large lookup table. The latter two form the 
  automaton for the grammar. 

  (The C binding is mostly for  handling the content of elements, 
rather than their type.) 

  Yacc output looks hideous. But once we have compiled it, 
  with a standard C  compiler, and linked the result into our 
  favourite C  program, then our C program has the ability 
  to parse  the grammar. With the speed of a greased weasel. 
  Because table lookup is as simple as can be. This is why 
  just about all compilers are built around Yacc parsers. 

  9. At this point, we have a rough analogy: 

  In C, the process goes like this: 

  Yacc: <BNF grammar> XBasic.y    ->    <lookup table etc> y.tab.c 
  gcc:  <C source> y.tab.c        ->    <C lib> y.tab.o 
  gcc:  <C lib> y.tab.o,main.o    ->    Executable, including XBasic 

  In Java, using say James Clark's XP parser, it goes: 

  XP: <EBNF grammar> XBasic.dtd   ->    <Java> com.jclark.xml.parse 

  The application's Java code is interfaced as callbacks to 
  the com.jclark.xml.parse, so that the parser can activate 
  the code as it recognizes elements. (At least I think that's it.) 

  10. Now let me emphasize that we have to go through one 
  of these processes, just to get a parser which recognizes 
  well-formed XBasic when it sees it. And whichever way we go, 
  the situation is already fairly messy. That is, I know how 
  messy the Yacc route can get; I'm not experienced in Java, 
  so the XP route may be less messy than it looks. 

  The Java in [XML Complete], which is called from Microsoft's 
  MSXML  parser, looks obscure to me - but then, callback code 
  always  does. For what it's worth, James Clark's  expat, a C 
  toolkit for adding XML parsing to C programs,  is being built 
  into Netscape 5 and Perl - so clearly he knows what he's 
  doing, and the open-source communities must agree. 

  11. If you feel you're getting out of your depth here, 
  don't worry. The essence of parsing is this two-step 
  process: 

  (1) Parser-Generator : Grammar -> Parser 
  (2) Parser: Correct XBasic -> "This is a true XBasic document." 
      or 
      Parser: Anything Else -> "This is not XBasic, and <here> 
                             is the first element which is 
                             not as expected." 

  I'm saying that there is already a body of theory about this, 
  and a free C-based parser-generator technology. Any old-school 
  compiler hacker would reach for the Yacc manual without a 
  second thought. However, (1) Java is cross-platform, and 
  downloadable via HTML, and (2) EBNF may not be easy to 
  convert to good old BNF. So Java parser-generators may 
  be the  way to go, and also Java parsers. 

  Comparing Yacc with the Java-based XP: 

  11.1  Yacc would probably produce an XPL parser faster; 
        but since loading and running an XPL application 
        requires the parser-generator to be produced ONCE, 
        and XPL programs to be parsed MANY TIMES, this is 
        not a compelling advantage by itself. 

  11.2  However, the XPL parser which Yacc produces would 
        certainly be faster: as in, There goes that greased 
        weasel. Producing a parser in the form of an automaton 
        is equivalent to compiling the parser. This is an 
        advantage which should eventually be addressed. 
        (I'd be interested in the exercise of writing a 
        lookup function for the Java Virtual Machine (JVM). 
        Half as fast as that there weasel, and cross-platform, 
        and dynamically downloadable like Java.) 

  11.3  Yacc may be overkill for markup languages, where 
        the elements are introduced by tags. 

  11.4  Yacc would create an XPL parser in the form of 
        a pair of C data structures, i.e. a list of 
        tokens (elements) and an array of integers 
        (lookup table). A little programming work would 
        be required to convert this into Java (but we 
        would only have to do that work once EVER). 

  The one thing we should NOT do, is hand-code a Java 
  program specifically for parsing XPL programs. Because 
  every time we changed the XPL specification during 
  development, we would have to modify the parser, and 
  debug and test it. This would slow research and 
  development badly, and unnecessarily. 

  So it seems that 

      parser-generator: xpl.dtd -> xpl.parser 

  is the only way to go. 

  12. Interestingly to me, some of the most active members 
  of the group are pretty new to programming, just as I am 
  pretty new to Java and XML. This means: 

  12.1 We should NOT take any long detours through C programming. 

  12.2 We SHOULD get as much leverage as possible from existing 
       Java efforts, such as James Clark's. 

  So, regardless of any advantages I think Yacc has, it should 
  not sway the group from its path in this experimental phase. 

  In other words, we should be following up Michael's and 
  Robert's initiative: to specify a straightforward 
  algorithmic language like Basic, and implement the parser 
  on James Clark's XP. 

  13. HOWEVER, I think there's a lot more to the total 
  situation than that. All the above is only the beginning 
  of what the XBasic example says to me. 

  Just about enough for now, but one last thing. 

  I think I see a niche for myself here, filling out the FAQ. 
  Can we settle on a mail-title prefix, say [xpl.faq], for 
  people to send in likely-to-be-frequently-asked questions? 
  I'll make it my business to collate, edit and recirculate 
  all answers, or else make up my own. 

  ASK ANYTHING, people. The more obvious, the better. 

  Aly's "What, in a very small nutshell, is compilation?" 
  is PERFECT for the Web community which we will want to 
  entertain, enlighten, and draw into our orbit. I'm going 
  to work hard on this one, it's a foundation stone. 

  More anon 
  Jonno 

  <footnote> "*", 
      <arfur-daley> 
      Yer gotta 'ave at least one sentence, that's full of 
semicolons, nested clauses, parentheses and 
      indentations and occupies the paragraph, Terence. Otherwise the 
punters will lose respect. 
   </arfur-daley> 
  </footnote> 

----------------------------------------------------------------------
--------
  pcworldani.gif 
----------------------------------------------------------------------
--------
  To unsubscribe from this group, send an email to:
  xpl-unsubscribe@o...
--- End forwarded message ---