[RBMetacode] Tail-Recursive Parser prototype
Status: Planning
Brought to you by:
jstrout
From: Joe S. <jo...@st...> - 2008-02-28 21:43:22
|
I thought perhaps the parser issue was too big and slippery to attack on a purely intellectual level, so I've decided to try making a small parser so we can start getting some experience with it. I should say that I know Thomas and Jon (and perhaps others?) have already done a lot of work in this direction using Jon's nifty parser generator, and that's still an option we should explore. But I want to also explore the hand-written approach, not least because I have an experienced compiler engineer whom I trust advising me to do so. So, if you do a svn update, you'll find a "ParserProto" project. It handles this grammar: Exp: Term Term: Term ('+' Factor)* | Factor Factor: Factor ('*' Atom)* | Atom Atom: <identifier> | <number> So it can handle things like 2+3*4. It parses this into a concrete syntax tree (which is to say, if we were handling parentheses, they would actually be represented in the tree rather than abstracted away right away). There is just one SyntaxNode class, though I imagine we'd probably want a hierarchy of classes if we got more serious about it. Each node corresponds to a coding token, pretty much, as well as a series of noncoding tokens (i.e. whitespace) that may follow it. For inspection purposes, these nodes can print themselves out with or without the noncoding tokens, or can try to reconstruct the original source. This all seems to be working fine, except for any leading whitespace -- I haven't found a good place to put that. (I also didn't have much luck trying to capture a trailing comment; I think to handle both of these, we need a StatementNode subclass that can keep track of leading and trailing noncoding tokens separately.) Big-picture wise, I think we should take a simple subset of the grammar like this, and see how far we can take it as far as error handling, relating any node in the tree to a range of text in the original input, dealing with code fragments, etc. Actually, the current grammar is probably TOO simple; we should add line continuation, a couple more operators, and maybe function calls too. Then maybe we take that grammar subset, and try it out several different ways if necessary to convince ourselves that we've found the best approach for our purposes -- and then push on to implement the full language grammar. Any comments? Best, - Joe |