[RBMetacode] Tail-Recursive Parser prototype
Status: Planning
Brought to you by:
jstrout
|
From: Joe S. <jo...@st...> - 2008-02-28 21:43:22
|
I thought perhaps the parser issue was too big and slippery to attack
on a purely intellectual level, so I've decided to try making a small
parser so we can start getting some experience with it.
I should say that I know Thomas and Jon (and perhaps others?) have
already done a lot of work in this direction using Jon's nifty parser
generator, and that's still an option we should explore. But I want
to also explore the hand-written approach, not least because I have
an experienced compiler engineer whom I trust advising me to do so.
So, if you do a svn update, you'll find a "ParserProto" project. It
handles this grammar:
Exp: Term
Term: Term ('+' Factor)* | Factor
Factor: Factor ('*' Atom)* | Atom
Atom: <identifier> | <number>
So it can handle things like 2+3*4. It parses this into a concrete
syntax tree (which is to say, if we were handling parentheses, they
would actually be represented in the tree rather than abstracted away
right away). There is just one SyntaxNode class, though I imagine
we'd probably want a hierarchy of classes if we got more serious
about it. Each node corresponds to a coding token, pretty much, as
well as a series of noncoding tokens (i.e. whitespace) that may
follow it.
For inspection purposes, these nodes can print themselves out with or
without the noncoding tokens, or can try to reconstruct the original
source. This all seems to be working fine, except for any leading
whitespace -- I haven't found a good place to put that. (I also
didn't have much luck trying to capture a trailing comment; I think
to handle both of these, we need a StatementNode subclass that can
keep track of leading and trailing noncoding tokens separately.)
Big-picture wise, I think we should take a simple subset of the
grammar like this, and see how far we can take it as far as error
handling, relating any node in the tree to a range of text in the
original input, dealing with code fragments, etc. Actually, the
current grammar is probably TOO simple; we should add line
continuation, a couple more operators, and maybe function calls too.
Then maybe we take that grammar subset, and try it out several
different ways if necessary to convince ourselves that we've found
the best approach for our purposes -- and then push on to implement
the full language grammar.
Any comments?
Best,
- Joe
|