Re: [RBMetacode] AST generation

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On Mar 4, 2008, at 11:55 AM, Thomas Tempelmann wrote:

>> Option "A" is a lot cheaper in such an environment, because you just
>> rerun the mini-parser when an external interface changes.
>
> I'd rather go with Option B, though: It should be faster (parsing the
> original source code is rather slow compared to traversing the  
> generated
> tree and looking for the right types) and it's more universal as  
> we're first
> building a more general tree which we then attribute with extra  
> information.
>
> With Option B, we also can see better what's left (e.g. all the  
> identifiers
> from the RB framework), filter them and recreate a table from that  
> to use as
> a prototype of the RB framework.
>
> What do you and others think?

I'll wait for other opinions, and let this all sink in a bit, before  
I opine any more on this issue myself.

> In any case, what should be the structure we're generating for the  
> tree?
> Does the one the TTsASTParser tries to generate make sense to you?
> Basically, it uses separate classes for each type of grammar node, for
> example it has a unique class for each possible operator. I think  
> that's a
> bit of overkill - I'd rather have a class for each of the operator
> priorities (one for plus/minus/or, one for mul/div/and, one for  
> comparisons,
> one for not and negate, etc., and then add a property naming the  
> particular
> operator). That way, the hierarchy of operators is seen right in  
> the tree.

I agree that a separate class for each operator is overkill.  I'm not  
even convinced you need a separate class for different priorities.   
The parser needs separate rules for them (well, maybe, depending on  
what type of parser it is), but the AST shouldn't care about that --  
by the time you have an AST, operator precedence no longer matters.   
It's already been taken care of.

So I've been using just a BinopNode to represent all binary operator  
nodes, and it's been working fine for me.  (Actually, that's a bit of  
an exaggeration; it's only for binary operators within expressions.   
The assignment operator is a separate node.)

> Similarly, one class for all kind of loops, maybe, and a sub-class
> identifying the different kinds of loops with their parameters. But  
> maybe
> this won't work because it requires that the code is free of errors  
> such as
> a missing "next". Maybe it's better that the initial tree is free  
> of such
> semantics?

Well, I think you do need nodes that represent various kinds of loops  
-- and organizing these into a class for all loops with a subclass  
for specific kinds of loops makes sense to me.  That way, any code  
that needs to deal with loops in general can address the superclass,  
and code that needs to be more specific can deal with the subclasses.

We need to take care to distinguish between an AST and a CST  
(Concrete Syntax Tree), though.  A CST can always be converted into  
an AST, but the reverse is not true.  There area lot of minor  
differences, but for example, an AST won't contain any parentheses,  
since those don't affect the abstract syntax; the guts of the  
parentheses will simply appear as an expression node directly within  
whatever node is using that parenthesized expression.  But in a CST,  
the parens are like an operator, represented with their own node in  
the tree.

I've been generating a CST, and even keeping track of noncoding  
tokens (i.e. whitespace), since that's actually more general -- you  
can do pretty much anything with it you could do with an AST, but  
also more, such as reconstructing the original source.  That'll be  
important for things like refactoring tools, where you want to  
rearrange the source tree a bit, but maintain the user's original  
syntax as much as possible.

Best,
- Joe