Re: [RBMetacode] AST generation

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On Mar 4, 2008, at 10:47 AM, Thomas Tempelmann wrote:

> Folks, I've checked in Jonathan's Morphe2, which is a quite well  
> working
> parser generator.
>
> I've made a grammar for it which is not fully tested yet but tries  
> to deal
> with some language quirks of RB in a smarter way (e.g. Jon, Mars  
> and Aaron
> had not understood the concept of a "designator" (this is how I  
> call it in
> the grammar, I remember that compiler designers have another word  
> for it but
> I forgot) and that's why there's the confusion in the compiler  
> about what's
> an expression and why one can pass a class property "x" to a ByRef  
> parm, but
> not "me.x".

That's a nice enhancement.  And good job digging up "lvalue" for that  
-- we should try to stick to standard terminology as much as  
possible, so that when new people come on board with compiler  
training, they will be able to jump right in and not have to learn  
our quirky terminology.

> So, I encourage you to look at Morphe2 and the two RB grammars I  
> and Jon
> made for it in early 2006 (inside TTsRBParser and TTsASTParser).

I've only done a cursory scan of this so far, but it does look very  
intriguing.  I hope to look at it more deeply soon.  Thanks for  
getting it checked into the repository.

> I must say that I have no experience with ASTs, I only know I what  
> I need in
> the end:
>
> A tree that recreates the source in a way where identifier  
> references become
> objects that point to the declaration of such identifiers, so that  
> one can
> modify the identifier's declaration and have it affect all references
> automatically.
>
> That will help understanding all relationships in the source code for
> refactoring, searching etc.

Yes very true -- though that's not an AST, in itself; it's something  
like an AST plus a symbol table (or set of symbol tables).  You  
actually can't resolve all symbols in one pass over the code, since  
some things may be references to stuff you haven't parsed yet.  So  
there are a couple ways to deal with this:

A. Have a separate mini-parser that scans the code, pulling out only  
the external declarations (things that are accessible from elsewhere,  
like classes, modules, methods, and properties, but not local  
variables).  This gets you an initial set of symbol tables.  Then run  
a full parser, which builds the AST, referencing the external symbols  
and also adding local symbol tables as it goes.

...or...

B. Run the full parser right away, but leave symbol references  
unresolved (or at least, most of them -- I guess you could resolve  
local ones).  This generates an AST, and also gathers up all the  
external references.  It could in fact generate a complete set of  
symbol tables by the time it's done, but most of the symbol  
references in the AST would remain unresolved.  Then, you go back in  
a separate pass and resolve the symbol references; either as its own  
step, or as part of doing whatever you do next (e.g., compiling the  
AST).

In our case, we're not really writing a compiler; we're writing code  
analysis tools.  For many of those, having unresolved symbols in the  
AST isn't terribly useful.  Also, if you're making something like an  
editor, redoing a full parse of the whole source base whenever you  
think an external reference may have changed is pretty expensive.   
Option "A" is a lot cheaper in such an environment, because you just  
rerun the mini-parser when an external interface changes.

So, I think to move forward we would need to do something like the  
following:

1. Define a symbol table data structure -- something to hold symbol  
declarations and represent scope (all the way from global to local,  
and everything in between).

2. Make a mini-parser that can scan a project and pull out all  
nonlocal symbols.

3. Make a full parser that takes the set of nonlocal symbol tables  
along with a hunk of code (which could be the whole program again, or  
maybe just one method).  This should generate an AST (or a concrete  
ST, depending on what you intend to do with it), and tie each  
identifier to a specific entry in some symbol table (or else tell you  
that it's undefined).

In an editor, we'd only do step 3 on whatever method the user is  
editing (most of the time).  That'll tell us where any identifier is  
declared, what its type is, and what its scope is, which is the sort  
of thing we need to do syntax coloring and refactoring and so on.

Best,
- Joe