Re: [RBMetacode] AST generation
Status: Planning
Brought to you by:
jstrout
From: Joe S. <jo...@st...> - 2008-03-04 18:28:34
|
On Mar 4, 2008, at 10:47 AM, Thomas Tempelmann wrote: > Folks, I've checked in Jonathan's Morphe2, which is a quite well > working > parser generator. > > I've made a grammar for it which is not fully tested yet but tries > to deal > with some language quirks of RB in a smarter way (e.g. Jon, Mars > and Aaron > had not understood the concept of a "designator" (this is how I > call it in > the grammar, I remember that compiler designers have another word > for it but > I forgot) and that's why there's the confusion in the compiler > about what's > an expression and why one can pass a class property "x" to a ByRef > parm, but > not "me.x". That's a nice enhancement. And good job digging up "lvalue" for that -- we should try to stick to standard terminology as much as possible, so that when new people come on board with compiler training, they will be able to jump right in and not have to learn our quirky terminology. > So, I encourage you to look at Morphe2 and the two RB grammars I > and Jon > made for it in early 2006 (inside TTsRBParser and TTsASTParser). I've only done a cursory scan of this so far, but it does look very intriguing. I hope to look at it more deeply soon. Thanks for getting it checked into the repository. > I must say that I have no experience with ASTs, I only know I what > I need in > the end: > > A tree that recreates the source in a way where identifier > references become > objects that point to the declaration of such identifiers, so that > one can > modify the identifier's declaration and have it affect all references > automatically. > > That will help understanding all relationships in the source code for > refactoring, searching etc. Yes very true -- though that's not an AST, in itself; it's something like an AST plus a symbol table (or set of symbol tables). You actually can't resolve all symbols in one pass over the code, since some things may be references to stuff you haven't parsed yet. So there are a couple ways to deal with this: A. Have a separate mini-parser that scans the code, pulling out only the external declarations (things that are accessible from elsewhere, like classes, modules, methods, and properties, but not local variables). This gets you an initial set of symbol tables. Then run a full parser, which builds the AST, referencing the external symbols and also adding local symbol tables as it goes. ...or... B. Run the full parser right away, but leave symbol references unresolved (or at least, most of them -- I guess you could resolve local ones). This generates an AST, and also gathers up all the external references. It could in fact generate a complete set of symbol tables by the time it's done, but most of the symbol references in the AST would remain unresolved. Then, you go back in a separate pass and resolve the symbol references; either as its own step, or as part of doing whatever you do next (e.g., compiling the AST). In our case, we're not really writing a compiler; we're writing code analysis tools. For many of those, having unresolved symbols in the AST isn't terribly useful. Also, if you're making something like an editor, redoing a full parse of the whole source base whenever you think an external reference may have changed is pretty expensive. Option "A" is a lot cheaper in such an environment, because you just rerun the mini-parser when an external interface changes. So, I think to move forward we would need to do something like the following: 1. Define a symbol table data structure -- something to hold symbol declarations and represent scope (all the way from global to local, and everything in between). 2. Make a mini-parser that can scan a project and pull out all nonlocal symbols. 3. Make a full parser that takes the set of nonlocal symbol tables along with a hunk of code (which could be the whole program again, or maybe just one method). This should generate an AST (or a concrete ST, depending on what you intend to do with it), and tie each identifier to a specific entry in some symbol table (or else tell you that it's undefined). In an editor, we'd only do step 3 on whatever method the user is editing (most of the time). That'll tell us where any identifier is declared, what its type is, and what its scope is, which is the sort of thing we need to do syntax coloring and refactoring and so on. Best, - Joe |