Re: [RBMetacode] very interesting parsing discussion in an LLVM tutorial
From: Joe S. <jo...@st...> - 2008-03-13 22:54:57
On Mar 13, 2008, at 2:36 AM, Thomas Tempelmann wrote:

> I see that Jonathan's and my attempt at the AST with Morphe2 comes pretty
> close to this design, which is a relief to me :)

Yes, it's a good design.

> All AST classes are derived from "ASTNode". This base class has an "items"
> array in which the "arguments" of any subclass are assembled.

I started with such a generic "items" (or "subnodes" as I called it) array
in my syntax node design too, but scrapped it in favor of something more
specific to each subclass that needs it. It's not a major issue, but I do
feel that pushing these references down into the subclasses will likely
reduce errors and slightly improve performance.

> The article, under "Binary Expression Parsing", suggests using a
> precedence table for all binary operators and grouping them by it...
>
> While this makes sense in a hand-coded parser, as it keeps the code
> leaner, with Morphe2, which interprets a grammar and converts it into
> program code, the programmer (us) usually works with the grammar, not
> with the parser code. When you have the grammar in front of you, it's
> IMO easier to write it in the classic way, where you spell out the rules
> for each precedence case (i.e. term, product, etc.), because using the
> precedence table in the grammar makes it less readable.

I'm not sure I'd agree with that. Spelling out all the rules to embody
precedence in an LR grammar is a lot of work, and easy to screw up, in my
experience. It's also a nuisance when you add a new operator (or need to
change an existing one), as it requires changes in several places; or when
somebody asks for a simple list of operators in precedence order. (And LA
grammars are even worse.)

> I really encourage you all to check out the TTsRBParser and TTsASTParser,
> which both use the Morphe2 grammar and output a tree (the former as a
> string, which looks similar to what ParserProto does, and the latter
> generates a tree of ASTNodes).

Just to be clear, ParserProto generates a tree of syntax nodes, and then
uses this tree to construct a string if that's what's wanted. (I'm not yet
happy with the format of that string -- it's only for debugging purposes,
but we should still make it as clear and readable as we can.)

> Sure, if you want to learn how parsing works, do it by hand, but once you
> want to keep up with a language as complex as RB, with its often-changing
> syntax, you'll be better off with a grammar you can adjust quickly to
> cope with the changes.

I used to think so too -- indeed, I wrote the current RB compiler using
Bison. But a few years later, after talking to some other experienced
compiler engineers and learning more about real-world compilers (including
very well-known ones like gcc), which use hand-coded parsers, I'm starting
to come around to the other side.

Parser generators are great in a classroom setting, or for research, where
parsing is just an annoying step to get out of the way before moving on to
a single specific problem you're trying to address. But in a real
development environment -- especially a highly integrated one -- parsing
is an important part of almost everything you do. It's used for editing,
refactoring, error reporting, incremental compiling, context-sensitive
help, and many other things. Actually writing the parser by hand is not
that hard -- a small task compared to the rest of the code that uses it --
and doing it by hand enables a lot of things that are hard to do with
auto-generated parsers.

For example: consider parsing of code fragments. Much editor support needs
to parse lines (and even incomplete lines) rather than entire programs.
Refactoring tools need to parse blocks of lines. An expression evaluator
(like the one in RB's properties pane) needs to parse expressions. An
auto-generated parser generally can't handle this, because the whole
grammar is built around a single "top" or root nonterminal: any input that
doesn't match all the way up to that root is rejected. There are ways to
shoehorn a grammar into allowing specific sorts of fragments (e.g., at the
top level you could explicitly allow a statement or an expression to map
to the root), but they're crude and can cause other problems. With a
hand-written parser, it's no big deal: you can call into the parser at any
point -- for example, call the "ParseExpr" method if you just want to
parse an expression. (Or better yet, keep things neat and clean by
dividing the parser into a chain of subclasses, and then instantiate the
class that reflects how much of the grammar you need.)
As another example, consider what happens when a quirk is discovered in
your grammar and code isn't being parsed as you expected (and yes, this
DOES happen!). With a hand-written parser, it's a pretty easy job to step
through the code and see what it's doing. With an auto-generated one on a
nontrivial grammar, stepping through the code is meaningless, and you're
reduced to fiddling with the grammar file and simulating it in your head
to try to understand what happened. (There are ways to examine the stack
as it works and such, but these are of only limited help.)

> Let's not keep re-inventing the wheel here but get some _solutions_
> done! If you see reasons why Morphe2 is not suitable, well, then let's
> hear them, but keep in mind that Jonathan wrote Morphe2 especially to
> write RB language parsers, so it should meet the needs by design. And if
> we should ever get him to join this group (where is he? he has not
> commented on anything yet), he could maybe give us some guidance on this
> as well.

I think he's in the group -- Jon, you there? I respect your frustration,
but this is a key decision and well worth taking carefully. Are you coming
to REAL World next week? If so, maybe we could discuss it more there.

Best,
- Joe
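P.S. Since the precedence table came up above: here's roughly what
table-driven binary-expression parsing looks like in a hand-written parser.
Again just a sketch (made-up names, a made-up operator set, C++ rather than
RB, and not the tutorial's code), but it shows why I like this style:
adding or re-ranking an operator is a one-line change to the table, not a
new grammar rule per precedence level.

// Sketch only: precedence-climbing expression parsing driven by a table.
// All of the precedence information lives in kPrecedence; the parsing
// logic never changes when operators are added or re-ranked.
#include <iostream>
#include <map>
#include <string>
#include <vector>

static const std::map<std::string, int> kPrecedence = {
    {"or", 1}, {"and", 2},
    {"=", 3},  {"<", 3}, {">", 3},
    {"+", 4},  {"-", 4},
    {"*", 5},  {"/", 5},
};

struct Parser {
    std::vector<std::string> toks;   // pre-split tokens, to keep this short
    std::size_t pos = 0;

    std::string Peek() { return pos < toks.size() ? toks[pos] : ""; }

    // Returns a parenthesized string rather than building nodes; the shape
    // of the result is what matters for the sketch.
    std::string ParsePrimary() { return toks[pos++]; }

    std::string ParseBinOpRHS(int minPrec, std::string lhs) {
        while (true) {
            auto op = kPrecedence.find(Peek());
            if (op == kPrecedence.end() || op->second < minPrec)
                return lhs;                     // no operator strong enough
            std::string opText = toks[pos++];
            std::string rhs = ParsePrimary();
            // If the next operator binds tighter, let it claim rhs first.
            auto next = kPrecedence.find(Peek());
            if (next != kPrecedence.end() && next->second > op->second)
                rhs = ParseBinOpRHS(op->second + 1, rhs);
            lhs = "(" + lhs + " " + opText + " " + rhs + ")";
        }
    }

    std::string ParseExpression() { return ParseBinOpRHS(1, ParsePrimary()); }
};

int main() {
    Parser p;
    p.toks = {"a", "+", "b", "*", "c", "<", "d"};
    std::cout << p.ParseExpression() << std::endl;   // ((a + (b * c)) < d)
    return 0;
}

(A real parser would build nodes and handle errors, of course; the table is
the point.)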