Thread: [RBMetacode] very interesting parsing discussion in an LLVM tutorial
From: Joe S. <jo...@st...> - 2008-03-11 14:37:14
I ran across this very interesting page today:

<http://www.llvm.org/docs/tutorial/LangImpl2.html>

This is part of a tutorial for LLVM, in which you implement an entire (simple) language in about an hour or so. Most of the heavy lifting is done by LLVM itself, which is a topic for another time (LLVM is an open-source compiler infrastructure that is very impressive). But this chapter of the tutorial has some especially useful nuggets for us -- in particular, the operator-precedence parsing they use to handle binary operators. That would be more efficient than the approach currently used in the ParserProto (not sure how it compares to what Morphe2 generates).

I plan to go through this whole tutorial sometime soon (perhaps while travelling next week, if I can get the LLVM code to work on my ancient PowerBook). But I'd recommend at least this chapter to anyone here who's interested in parsing RB code.

Best,
- Joe

--
Joe Strout
Inspiring Applications, Inc.
http://www.InspiringApps.com
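P.S. For anyone who'd rather not wade through the C++ tutorial right away, here is roughly the shape of the trick as I understand it. This is only a minimal sketch in C++ -- the toy token stream, the Expr node, and all the names are my own illustration, not code from the tutorial or from ParserProto:

    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    struct Expr {                                   // tiny AST node for the sketch
        std::string text;                           // operator symbol or literal
        std::unique_ptr<Expr> lhs, rhs;
    };

    static std::vector<std::string> toks;           // pre-lexed token stream
    static std::size_t pos = 0;
    static std::string peekTok() { return pos < toks.size() ? toks[pos] : ""; }
    static std::string nextTok() { return pos < toks.size() ? toks[pos++] : ""; }

    // One precedence table replaces a grammar rule per level (term, product, ...).
    static std::map<std::string, int> precTable = {
        {"or", 10}, {"and", 20}, {"+", 30}, {"-", 30}, {"*", 40}, {"/", 40}};

    static int precOf(const std::string &t) {
        auto it = precTable.find(t);
        return it == precTable.end() ? -1 : it->second;
    }

    static std::unique_ptr<Expr> parsePrimary() {   // literals/identifiers only here
        auto e = std::make_unique<Expr>();
        e->text = nextTok();
        return e;
    }

    // Keep absorbing operators that bind at least as tightly as minPrec.
    static std::unique_ptr<Expr> parseBinOpRHS(int minPrec, std::unique_ptr<Expr> lhs) {
        for (;;) {
            int p = precOf(peekTok());
            if (p < minPrec) return lhs;            // not an operator, or binds too loosely
            std::string op = nextTok();
            auto rhs = parsePrimary();
            if (precOf(peekTok()) > p)              // next operator binds tighter: recurse
                rhs = parseBinOpRHS(p + 1, std::move(rhs));
            auto node = std::make_unique<Expr>();
            node->text = op;
            node->lhs = std::move(lhs);
            node->rhs = std::move(rhs);
            lhs = std::move(node);
        }
    }

    static void dump(const Expr *e) {               // prints the tree LISP-style
        if (!e->lhs) { std::cout << e->text; return; }
        std::cout << "(" << e->text << " ";
        dump(e->lhs.get());
        std::cout << " ";
        dump(e->rhs.get());
        std::cout << ")";
    }

    int main() {
        toks = {"1", "+", "x", "*", "2", "or", "y"};
        dump(parseBinOpRHS(0, parsePrimary()).get());  // prints (or (+ 1 (* x 2)) y)
        std::cout << "\n";
    }

The single table plus one loop replaces a separate routine (or grammar rule) per precedence level, so adding or re-ranking an operator is just a table change.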
From: Thomas T. <tt...@te...> - 2008-03-13 08:36:43
On 11.03.2008 15:36 Uhr, "Joe Strout" <jo...@st...> wrote:

> I ran across this very interesting page today:
>
> <http://www.llvm.org/docs/tutorial/LangImpl2.html>

Thanks. Good reading on some principles -- e.g. it explains what the outcome and purpose of an AST should be, and how to design the AST classes. I see that Jonathan's and my attempt at the AST with Morphe2 comes pretty close to this design, which is a relief to me :) (After all, Jonathan did his homework on parser generator design; this was already his second version, rebuilt from scratch, and it got another overhaul once we tried it with the RB grammar.)

If you check out "TTsASTParser" from our "RBMetacode" svn repository at sourceforge.net, you'll find a set of classes in there for the AST. There is an "Expression" class, which then has direct subclasses such as "Operator" and "LiteralTerm". The "Operator" has subclasses for unary and binary operators, and so on. The classes are still missing their data and constructor members, though (that's basically why this project isn't finished yet).

All AST classes are derived from "ASTNode". This base class has an "items" array in which the "arguments" of any subclass are assembled. E.g., an expression such as "1 + x" will consist of a "TermExpr" (for operators such as "+", "-", "or") with two array items: an "IntegerLiteral" and, uh... whoops, I see that an "Identifier" node class (as a subclass of "Expression") is still missing. Anyhow, I hope you get the basic idea. To store the direct item referenced by an ASTNode subclass, maybe a simple Variant would do, which then stores the current token (i.e. the TermExpr would store the "+", while the IntegerLiteral would store the "1" from the example above). That way, not every subclass needs its own constructor and accessors for this (too much work to type them all...).

So, yes, this is a good article, and if you read it, you should have no problem understanding how Morphe2 and TTsASTParser should work.

> this chapter of the tutorial has some especially useful nuggets for
> us -- in particular, the operator-precedence parsing they use to
> handle binary operators. That would be more efficient than the
> approach currently used in the ParserProto (not sure how it compares
> to what Morphe2 generates).

The article, under "Binary Expression Parsing", suggests using a precedence table for all binary operators and grouping them by it. So, instead of using special handlers for terms (plus, minus, or) and products (mul, div, and) and so on, it uses one central binary operator handler that deals with them all based on their value from the precedence table.

While this makes sense in a hand-coded parser, as it keeps the code leaner, with Morphe2, which interprets a grammar and converts it into program code, the programmer (us) usually works with the grammar, not with the parser code. When you have the grammar in front of you, it's IMO easier to write it in the classic way, where you spell out the rules for each precedence level (i.e. term, product, etc.), because using the precedence table in the grammar makes it less readable.

I really encourage you all to check out TTsRBParser and TTsASTParser, which both use the Morphe2 grammar and output a tree (the former as a string, which looks similar to what ParserProto does, and the latter generates a tree of ASTNodes).
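To make the class layout above a bit more concrete, here is a rough sketch of the hierarchy. I've written it as C++ for brevity (the real classes are RB, of course), and the "token" member is only my guess at the still-missing data members, so don't take the exact names and nesting as gospel:

    #include <memory>
    #include <string>
    #include <vector>

    // Base class: every node keeps its sub-nodes ("arguments") in a generic
    // items array, plus one Variant-like slot for the token it directly
    // represents.
    struct ASTNode {
        std::string token;                              // e.g. "+", "1", "x"
        std::vector<std::unique_ptr<ASTNode>> items;    // the node's "arguments"
        virtual ~ASTNode() = default;
    };

    struct Expression : ASTNode {};

    struct LiteralTerm    : Expression {};
    struct IntegerLiteral : LiteralTerm {};
    struct Identifier     : Expression {};              // the subclass noted as missing above

    struct Operator       : Expression {};
    struct UnaryOperator  : Operator {};
    struct BinaryOperator : Operator {};
    struct TermExpr       : BinaryOperator {};          // "+", "-", "or"
    struct ProductExpr    : BinaryOperator {};          // "*", "/", "and"

    // "1 + x" then becomes a TermExpr holding "+", with two items:
    std::unique_ptr<ASTNode> onePlusX() {
        auto plus = std::make_unique<TermExpr>();
        plus->token = "+";
        auto one = std::make_unique<IntegerLiteral>();
        one->token = "1";
        auto x = std::make_unique<Identifier>();
        x->token = "x";
        plus->items.push_back(std::move(one));
        plus->items.push_back(std::move(x));
        return plus;
    }

The nice part of the generic items array plus the single Variant-like slot is that most of the subclasses can stay empty shells, which is exactly why I'd like to avoid writing constructors and accessors for each of them.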
Sure, if you want to learn how parsing works, do it by hand; but once you want to keep up with a language as complex as RB, with its often-changing syntax, you'll be better off with a grammar you can adjust quickly to cope with the changes.

Let's not keep re-inventing the wheel here but get some _solutions_ done! If you see reasons why Morphe2 is not suitable, well, then let's hear them, but keep in mind that Jonathan wrote Morphe2 especially for writing an RB language parser, so it should meet our needs by design. And if we should ever get him to join this group (where is he? he has not commented on anything yet), he could maybe give us some guidance on this as well.

Thomas
From: Joe S. <jo...@st...> - 2008-03-13 22:54:57
On Mar 13, 2008, at 2:36 AM, Thomas Tempelmann wrote:

> I see that Jonathan's and my attempt at the AST with Morphe2 comes
> pretty close to this design, which is a relief to me :)

Yes, it's a good design.

> All AST classes are derived from "ASTNode". This base class has an
> "items" array in which the "arguments" of any subclass are assembled.

I started with such a generic "items" (or "subnodes", as I called it) array in my syntax node design too, but scrapped it in favor of something more specific to each subclass that needs it. It's not a major issue, but I do feel that pushing these references down into the subclasses will likely reduce errors and slightly improve performance. (I'll sketch what I mean further down.)

> The article, under "Binary Expression Parsing", suggests using a
> precedence table for all binary operators and grouping them by it...
>
> While this makes sense in a hand-coded parser, as it keeps the code
> leaner, with Morphe2, which interprets a grammar and converts it into
> program code, the programmer (us) usually works with the grammar, not
> with the parser code. When you have the grammar in front of you, it's
> IMO easier to write it in the classic way, where you spell out the
> rules for each precedence level (i.e. term, product, etc.), because
> using the precedence table in the grammar makes it less readable.

I'm not sure I'd agree with that. Spelling out all the rules to embody precedence in an LR grammar is a lot of work, and easy to screw up, in my experience. It's also a nuisance when you add a new operator (or need to change an existing one), as it requires changes in several places; or when somebody asks for a simple list of operators in precedence order. (And LA grammars are even worse.)

> I really encourage you all to check out TTsRBParser and TTsASTParser,
> which both use the Morphe2 grammar and output a tree (the former as a
> string, which looks similar to what ParserProto does, and the latter
> generates a tree of ASTNodes).

Just to be clear, ParserProto generates a tree of syntax nodes, and then uses this tree to construct a string if that's what's wanted. (I'm not yet happy with the format of that string -- it's only for debugging purposes, but we should still make it as clear and readable as we can.)

> Sure, if you want to learn how parsing works, do it by hand; but once
> you want to keep up with a language as complex as RB, with its
> often-changing syntax, you'll be better off with a grammar you can
> adjust quickly to cope with the changes.

I used to think so too -- indeed, I wrote the current RB compiler using Bison. But a few years later, after talking to some other experienced compiler engineers and learning more about real-world compilers (including very well-known ones like gcc) that use hand-coded parsers, I'm starting to come around to the other side.

Parser generators are great for use in a classroom setting, or for research, where parsing is just an annoying step to get out of the way before moving on to the single specific problem you're trying to address. But in a real development environment -- especially a highly integrated one -- parsing is an important part of almost everything you do. It's used for editing, refactoring, error reporting, incremental compiling, context-sensitive help, and many other things. Actually writing the parser by hand is not that hard -- it's a small task compared to the rest of the code that uses it -- and doing it by hand enables a lot of things that are hard to do with auto-generated parsers.
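Coming back to the items/subnodes question above for a moment: here is roughly what I mean by pushing the references down into the subclasses. This is just an illustrative C++ sketch with made-up class names, not the actual ParserProto node code:

    #include <memory>
    #include <string>

    struct SyntaxNode {
        virtual ~SyntaxNode() = default;
    };

    // Instead of a shared subnodes array on the base class, each subclass
    // declares exactly the children it can have, by name. A wrong child count
    // or a misplaced child then can't slip through as a silently odd tree,
    // and readers of the tree get node.lhs / node.rhs instead of items(0) /
    // items(1) plus a cast.
    struct BinaryOpNode : SyntaxNode {
        std::string op;
        std::unique_ptr<SyntaxNode> lhs, rhs;       // exactly two operands
    };

    struct IfNode : SyntaxNode {
        std::unique_ptr<SyntaxNode> condition;
        std::unique_ptr<SyntaxNode> thenBlock;
        std::unique_ptr<SyntaxNode> elseBlock;      // null when there is no Else part
    };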
One example of what hand-coding enables: consider parsing of code fragments. Much editor support needs to parse lines (and even incomplete lines), rather than entire programs. Refactoring tools need to parse blocks of lines. An expression evaluator (like the one in RB's properties pane) needs to parse expressions. An auto-generated parser can't generally handle this, because the whole grammar is built around generating a "top" or root nonterminal. Any input that doesn't match all the way to the root nonterminal is rejected. There are ways to shoehorn a grammar into allowing specific sorts of fragments (e.g., at the top level you could explicitly allow a statement or an expression to map to the root), but they're crude and can cause other problems. With a hand-written parser, it's no big deal -- you can call into the parser at any point, for example at the "ParseExpr" method if you want to parse just an expression. (Or better yet, keep things neat and clean by dividing the parser into a chain of subclasses, and then instantiate the class that reflects how much of the grammar you need.)

As another example, consider what happens when a quirk is discovered in your grammar and code isn't being parsed as you expected (and yes, this DOES happen!). With a hand-written parser, it's a pretty easy job to step through the code and see what it's doing. With an auto-generated one on a nontrivial grammar, stepping through the code is meaningless, and you're reduced to fiddling with the grammar file and simulating it in your head to try to understand what's happened. (There are ways to examine the stack as it works and such, but these are of only limited help.)

> Let's not keep re-inventing the wheel here but get some _solutions_
> done! If you see reasons why Morphe2 is not suitable, well, then let's
> hear them, but keep in mind that Jonathan wrote Morphe2 especially for
> writing an RB language parser, so it should meet our needs by design.
> And if we should ever get him to join this group (where is he? he has
> not commented on anything yet), he could maybe give us some guidance
> on this as well.

I think he's in the group -- Jon, you there?

I respect your frustration, but this is a key decision and well worth taking carefully. Are you coming to REAL World next week? If so, maybe we could discuss it more there.

Best,
- Joe
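P.S. Here's the kind of thing I mean by calling into the parser at any point. This is purely illustrative C++ -- ParserProto itself is in RB, and the toy grammar below (assignments plus "+" and "*" expressions) is nothing like RB's -- but it shows how a hand-written recursive-descent parser is just a set of ordinary methods, so each tool picks its own entry point instead of being forced through a single root nonterminal:

    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    struct Node {
        std::string text;
        std::vector<std::unique_ptr<Node>> kids;
    };

    class Parser {
    public:
        explicit Parser(std::vector<std::string> t) : toks(std::move(t)) {}

        // Each nonterminal is a public method, so callers pick their own "root".
        std::unique_ptr<Node> parseExpression() { return parseSum(); }

        std::unique_ptr<Node> parseStatement() {        // ident "=" expression
            auto n = leaf("assign");
            n->kids.push_back(leaf(nextTok()));         // assignment target
            nextTok();                                  // skip the "="
            n->kids.push_back(parseExpression());
            return n;
        }

    private:
        std::vector<std::string> toks;
        std::size_t pos = 0;
        std::string peekTok() const { return pos < toks.size() ? toks[pos] : ""; }
        std::string nextTok() { return pos < toks.size() ? toks[pos++] : ""; }
        static std::unique_ptr<Node> leaf(const std::string &s) {
            auto n = std::make_unique<Node>();
            n->text = s;
            return n;
        }
        std::unique_ptr<Node> parseSum() {              // product ("+" product)*
            auto lhs = parseProduct();
            while (peekTok() == "+") {
                auto n = leaf(nextTok());
                n->kids.push_back(std::move(lhs));
                n->kids.push_back(parseProduct());
                lhs = std::move(n);
            }
            return lhs;
        }
        std::unique_ptr<Node> parseProduct() {          // primary ("*" primary)*
            auto lhs = leaf(nextTok());
            while (peekTok() == "*") {
                auto n = leaf(nextTok());
                n->kids.push_back(std::move(lhs));
                n->kids.push_back(leaf(nextTok()));
                lhs = std::move(n);
            }
            return lhs;
        }
    };

    int main() {
        // A properties-pane style evaluator can parse a bare expression...
        Parser exprParser({"a", "+", "2", "*", "b"});
        auto e = exprParser.parseExpression();

        // ...while an editor parses a single statement, with the same class.
        Parser stmtParser({"x", "=", "1", "+", "y"});
        auto s = stmtParser.parseStatement();

        std::cout << "expression root: " << e->text            // "+"
                  << ", statement root: " << s->text << "\n";  // "assign"
    }

The properties-pane case parses a bare expression and the editor case parses a single statement, both with the same parser class; a whole-file parse would simply be one more public method that loops over parseStatement.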
From: Thomas T. <tt...@te...> - 2008-03-13 23:40:23
On 13.03.2008 23:54 Uhr, "Joe Strout" <jo...@st...> wrote:

> Spelling out all the rules to embody precedence in an LR grammar is a
> lot of work, and easy to screw up, in my experience.

It's already done in the grammar I included. It's not that complicated.

> One example of what hand-coding enables: consider parsing of code
> fragments.

Good point. Hadn't thought of this yet.

> Are you coming to REAL World next week? If so, maybe we could discuss
> it more there.

Yes, I'm looking forward to it. I expect to have my RB editor in a usable state by then, too.

Thomas