Thread: [RBMetacode] very interesting parsing discussion in an LLVM tutorial
From: Joe S. <jo...@st...> - 2008-03-11 14:37:14
I ran across this very interesting page today:

<http://www.llvm.org/docs/tutorial/LangImpl2.html>

This is part of a tutorial for LLVM, in which you implement an entire (simple) language in about an hour or so. Most of the heavy lifting is done by LLVM itself, which is a topic for another time (LLVM is an open-source compiler infrastructure that is very impressive). But this chapter of the tutorial has some especially useful nuggets for us -- in particular, the operator-precedence parsing they use to handle binary operators. That would be more efficient than the approach currently used in the ParserProto (not sure how it compares to what Morphe2 generates).

I plan to go through this whole tutorial sometime soon (perhaps while travelling next week, if I can get the LLVM code to work on my ancient PowerBook). But I'd recommend at least this chapter to anyone here who's interested in parsing RB code.

Best,
- Joe

--
Joe Strout
Inspiring Applications, Inc.
http://www.InspiringApps.com
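P.S. For anyone who'd rather not wade through the C++ tutorial right away, here is roughly the shape of the trick as I understand it. This is only a minimal sketch in C++ -- the toy token stream, the Expr node, and all the names are my own illustration, not code from the tutorial or from ParserProto:

    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    struct Expr {                                   // tiny AST node for the sketch
        std::string text;                           // operator symbol or literal
        std::unique_ptr<Expr> lhs, rhs;
    };

    static std::vector<std::string> toks;           // pre-lexed token stream
    static std::size_t pos = 0;
    static std::string peekTok() { return pos < toks.size() ? toks[pos] : ""; }
    static std::string nextTok() { return pos < toks.size() ? toks[pos++] : ""; }

    // One precedence table replaces a grammar rule per level (term, product, ...).
    static std::map<std::string, int> precTable = {
        {"or", 10}, {"and", 20}, {"+", 30}, {"-", 30}, {"*", 40}, {"/", 40}};

    static int precOf(const std::string &t) {
        auto it = precTable.find(t);
        return it == precTable.end() ? -1 : it->second;
    }

    static std::unique_ptr<Expr> parsePrimary() {   // literals/identifiers only here
        auto e = std::make_unique<Expr>();
        e->text = nextTok();
        return e;
    }

    // Keep absorbing operators that bind at least as tightly as minPrec.
    static std::unique_ptr<Expr> parseBinOpRHS(int minPrec, std::unique_ptr<Expr> lhs) {
        for (;;) {
            int p = precOf(peekTok());
            if (p < minPrec) return lhs;            // not an operator, or binds too loosely
            std::string op = nextTok();
            auto rhs = parsePrimary();
            if (precOf(peekTok()) > p)              // next operator binds tighter: recurse
                rhs = parseBinOpRHS(p + 1, std::move(rhs));
            auto node = std::make_unique<Expr>();
            node->text = op;
            node->lhs = std::move(lhs);
            node->rhs = std::move(rhs);
            lhs = std::move(node);
        }
    }

    static void dump(const Expr *e) {               // prints the tree LISP-style
        if (!e->lhs) { std::cout << e->text; return; }
        std::cout << "(" << e->text << " ";
        dump(e->lhs.get());
        std::cout << " ";
        dump(e->rhs.get());
        std::cout << ")";
    }

    int main() {
        toks = {"1", "+", "x", "*", "2", "or", "y"};
        dump(parseBinOpRHS(0, parsePrimary()).get());  // prints (or (+ 1 (* x 2)) y)
        std::cout << "\n";
    }

The single table plus one loop replaces a separate routine (or grammar rule) per precedence level, so adding or re-ranking an operator is just a table change.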
From: Thomas T. <tt...@te...> - 2008-03-13 08:36:43
On 11.03.2008 15:36 Uhr, "Joe Strout" <jo...@st...> wrote:

> I ran across this very interesting page today:
>
> <http://www.llvm.org/docs/tutorial/LangImpl2.html>

Thanks. Good reading on some principles -- e.g. it explains what the outcome and purpose of an AST should be, and how to design the AST classes. I see that Jonathan's and my attempt at the AST with Morphe2 comes pretty close to this design, which is a relief to me :) (After all, Jonathan did his homework on parser generator design; this was already his second version, rebuilt from scratch, and it got another overhaul once we tried it with the RB grammar.)

If you check out "TTsASTParser" from our "RBMetacode" svn repository at sourceforge.net, you'll find a set of classes in there for the AST. There is an "Expression" class, which then has direct subclasses such as "Operator" and "LiteralTerm". The "Operator" has subclasses for unary and binary operators, and so on. The classes are still missing their data and constructor members, though (that's basically why this project isn't finished yet).

All AST classes are derived from "ASTNode". This base class has an "items" array in which the "arguments" of any subclass are assembled. E.g., an expression such as "1 + x" will consist of a "TermExpr" (for operators such as "+", "-", "or") with two array items: an "IntegerLiteral" and, uh... whoops, I see that an "Identifier" node class (as a subclass of "Expression") is still missing. Anyhow, I hope you get the basic idea. To store the direct item referenced by an ASTNode subclass, maybe a simple Variant would do, which then stores the current token (i.e. the TermExpr would store the "+", while the IntegerLiteral would store the "1" from the example above). That way, not every subclass needs its own constructor and accessors for this (too much work to type them all...).

So, yes, this is a good article, and if you read it, you should have no problem understanding how Morphe2 and TTsASTParser should work.

> this chapter of the tutorial has some especially useful nuggets for
> us -- in particular, the operator-precedence parsing they use to
> handle binary operators. That would be more efficient than the
> approach currently used in the ParserProto (not sure how it compares
> to what Morphe2 generates).

The article, under "Binary Expression Parsing", suggests using a precedence table for all binary operators and grouping them by it. So, instead of using special handlers for terms (plus, minus, or) and products (mul, div, and) and so on, it uses one central binary operator handler that deals with them all based on their value from the precedence table.

While this makes sense in a hand-coded parser, as it keeps the code leaner, with Morphe2, which interprets a grammar and converts it into program code, the programmer (us) usually works with the grammar, not with the parser code. When you have the grammar in front of you, it's IMO easier to write it in the classic way, where you spell out the rules for each precedence level (i.e. term, product, etc.), because using the precedence table in the grammar makes it less readable.

I really encourage you all to check out TTsRBParser and TTsASTParser, which both use the Morphe2 grammar and output a tree (the former as a string, which looks similar to what ParserProto does, and the latter generates a tree of ASTNodes).
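To make the class layout above a bit more concrete, here is a rough sketch of the hierarchy. I've written it as C++ for brevity (the real classes are RB, of course), and the "token" member is only my guess at the still-missing data members, so don't take the exact names and nesting as gospel:

    #include <memory>
    #include <string>
    #include <vector>

    // Base class: every node keeps its sub-nodes ("arguments") in a generic
    // items array, plus one Variant-like slot for the token it directly
    // represents.
    struct ASTNode {
        std::string token;                              // e.g. "+", "1", "x"
        std::vector<std::unique_ptr<ASTNode>> items;    // the node's "arguments"
        virtual ~ASTNode() = default;
    };

    struct Expression : ASTNode {};

    struct LiteralTerm    : Expression {};
    struct IntegerLiteral : LiteralTerm {};
    struct Identifier     : Expression {};              // the subclass noted as missing above

    struct Operator       : Expression {};
    struct UnaryOperator  : Operator {};
    struct BinaryOperator : Operator {};
    struct TermExpr       : BinaryOperator {};          // "+", "-", "or"
    struct ProductExpr    : BinaryOperator {};          // "*", "/", "and"

    // "1 + x" then becomes a TermExpr holding "+", with two items:
    std::unique_ptr<ASTNode> onePlusX() {
        auto plus = std::make_unique<TermExpr>();
        plus->token = "+";
        auto one = std::make_unique<IntegerLiteral>();
        one->token = "1";
        auto x = std::make_unique<Identifier>();
        x->token = "x";
        plus->items.push_back(std::move(one));
        plus->items.push_back(std::move(x));
        return plus;
    }

The nice part of the generic items array plus the single Variant-like slot is that most of the subclasses can stay empty shells, which is exactly why I'd like to avoid writing constructors and accessors for each of them.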
Sure, if you want to learn how parsing works, do it by hand; but once you want to keep up with a language as complex as RB, with its often-changing syntax, you'll be better off with a grammar you can adjust quickly to cope with the changes.

Let's not keep re-inventing the wheel here but get some _solutions_ done! If you see reasons why Morphe2 is not suitable, well, then let's hear them, but keep in mind that Jonathan wrote Morphe2 especially for writing an RB language parser, so it should meet our needs by design. And if we should ever get him to join this group (where is he? he has not commented on anything yet), he could maybe give us some guidance on this as well.

Thomas
From: Joe S. <jo...@st...> - 2008-03-13 22:54:57
On Mar 13, 2008, at 2:36 AM, Thomas Tempelmann wrote:

> I see that Jonathan's and my attempt at the AST with Morphe2 comes
> pretty close to this design, which is a relief to me :)

Yes, it's a good design.

> All AST classes are derived from "ASTNode". This base class has an
> "items" array in which the "arguments" of any subclass are assembled.

I started with such a generic "items" (or "subnodes", as I called it) array in my syntax node design too, but scrapped it in favor of something more specific to each subclass that needs it. It's not a major issue, but I do feel that pushing these references down into the subclasses will likely reduce errors and slightly improve performance. (I'll sketch what I mean further down.)

> The article, under "Binary Expression Parsing", suggests using a
> precedence table for all binary operators and grouping them by it...
>
> While this makes sense in a hand-coded parser, as it keeps the code
> leaner, with Morphe2, which interprets a grammar and converts it into
> program code, the programmer (us) usually works with the grammar, not
> with the parser code. When you have the grammar in front of you, it's
> IMO easier to write it in the classic way, where you spell out the
> rules for each precedence level (i.e. term, product, etc.), because
> using the precedence table in the grammar makes it less readable.

I'm not sure I'd agree with that. Spelling out all the rules to embody precedence in an LR grammar is a lot of work, and easy to screw up, in my experience. It's also a nuisance when you add a new operator (or need to change an existing one), as it requires changes in several places; or when somebody asks for a simple list of operators in precedence order. (And LA grammars are even worse.)

> I really encourage you all to check out TTsRBParser and TTsASTParser,
> which both use the Morphe2 grammar and output a tree (the former as a
> string, which looks similar to what ParserProto does, and the latter
> generates a tree of ASTNodes).

Just to be clear, ParserProto generates a tree of syntax nodes, and then uses this tree to construct a string if that's what's wanted. (I'm not yet happy with the format of that string -- it's only for debugging purposes, but we should still make it as clear and readable as we can.)

> Sure, if you want to learn how parsing works, do it by hand; but once
> you want to keep up with a language as complex as RB, with its
> often-changing syntax, you'll be better off with a grammar you can
> adjust quickly to cope with the changes.

I used to think so too -- indeed, I wrote the current RB compiler using Bison. But a few years later, after talking to some other experienced compiler engineers and learning more about real-world compilers (including very well-known ones like gcc) that use hand-coded parsers, I'm starting to come around to the other side.

Parser generators are great for use in a classroom setting, or for research, where parsing is just an annoying step to get out of the way before moving on to the single specific problem you're trying to address. But in a real development environment -- especially a highly integrated one -- parsing is an important part of almost everything you do. It's used for editing, refactoring, error reporting, incremental compiling, context-sensitive help, and many other things. Actually writing the parser by hand is not that hard -- it's a small task compared to the rest of the code that uses it -- and doing it by hand enables a lot of things that are hard to do with auto-generated parsers.
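Coming back to the items/subnodes question above for a moment: here is roughly what I mean by pushing the references down into the subclasses. This is just an illustrative C++ sketch with made-up class names, not the actual ParserProto node code:

    #include <memory>
    #include <string>

    struct SyntaxNode {
        virtual ~SyntaxNode() = default;
    };

    // Instead of a shared subnodes array on the base class, each subclass
    // declares exactly the children it can have, by name. A wrong child count
    // or a misplaced child then can't slip through as a silently odd tree,
    // and readers of the tree get node.lhs / node.rhs instead of items(0) /
    // items(1) plus a cast.
    struct BinaryOpNode : SyntaxNode {
        std::string op;
        std::unique_ptr<SyntaxNode> lhs, rhs;       // exactly two operands
    };

    struct IfNode : SyntaxNode {
        std::unique_ptr<SyntaxNode> condition;
        std::unique_ptr<SyntaxNode> thenBlock;
        std::unique_ptr<SyntaxNode> elseBlock;      // null when there is no Else part
    };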
One example of what hand-coding enables: consider parsing of code fragments. Much editor support needs to parse lines (and even incomplete lines), rather than entire programs. Refactoring tools need to parse blocks of lines. An expression evaluator (like the one in RB's properties pane) needs to parse expressions. An auto-generated parser can't generally handle this, because the whole grammar is built around generating a "top" or root nonterminal. Any input that doesn't match all the way to the root nonterminal is rejected. There are ways to shoehorn a grammar into allowing specific sorts of fragments (e.g., at the top level you could explicitly allow a statement or an expression to map to the root), but they're crude and can cause other problems. With a hand-written parser, it's no big deal -- you can call into the parser at any point, for example at the "ParseExpr" method if you want to parse just an expression. (Or better yet, keep things neat and clean by dividing the parser into a chain of subclasses, and then instantiate the class that reflects how much of the grammar you need.)

As another example, consider what happens when a quirk is discovered in your grammar and code isn't being parsed as you expected (and yes, this DOES happen!). With a hand-written parser, it's a pretty easy job to step through the code and see what it's doing. With an auto-generated one on a nontrivial grammar, stepping through the code is meaningless, and you're reduced to fiddling with the grammar file and simulating it in your head to try to understand what's happened. (There are ways to examine the stack as it works and such, but these are of only limited help.)

> Let's not keep re-inventing the wheel here but get some _solutions_
> done! If you see reasons why Morphe2 is not suitable, well, then let's
> hear them, but keep in mind that Jonathan wrote Morphe2 especially for
> writing an RB language parser, so it should meet our needs by design.
> And if we should ever get him to join this group (where is he? he has
> not commented on anything yet), he could maybe give us some guidance
> on this as well.

I think he's in the group -- Jon, you there?

I respect your frustration, but this is a key decision and well worth taking carefully. Are you coming to REAL World next week? If so, maybe we could discuss it more there.

Best,
- Joe
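P.S. Here's the kind of thing I mean by calling into the parser at any point. This is purely illustrative C++ -- ParserProto itself is in RB, and the toy grammar below (assignments plus "+" and "*" expressions) is nothing like RB's -- but it shows how a hand-written recursive-descent parser is just a set of ordinary methods, so each tool picks its own entry point instead of being forced through a single root nonterminal:

    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    struct Node {
        std::string text;
        std::vector<std::unique_ptr<Node>> kids;
    };

    class Parser {
    public:
        explicit Parser(std::vector<std::string> t) : toks(std::move(t)) {}

        // Each nonterminal is a public method, so callers pick their own "root".
        std::unique_ptr<Node> parseExpression() { return parseSum(); }

        std::unique_ptr<Node> parseStatement() {        // ident "=" expression
            auto n = leaf("assign");
            n->kids.push_back(leaf(nextTok()));         // assignment target
            nextTok();                                  // skip the "="
            n->kids.push_back(parseExpression());
            return n;
        }

    private:
        std::vector<std::string> toks;
        std::size_t pos = 0;
        std::string peekTok() const { return pos < toks.size() ? toks[pos] : ""; }
        std::string nextTok() { return pos < toks.size() ? toks[pos++] : ""; }
        static std::unique_ptr<Node> leaf(const std::string &s) {
            auto n = std::make_unique<Node>();
            n->text = s;
            return n;
        }
        std::unique_ptr<Node> parseSum() {              // product ("+" product)*
            auto lhs = parseProduct();
            while (peekTok() == "+") {
                auto n = leaf(nextTok());
                n->kids.push_back(std::move(lhs));
                n->kids.push_back(parseProduct());
                lhs = std::move(n);
            }
            return lhs;
        }
        std::unique_ptr<Node> parseProduct() {          // primary ("*" primary)*
            auto lhs = leaf(nextTok());
            while (peekTok() == "*") {
                auto n = leaf(nextTok());
                n->kids.push_back(std::move(lhs));
                n->kids.push_back(leaf(nextTok()));
                lhs = std::move(n);
            }
            return lhs;
        }
    };

    int main() {
        // A properties-pane style evaluator can parse a bare expression...
        Parser exprParser({"a", "+", "2", "*", "b"});
        auto e = exprParser.parseExpression();

        // ...while an editor parses a single statement, with the same class.
        Parser stmtParser({"x", "=", "1", "+", "y"});
        auto s = stmtParser.parseStatement();

        std::cout << "expression root: " << e->text            // "+"
                  << ", statement root: " << s->text << "\n";  // "assign"
    }

The properties-pane case parses a bare expression and the editor case parses a single statement, both with the same parser class; a whole-file parse would simply be one more public method that loops over parseStatement.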
From: Thomas T. <tt...@te...> - 2008-03-13 23:40:23
On 13.03.2008 23:54 Uhr, "Joe Strout" <jo...@st...> wrote:

> Spelling out all the rules to embody precedence in an LR grammar is a
> lot of work, and easy to screw up, in my experience.

It's already done in the grammar I included. It's not that complicated.

> One example of what hand-coding enables: consider parsing of code
> fragments.

Good point. Hadn't thought of this yet.

> Are you coming to REAL World next week? If so, maybe we could discuss
> it more there.

Yes, I'm looking forward to it. I expect to have my RB editor in a usable state by then, too.

Thomas