Re: [RBMetacode] very interesting parsing discussion in an LLVM tutorial
From: Joe S. <jo...@st...> - 2008-03-13 22:54:57
On Mar 13, 2008, at 2:36 AM, Thomas Tempelmann wrote:

> I see that Jonathan's and my attempt at the AST with Morphe2 comes pretty
> close to this design, which is a relief to me :)

Yes, it's a good design.

> All AST classes are derived from "ASTNode". This base class has an "items"
> array in which the "arguments" of any subclass are assembled.

I started with such a generic "items" (or "subnodes" as I called it) array
in my syntax node design too, but scrapped it in favor of something more
specific to each subclass that needs it. It's not a major issue, but I do
feel that pushing these references down into the subclasses will likely
reduce errors and slightly improve performance.

> The article, under "Binary Expression Parsing", suggests using a
> precedence table for all binary operators and grouping them by it...
>
> While this makes sense in a hand-coded parser, as it keeps the code
> leaner, with Morphe2, which interprets a grammar and converts it into
> program code, the programmer (us) usually works with the grammar, not
> with the parser code. When you have the grammar in front of you, it's
> IMO easier to write it in the classic way, where you spell out the rules
> for each precedence case (i.e. term, product, etc.), because using the
> precedence table in the grammar makes it less readable.

I'm not sure I'd agree with that. Spelling out all the rules to embody
precedence in an LR grammar is a lot of work, and easy to screw up, in my
experience. It's also a nuisance when you add a new operator (or need to
change an existing one), as it requires changes in several places; or when
somebody asks for a simple list of operators in precedence order. (And LA
grammars are even worse.)

> I really encourage you all to check out the TTsRBParser and TTsASTParser,
> which both use the Morphe2 grammar and output a tree (the former as a
> string, which looks similar to what ParserProto does, and the latter
> generates a tree of ASTNodes).

Just to be clear, ParserProto generates a tree of syntax nodes, and then
uses this tree to construct a string if that's what's wanted. (I'm not yet
happy with the format of that string -- it's only for debugging purposes,
but we should still make it as clear and readable as we can.)

> Sure, if you want to learn how parsing works, do it by hand, but once you
> want to keep up with a language as complex as RB, with its often-changing
> syntax, you'll be better off with a grammar you can adjust quickly to
> cope with the changes.

I used to think so too -- indeed, I wrote the current RB compiler using
Bison. But a few years later, after talking to some other experienced
compiler engineers and learning more about real-world compilers (including
very well-known ones like gcc), which use hand-coded parsers, I'm starting
to come around to the other side.

Parser generators are great in a classroom setting, or for research, where
parsing is just an annoying step to get out of the way before moving on to
a single specific problem you're trying to address. But in a real
development environment -- especially a highly integrated one -- parsing
is an important part of almost everything you do. It's used for editing,
refactoring, error reporting, incremental compiling, context-sensitive
help, and many other things. Actually writing the parser by hand is not
that hard -- a small task compared to the rest of the code that uses it --
and doing it by hand enables a lot of things that are hard to do with
auto-generated parsers.

For example: consider parsing of code fragments. Much editor support needs
to parse lines (and even incomplete lines) rather than entire programs.
Refactoring tools need to parse blocks of lines. An expression evaluator
(like the one in RB's properties pane) needs to parse expressions. An
auto-generated parser generally can't handle this, because the whole
grammar is built around a single "top" or root nonterminal: any input that
doesn't match all the way up to that root is rejected. There are ways to
shoehorn a grammar into allowing specific sorts of fragments (e.g., at the
top level you could explicitly allow a statement or an expression to map
to the root), but they're crude and can cause other problems. With a
hand-written parser, it's no big deal: you can call into the parser at any
point -- for example, call the "ParseExpr" method if you just want to
parse an expression. (Or better yet, keep things neat and clean by
dividing the parser into a chain of subclasses, and then instantiate the
class that reflects how much of the grammar you need.)
As another example, consider what happens when a quirk is discovered in
your grammar and code isn't being parsed as you expected (and yes, this
DOES happen!). With a hand-written parser, it's a pretty easy job to step
through the code and see what it's doing. With an auto-generated one on a
nontrivial grammar, stepping through the code is meaningless, and you're
reduced to fiddling with the grammar file and simulating it in your head
to try to understand what happened. (There are ways to examine the stack
as it works and such, but these are of only limited help.)

> Let's not keep re-inventing the wheel here but get some _solutions_
> done! If you see reasons why Morphe2 is not suitable, well, then let's
> hear them, but keep in mind that Jonathan wrote Morphe2 especially to
> write RB language parsers, so it should meet the needs by design. And if
> we should ever get him to join this group (where is he? he has not
> commented on anything yet), he could maybe give us some guidance on this
> as well.

I think he's in the group -- Jon, you there? I respect your frustration,
but this is a key decision and well worth taking carefully. Are you coming
to REAL World next week? If so, maybe we could discuss it more there.

Best,
- Joe
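P.S. Since the precedence table came up above: here's roughly what
table-driven binary-expression parsing looks like in a hand-written parser.
Again just a sketch (made-up names, a made-up operator set, C++ rather than
RB, and not the tutorial's code), but it shows why I like this style:
adding or re-ranking an operator is a one-line change to the table, not a
new grammar rule per precedence level.

// Sketch only: precedence-climbing expression parsing driven by a table.
// All of the precedence information lives in kPrecedence; the parsing
// logic never changes when operators are added or re-ranked.
#include <iostream>
#include <map>
#include <string>
#include <vector>

static const std::map<std::string, int> kPrecedence = {
    {"or", 1}, {"and", 2},
    {"=", 3},  {"<", 3}, {">", 3},
    {"+", 4},  {"-", 4},
    {"*", 5},  {"/", 5},
};

struct Parser {
    std::vector<std::string> toks;   // pre-split tokens, to keep this short
    std::size_t pos = 0;

    std::string Peek() { return pos < toks.size() ? toks[pos] : ""; }

    // Returns a parenthesized string rather than building nodes; the shape
    // of the result is what matters for the sketch.
    std::string ParsePrimary() { return toks[pos++]; }

    std::string ParseBinOpRHS(int minPrec, std::string lhs) {
        while (true) {
            auto op = kPrecedence.find(Peek());
            if (op == kPrecedence.end() || op->second < minPrec)
                return lhs;                     // no operator strong enough
            std::string opText = toks[pos++];
            std::string rhs = ParsePrimary();
            // If the next operator binds tighter, let it claim rhs first.
            auto next = kPrecedence.find(Peek());
            if (next != kPrecedence.end() && next->second > op->second)
                rhs = ParseBinOpRHS(op->second + 1, rhs);
            lhs = "(" + lhs + " " + opText + " " + rhs + ")";
        }
    }

    std::string ParseExpression() { return ParseBinOpRHS(1, ParsePrimary()); }
};

int main() {
    Parser p;
    p.toks = {"a", "+", "b", "*", "c", "<", "d"};
    std::cout << p.ParseExpression() << std::endl;   // ((a + (b * c)) < d)
    return 0;
}

(A real parser would build nodes and handle errors, of course; the table is
the point.)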