rbmetacode-list Mailing List for RBMetaCode
Status: Planning
Archive for 2008: Feb (28 messages), Mar (13 messages)
From: Thomas T. <tt...@te...> - 2008-03-14 18:11:42
On 14.03.2008 16:11 Uhr, "Joe Strout" <jo...@st...> wrote:

> If I understand you properly, I think the find/replace functionality
> should be a visitor

Ah, thanks. This concept was new to me; I had something like it in mind, but not as sophisticated. I'm now reading up on this design pattern to see what I can make of it.

Thomas
From: Joe S. <jo...@st...> - 2008-03-14 15:12:00
On Mar 14, 2008, at 8:38 AM, Thomas Tempelmann wrote:

> Now I'd like to add Find & Replace functionality and I am looking for
> suggestions on how to do this best.
>
> To search, I need to iterate over all items. Currently, not every text
> item of interest has its own node in the tree. Instead, as explained
> above, the name/label is a property of any node's base class, and some
> classes have accessor methods for reading and setting source,
> declaration and notes.
>
> To iterate over these, I'd currently have to iterate over the nodes,
> then add special handling for the special properties. But how would
> the iterator remember where it's at? If I code it so that it uses a
> callback (delegate) function to handle the finding, it would
> implicitly remember where it's at, but what if the callback function
> decides to alter the tree, so that the iteration loop is now working
> on obsolete items?
>
> Therefore, I believe it's better to have a "smarter" iterator that one
> calls from one's own loop, getting the next found item, processing
> that item, then proceeding in the loop to get the next item from the
> iterator. But then, the iterator has to know where it's at. See the
> problem?

If I understand you properly, I think the find/replace functionality should be a visitor (see SyntaxVisitor and its subclasses in ParserProto for an example). And to avoid the problem of the tree changing while you're traversing it, do it in two stages: first, gather a list of all the matches; and second (if the user wants to replace all without seeing them first), do the replacements.

HTH,
- Joe
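To make the two-stage suggestion concrete, here is a rough REALbasic-style sketch. Only SyntaxVisitor comes from ParserProto; FindVisitor, SyntaxNode.Name and the other members are invented names for illustration:

    // Stage 1: a visitor that only collects matches and never mutates
    // the tree while it is being traversed.
    Class FindVisitor // Inherits SyntaxVisitor
      SearchText As String
      Matches() As SyntaxNode

      Sub VisitNode(node As SyntaxNode)
        If InStr(node.Name, SearchText) > 0 Then
          Matches.Append(node)
        End If
      End Sub
    End Class

    // Stage 2: run only after the traversal has finished, so the
    // iteration can no longer be invalidated by the edits.
    Sub ReplaceAllMatches(v As FindVisitor, replacement As String)
      For Each node As SyntaxNode In v.Matches
        node.Name = ReplaceAll(node.Name, v.SearchText, replacement)
      Next
    End Sub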
From: Thomas T. <tt...@te...> - 2008-03-14 14:39:35
So, you know, I'm working on this editor... It can now open an RB file (binary format only so far), loading it into a tree of classes representing the different item types (module, class, control, method, property etc.). Most of these have common properties (a name or label, parent and children), while some have special properties (source code, method declaration, variable declaration, note, icon, external file reference and so on). Saving works as well, of course. The editor provides an interface that looks similar to the RB IDE, where the user can already edit source and declarations.

Now I'd like to add Find & Replace functionality and I am looking for suggestions on how to do this best.

To search, I need to iterate over all items. Currently, not every text item of interest has its own node in the tree. Instead, as explained above, the name/label is a property of any node's base class, and some classes have accessor methods for reading and setting source, declaration and notes.

To iterate over these, I'd currently have to iterate over the nodes, then add special handling for the special properties. But how would the iterator remember where it's at? If I code it so that it uses a callback (delegate) function to handle the finding, it would implicitly remember where it's at, but what if the callback function decides to alter the tree, so that the iteration loop is now working on obsolete items?

Therefore, I believe it's better to have a "smarter" iterator that one calls from one's own loop, getting the next found item, processing that item, then proceeding in the loop to get the next item from the iterator. But then, the iterator has to know where it's at. See the problem?

So, one solution might be to change my tree design so that every item, including the source, declaration and name, is a tree node as well. Then the iterator has to deal only with remembering nodes. Would that be the best solution, or are there other approaches that come to mind?

Thomas
From: Thomas T. <tt...@te...> - 2008-03-13 23:40:23
On 13.03.2008 23:54 Uhr, "Joe Strout" <jo...@st...> wrote:

> Spelling out all the rules to
> embody precedence in an LR grammar is a lot of work, and easy to screw
> up, in my experience.

It's already done in the grammar I included. It's not that complicated.

> For example: consider parsing of code fragments

Good point. I hadn't thought of this yet.

> Are you coming to REAL World next week? If so,
> maybe we could discuss it more there.

Yes, I'm looking forward to it. I expect to have my RB editor in a usable state by then, too.

Thomas
From: Joe S. <jo...@st...> - 2008-03-13 22:54:57
On Mar 13, 2008, at 2:36 AM, Thomas Tempelmann wrote:

> I see that Jonathan's and my attempt at the AST with Morphe2 comes
> pretty close to this design, which is a relief to me :)

Yes, it's a good design.

> All AST classes are derived from "ASTNode". This base class has an
> "items" array in which the "arguments" of any subclass are assembled.

I started with such a generic "items" (or "subnodes", as I called it) array in my syntax node design too, but scrapped it in favor of something more specific to each subclass that needs it. It's not a major issue, but I do feel that pushing these references down into the subclasses will likely reduce errors and slightly improve performance.

> The article, under "Binary Expression Parsing", suggests using a
> precedence table for all binary operators and grouping them by
> this...
>
> While this makes sense in a hand-coded parser, as it keeps the code
> leaner, with Morphe2, which interprets a grammar and converts it into
> program code, the programmer (us) usually works with the grammar, not
> with the parser code. When you have the grammar in front of you, it's
> IMO easier to write it in the classic way, where you spell out the
> rules for each precedence case (i.e. term, product, etc.), because
> using the precedence table in the grammar makes it less readable.

I'm not sure I'd agree with that. Spelling out all the rules to embody precedence in an LR grammar is a lot of work, and easy to screw up, in my experience. It's also a nuisance when you add a new operator (or need to change an existing one), as it requires changes in several places; or when somebody asks for a simple list of operators in precedence order. (And LA grammars are even worse.)

> I really encourage you all to check out the TTsRBParser and
> TTsASTParser, which both use the Morphe2 grammar and output a tree
> (the former as a string, which looks similar to what ParserProto
> does, and the latter generates a tree of ASTNodes).

Just to be clear, ParserProto generates a tree of syntax nodes, and then uses this tree to construct a string if that's what's wanted. (I'm not yet happy with the format of that string -- it's only for debugging purposes, but we still should make it as clear and readable as we can.)

> Sure, if you want to learn how parsing works, do it by hand, but once
> you want to keep up with as complex a language as RB, with its
> often-changing syntax, you'll be better off with a grammar you can
> adjust quickly to cope with the changes.

I used to think so too -- indeed, I wrote the current RB compiler using Bison. But a few years later, after talking to some other experienced compiler engineers and learning more about real-world compilers (including very well-known ones like gcc), which use hand-coded parsers, I'm starting to turn to the other side.

Parser generators are great for use in a classroom setting, or for research, where parsing is just an annoying step to get out of the way before moving on to a single specific problem you're trying to address. But in a real development environment -- especially a highly integrated one -- parsing is an important part of almost everything you do. It's used for editing, refactoring, error reporting, incremental compiling, context-sensitive help, and many other things. Actually writing the parser by hand is not that hard -- a small task compared to the rest of the code that uses it -- and doing it by hand enables a lot of things that are hard to do with auto-generated parsers.

For example: consider parsing of code fragments. Much editor support needs to parse lines (and even incomplete lines), rather than entire programs. Refactoring tools need to parse blocks of lines. An expression evaluator (like that in RB's properties pane) needs to parse expressions. An auto-generated parser generally can't handle this, because the whole grammar is built around generating a "top" or root nonterminal. Any input that doesn't match all the way to the root nonterminal is rejected. There are ways to shoehorn a grammar into allowing specific sorts of fragments (e.g., at the top level you could explicitly allow a statement or an expression to map to the root), but they're crude and can cause other problems. With a hand-written parser, it's no big deal -- you can call into the parser at any point, for example at the "ParseExpr" method if you want to parse just an expression. (Or better yet, keep things neat and clean by dividing the parser into a chain of subclasses, and then instantiate the class that reflects how much of the grammar you need.)

As another example, consider what happens when a quirk is discovered in your grammar and code isn't being parsed as you expected (and yes, this DOES happen!). With a hand-written parser, it's a pretty easy job of stepping through the code and seeing what it's doing. With an auto-generated one on a nontrivial grammar, stepping through the code is meaningless, and you're reduced to fiddling with the grammar file and simulating it in your head to try to understand what's happened. (There are ways to examine the stack as it works and such, but these are of only limited help.)

> Let's not keep re-inventing the wheel here but get some _solutions_
> done! If you see reasons why Morphe2 is not suitable, well, then
> let's hear them, but keep in mind that Jonathan wrote Morphe2
> especially to write RB language parsers, so it should meet the needs
> by design. And if we should ever get him to join this group (where is
> he? he has not commented on anything yet), he could maybe give us
> some guidance on this as well.

I think he's in the group -- Jon, you there?

I respect your frustration, but this is a key decision and well worth taking carefully. Are you coming to REAL World next week? If so, maybe we could discuss it more there.

Best,
- Joe
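A quick sketch of what "call into the parser at any point" means for a hand-written recursive-descent parser; the class and method names here are illustrative, not ParserProto's actual API:

    Class Parser
      Function ParseProgram() As SyntaxNode
        // the root nonterminal a generated parser forces all input through
      End Function

      Function ParseStatement() As SyntaxNode
        // one statement -- what an editor needs for a single line
      End Function

      Function ParseExpr() As SyntaxNode
        // one expression -- what a properties-pane evaluator needs
      End Function
    End Class

    // A fragment parser is then just a different entry point:
    //   Dim p As New Parser("x * (y + 1)")
    //   Dim tree As SyntaxNode = p.ParseExpr()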
From: Thomas T. <tt...@te...> - 2008-03-13 08:36:43
On 11.03.2008 15:36 Uhr, "Joe Strout" <jo...@st...> wrote:

> I ran across this very interesting page today:
>
> <http://www.llvm.org/docs/tutorial/LangImpl2.html>

Thanks. Good reading on some principles; e.g., it explains what the outcome and purpose of an AST should be, and how to design the AST classes.

I see that Jonathan's and my attempt at the AST with Morphe2 comes pretty close to this design, which is a relief to me :) (After all, Jonathan did his homework on parser generator design; this was already his second version, rebuilt from scratch, and it got another overhaul once we tried it with the RB grammar.)

If you check out the "TTsASTParser" from our "RBMetacode" svn repository at sourceforge.net, you'll find a set of classes in there for the AST. There is an "Expression" class, which then has direct subclasses such as "Operator" and "LiteralTerm". The "Operator" has subclasses for unary and binary operators, and so on. The classes are still missing their data and constructor members, though (that's basically why this project is not finished yet).

All AST classes are derived from "ASTNode". This base class has an "items" array in which the "arguments" of any subclass are assembled. E.g., an expression such as "1 + x" will consist of a "TermExpr" (for operators such as "+", "-", "or") with two array items: an "IntegerLiteral" and, uh... whoops, I see that an "Identifier" node class is still missing as a subclass of "Expression". Anyhow, I hope you get the basic idea.

To store the direct item referenced by an ASTNode subclass, maybe a simple Variant would do, which then stores the current token (i.e. the TermExpr would store the "+", while the IntegerLiteral would store the "1" from the example above). That way, not every subclass needs its own constructor and accessors for this (too much work to type them all...).

So, yes, this is a good article, and if you read it, you should have no problem understanding how Morphe2 and TTsASTParser should work.

> this chapter of the tutorial in particular has some very useful
> nuggets for us -- in particular, the operator precedence parsing that
> they use to handle binary operators. That would be more efficient
> than the approach currently used in the ParserProto (not sure how it
> compares to what Morphe2 generates).

The article, under "Binary Expression Parsing", suggests using a precedence table for all binary operators and grouping them by this. So, instead of using special handlers for terms (plus, minus, or) and products (mul, div, and) and so on, it uses one central binary operator handler that deals with them all based on their value from the precedence table.

While this makes sense in a hand-coded parser, as it keeps the code leaner, with Morphe2, which interprets a grammar and converts it into program code, the programmer (us) usually works with the grammar, not with the parser code. When you have the grammar in front of you, it's IMO easier to write it in the classic way, where you spell out the rules for each precedence case (i.e. term, product, etc.), because using the precedence table in the grammar makes it less readable.

I really encourage you all to check out the TTsRBParser and TTsASTParser, which both use the Morphe2 grammar and output a tree (the former as a string, which looks similar to what ParserProto does, and the latter generates a tree of ASTNodes).

Sure, if you want to learn how parsing works, do it by hand, but once you want to keep up with as complex a language as RB, with its often-changing syntax, you'll be better off with a grammar you can adjust quickly to cope with the changes.

Let's not keep re-inventing the wheel here but get some _solutions_ done! If you see reasons why Morphe2 is not suitable, well, then let's hear them, but keep in mind that Jonathan wrote Morphe2 especially to write RB language parsers, so it should meet the needs by design. And if we should ever get him to join this group (where is he? he has not commented on anything yet), he could maybe give us some guidance on this as well.

Thomas
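For reference, the ASTNode design described above would look roughly like this; the member names are guesses, only the class names come from TTsASTParser:

    Class ASTNode
      Items() As ASTNode  // the "arguments" of any subclass, e.g. operands
      Value As Variant    // the node's own token, avoiding per-subclass
                          // constructors and accessors
    End Class

    // "1 + x" would then come out roughly as:
    //   TermExpr   (Value = "+")
    //     Items(0) = IntegerLiteral (Value = 1)
    //     Items(1) = Identifier     (Value = "x")  // the class noted above as missing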
From: Joe S. <jo...@st...> - 2008-03-11 14:37:14
I ran across this very interesting page today:

<http://www.llvm.org/docs/tutorial/LangImpl2.html>

This is part of a tutorial for LLVM, in which you implement an entire (simple) language in about an hour or so. Most of the heavy lifting is done by LLVM itself, which is a topic for another time (LLVM is an open-source compiler infrastructure that is very impressive). But this chapter of the tutorial in particular has some very useful nuggets for us -- in particular, the operator precedence parsing that they use to handle binary operators. That would be more efficient than the approach currently used in the ParserProto (not sure how it compares to what Morphe2 generates).

I plan to go through this whole tutorial sometime soon (perhaps while travelling next week, if I can get the LLVM code to work on my ancient Powerbook). But I'd recommend at least this chapter to anyone here who's interested in parsing RB code.

Best,
- Joe

--
Joe Strout
Inspiring Applications, Inc.
http://www.InspiringApps.com
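The technique from that chapter, transcribed into an RB-style sketch; PeekToken, ReadToken, IsBinop, GetPrecedence, ParsePrimary and NewBinopNode are assumed helpers, not existing project code:

    // One loop plus a precedence table replaces a separate grammar rule
    // per precedence level (term, product, comparison, ...).
    Function ParseBinopRHS(minPrec As Integer, lhs As SyntaxNode) As SyntaxNode
      While IsBinop(PeekToken()) And GetPrecedence(PeekToken()) >= minPrec
        Dim op As String = ReadToken()
        Dim rhs As SyntaxNode = ParsePrimary()
        // If the next operator binds tighter (e.g. '*' after '+'),
        // let it grab the right-hand side first.
        While IsBinop(PeekToken()) And GetPrecedence(PeekToken()) > GetPrecedence(op)
          rhs = ParseBinopRHS(GetPrecedence(PeekToken()), rhs)
        Wend
        lhs = NewBinopNode(op, lhs, rhs)
      Wend
      Return lhs
    End Function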
From: Thomas T. <tt...@te...> - 2008-03-04 20:24:01
On 04.03.2008 21:16 Uhr, "Joe Strout" <jo...@st...> wrote:

> We need to take care to distinguish between an AST and a CST
> (Concrete Syntax Tree), though. A CST can always be converted into
> an AST, but the reverse is not true

Good point, so we should probably strive to generate a CST from the source parsing, then continue to work with that tree for further operations.

Thomas
From: Joe S. <jo...@st...> - 2008-03-04 20:17:01
On Mar 4, 2008, at 11:55 AM, Thomas Tempelmann wrote:

>> Option "A" is a lot cheaper in such an environment, because you just
>> rerun the mini-parser when an external interface changes.
>
> I'd rather go with Option B, though: It should be faster (parsing the
> original source code is rather slow compared to traversing the
> generated tree and looking for the right types) and it's more
> universal, as we're first building a more general tree which we then
> attribute with extra information.
>
> With Option B, we can also see better what's left (e.g. all the
> identifiers from the RB framework), filter them and recreate a table
> from that to use as a prototype of the RB framework.
>
> What do you and others think?

I'll wait for other opinions, and let this all sink in a bit, before I opine any more on this issue myself.

> In any case, what should be the structure we're generating for the
> tree? Does the one the TTsASTParser tries to generate make sense to
> you? Basically, it uses separate classes for each type of grammar
> node; for example, it has a unique class for each possible operator.
> I think that's a bit of overkill -- I'd rather have a class for each
> of the operator priorities (one for plus/minus/or, one for
> mul/div/and, one for comparisons, one for not and negate, etc.), and
> then add a property naming the particular operator. That way, the
> hierarchy of operators is seen right in the tree.

I agree that a separate class for each operator is overkill. I'm not even convinced you need a separate class for different priorities. The parser needs separate rules for them (well, maybe, depending on what type of parser it is), but the AST shouldn't care about that -- by the time you have an AST, operator precedence no longer matters. It's already been taken care of. So I've been using just a BinopNode to represent all binary operator nodes, and it's been working fine for me. (Actually, that's a bit of an exaggeration; it's only for binary operators within expressions. The assignment operator is a separate node.)

> Similarly, one class for all kinds of loops, maybe, and a subclass
> identifying the different kinds of loops with their parameters. But
> maybe this won't work, because it requires that the code is free of
> errors such as a missing "next". Maybe it's better that the initial
> tree is free of such semantics?

Well, I think you do need nodes that represent various kinds of loops -- and organizing these into a class for all loops, with a subclass for specific kinds of loops, makes sense to me. That way, any code that needs to deal with loops in general can address the superclass, and code that needs to be more specific can deal with the subclasses.

We need to take care to distinguish between an AST and a CST (Concrete Syntax Tree), though. A CST can always be converted into an AST, but the reverse is not true. There are a lot of minor differences, but for example, an AST won't contain any parentheses, since those don't affect the abstract syntax; the guts of the parentheses will simply appear as an expression node directly within whatever node is using that parenthesized expression. But in a CST, the parens are like an operator, represented with their own node in the tree.

I've been generating a CST, and even keeping track of noncoding tokens (i.e. whitespace), since that's actually more general -- you can do pretty much anything with it you could do with an AST, but also more, such as reconstructing the original source. That'll be important for things like refactoring tools, where you want to rearrange the source tree a bit but maintain the user's original syntax as much as possible.

Best,
- Joe
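A sketch of how a CST node might carry those noncoding tokens; only BinopNode is the class name mentioned above, the members are illustrative:

    Class BinopNode // one node class for all binary operators in expressions
      Op As String                // "+", "*", "and", ...
      Left As SyntaxNode
      Right As SyntaxNode
      TrailingTokens() As String  // whitespace/comments after the operator

      Function ReconstructSource() As String
        // each subtree re-emits its own trailing noncoding tokens, so the
        // whole tree can reproduce the original source byte for byte
        Return Left.ReconstructSource() + Op _
          + Join(TrailingTokens, "") + Right.ReconstructSource()
      End Function
    End Class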
From: Thomas T. <tt...@te...> - 2008-03-04 18:56:59
Joe, thanks for explaining.

> Option "A" is a lot cheaper in such an environment, because you just
> rerun the mini-parser when an external interface changes.

I'd rather go with Option B, though: It should be faster (parsing the original source code is rather slow compared to traversing the generated tree and looking for the right types) and it's more universal, as we're first building a more general tree which we then attribute with extra information.

With Option B, we can also see better what's left (e.g. all the identifiers from the RB framework), filter them and recreate a table from that to use as a prototype of the RB framework.

What do you and others think?

In any case, what should be the structure we're generating for the tree? Does the one the TTsASTParser tries to generate make sense to you? Basically, it uses separate classes for each type of grammar node; for example, it has a unique class for each possible operator. I think that's a bit of overkill -- I'd rather have a class for each of the operator priorities (one for plus/minus/or, one for mul/div/and, one for comparisons, one for not and negate, etc.), and then add a property naming the particular operator. That way, the hierarchy of operators is seen right in the tree.

Similarly, one class for all kinds of loops, maybe, and a subclass identifying the different kinds of loops with their parameters. But maybe this won't work, because it requires that the code is free of errors such as a missing "next". Maybe it's better that the initial tree is free of such semantics?

Thomas
From: Joe S. <jo...@st...> - 2008-03-04 18:28:34
On Mar 4, 2008, at 10:47 AM, Thomas Tempelmann wrote:

> Folks, I've checked in Jonathan's Morphe2, which is a parser
> generator that works quite well.
>
> I've made a grammar for it which is not fully tested yet but tries to
> deal with some language quirks of RB in a smarter way (e.g., Jon,
> Mars and Aaron had not understood the concept of a "designator" (this
> is what I call it in the grammar; I remember that compiler designers
> have another word for it, but I forgot), and that's why there's the
> confusion in the compiler about what's an expression, and why one can
> pass a class property "x" to a ByRef parm, but not "me.x").

That's a nice enhancement. And good job digging up "lvalue" for that -- we should try to stick to standard terminology as much as possible, so that when new people come on board with compiler training, they will be able to jump right in and not have to learn our quirky terminology.

> So, I encourage you to look at Morphe2 and the two RB grammars Jon
> and I made for it in early 2006 (inside TTsRBParser and
> TTsASTParser).

I've only done a cursory scan of this so far, but it does look very intriguing. I hope to look at it more deeply soon. Thanks for getting it checked into the repository.

> I must say that I have no experience with ASTs, I only know what I
> need in the end:
>
> A tree that recreates the source in a way where identifier references
> become objects that point to the declaration of such identifiers, so
> that one can modify the identifier's declaration and have it affect
> all references automatically.
>
> That will help understanding all relationships in the source code for
> refactoring, searching etc.

Yes, very true -- though that's not an AST in itself; it's something like an AST plus a symbol table (or set of symbol tables). You actually can't resolve all symbols in one pass over the code, since some things may be references to stuff you haven't parsed yet. So there are a couple of ways to deal with this:

A. Have a separate mini-parser that scans the code, pulling out only the external declarations (things that are accessible from elsewhere, like classes, modules, methods, and properties, but not local variables). This gets you an initial set of symbol tables. Then run a full parser, which builds the AST, referencing the external symbols and also adding local symbol tables as it goes.

...or...

B. Run the full parser right away, but leave symbol references unresolved (or at least most of them -- I guess you could resolve local ones). This generates an AST, and also gathers up all the external references. It could in fact generate a complete set of symbol tables by the time it's done, but most of the symbol references in the AST would remain unresolved. Then you go back in a separate pass and resolve the symbol references, either as its own step, or as part of doing whatever you do next (e.g., compiling the AST).

In our case, we're not really writing a compiler; we're writing code analysis tools. For many of those, having unresolved symbols in the AST isn't terribly useful. Also, if you're making something like an editor, redoing a full parse of the whole source base whenever you think an external reference may have changed is pretty expensive. Option "A" is a lot cheaper in such an environment, because you just rerun the mini-parser when an external interface changes.

So, I think to move forward we would need to do something like the following:

1. Define a symbol table data structure -- something to hold symbol declarations and represent scope (all the way from global to local, and everything in between).

2. Make a mini-parser that can scan a project and pull out all nonlocal symbols.

3. Make a full parser that takes the set of nonlocal symbol tables along with a hunk of code (which could be the whole program again, or maybe just one method). This should generate an AST (or a concrete ST, depending on what you intend to do with it), and tie each identifier to a specific entry in some symbol table (or else tell you that it's undefined).

In an editor, we'd only do step 3 on whatever method the user is editing (most of the time). That'll tell us where any identifier is declared, what its type is, and what its scope is, which is the sort of thing we need to do syntax coloring and refactoring and so on.

Best,
- Joe
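One possible shape for the symbol table in step 1 -- a table per scope, chained to its parent so lookup walks from local out to global. All names here are hypothetical:

    Class SymbolTable
      Parent As SymbolTable  // Nil for the global scope
      Entries As Dictionary  // lowercased identifier -> declaration node

      Function Lookup(name As String) As ASTNode
        // RB identifiers are case-insensitive, hence the Lowercase
        Dim key As String = Lowercase(name)
        If Entries.HasKey(key) Then Return Entries.Value(key)
        If Parent <> Nil Then Return Parent.Lookup(name)
        Return Nil  // undefined -- exactly what step 3 needs to report
      End Function
    End Class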
From: Thomas T. <tt...@te...> - 2008-03-04 17:52:25
On 04.03.2008 18:47 Uhr, "Thomas Tempelmann" <tt...@te...> wrote:

> the concept of a "designator" (this is what I call it in the grammar;
> I remember that compiler designers have another word for it, but I
> forgot)

I just remembered: L-Value! See:

http://en.wikipedia.org/wiki/Value_(computer_science)
From: Thomas T. <tt...@te...> - 2008-03-04 17:48:41
Folks, I've checked in Jonathan's Morphe2, which is a parser generator that works quite well.

I've made a grammar for it which is not fully tested yet but tries to deal with some language quirks of RB in a smarter way (e.g., Jon, Mars and Aaron had not understood the concept of a "designator" (this is what I call it in the grammar; I remember that compiler designers have another word for it, but I forgot), and that's why there's the confusion in the compiler about what's an expression, and why one can pass a class property "x" to a ByRef parm, but not "me.x").

Anyhow. I am not a good designer either, as many can see when they look at my rather convoluted code to read project files (my "RB Prj Tools") versus the lean code by Seth or by Joe Ranieri. I'm more of a "detail" person :)

So, I encourage you to look at Morphe2 and the two RB grammars Jon and I made for it in early 2006 (inside TTsRBParser and TTsASTParser).

TTsRBParser was my first attempt to use Morphe2 to simply parse RB source code. It generates a rather symbolic "tree". If you run the project, you'll see it printed in the log window.

TTsASTParser was then based on Jon's attempt to create an AST. IIRC, I tried to modify his grammar based on my own understanding of how the grammar should look. The actions in the grammar that create all those classes are basically from Jon; I just made some modifications to their hierarchy etc.

I must say that I have no experience with ASTs, I only know what I need in the end:

A tree that recreates the source in a way where identifier references become objects that point to the declaration of such identifiers, so that one can modify the identifier's declaration and have it affect all references automatically.

That will help understanding all relationships in the source code for refactoring, searching etc.

I have my doubts that the current attempt in TTsASTParser is the right way to do that. Can someone have a look at it and comment on this?

Since you others need a tree as well, what are your goals? Do you agree with the requirements I defined above? How do you plan to go about it?

Thomas
From: Joe S. <jo...@st...> - 2008-02-29 21:55:17
On Feb 29, 2008, at 1:18 PM, Thomas Tempelmann wrote:

> On 22.02.2008 0:17 Uhr, "Joe Strout" <jo...@st...> wrote:
>
>> I've created a simple folder hierarchy in subversion, with "main"
>> and "prototype" at the root level. Under prototype, I've checked in
>> Seth's work for BKeeney Software.
>
> Assuming one of us would update the Lexer, would we update it in the
> main dir only, or should we update the prototype folder as well?
>
> I guess I do not understand the idea of the prototype folder yet.

Well, the idea behind the prototype folder is a place to store things that may or may not turn out to be generally useful enough to be incorporated into the main project. We may have several different lexers, parsers, etc. hanging out under prototypes, and we make no claims about them (other than that they're available under the MIT License).

The main project, as I see it, is where we collect the code that we have generally found to be most useful for the project aims (i.e. RB code analysis) -- ideally, just one lexer, one parser, one project reader, and so on, and that's the code we try to keep up to date and always working with the latest version of RB and so on.

But the prototype folder could also serve as a place for side projects -- things which are actively maintained by one or more members of the group, but not by the group as a whole. So, for example, if Bob and Seth want to continue updating their BKeeneyProjReader project and making those updates available via the SourceForge project, that's fine too.

I realize this is a bit nebulous, since our main project isn't really a product in itself, but just a collection of code that would be useful for making a variety of projects. But it's the best distinction I could come up with. I'm open to suggestions if anybody has them.

Best,
- Joe
From: Thomas T. <tt...@te...> - 2008-02-29 20:19:12
On 22.02.2008 0:17 Uhr, "Joe Strout" <jo...@st...> wrote:

> I've created a simple folder hierarchy in subversion, with "main" and
> "prototype" at the root level. Under prototype, I've checked in
> Seth's work for BKeeney Software.

Assuming one of us would update the Lexer, would we update it in the main dir only, or should we update the prototype folder as well?

I guess I do not understand the idea of the prototype folder yet.
From: Joe S. <jo...@st...> - 2008-02-29 16:43:42
There is a really cool graph visualization tool that's commonly used in compiler work. It's cross-platform, though in my opinion the Mac version is especially cool (it's won Apple Design Awards). Get it here:

<http://www.pixelglow.com/graphviz/>

I'm planning to add a syntax tree visitor that outputs a description of the tree in a format that GraphViz can display. (My wife does this in the compiler courses she teaches, and it's very handy.) So please check it out when you get the chance.

Best,
- Joe

--
Joe Strout
Inspiring Applications, Inc.
http://www.InspiringApps.com
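A tree-to-DOT dump of the kind described is only a few lines. A hedged sketch, assuming SyntaxNode has a Children array and a Text label (invented member names):

    Function DotNode(n As SyntaxNode, ByRef nextId As Integer) As String
      Dim myId As Integer = nextId
      nextId = nextId + 1
      Dim s As String = "  n" + Str(myId) + " [label=""" + n.Text + """];" + EndOfLine
      For i As Integer = 0 To UBound(n.Children)
        Dim childId As Integer = nextId
        s = s + DotNode(n.Children(i), nextId)
        s = s + "  n" + Str(myId) + " -> n" + Str(childId) + ";" + EndOfLine
      Next
      Return s
    End Function

    // Wrap the result in "digraph AST {" ... "}" and feed it to the
    // dot tool, e.g.:  dot -Tpng tree.dot -o tree.png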
From: Joe S. <jo...@st...> - 2008-02-28 21:43:22
I thought perhaps the parser issue was too big and slippery to attack on a purely intellectual level, so I've decided to try making a small parser so we can start getting some experience with it. I should say that I know Thomas and Jon (and perhaps others?) have already done a lot of work in this direction using Jon's nifty parser generator, and that's still an option we should explore. But I want to also explore the hand-written approach, not least because I have an experienced compiler engineer whom I trust advising me to do so.

So, if you do a svn update, you'll find a "ParserProto" project. It handles this grammar:

   Exp:    Term
   Term:   Term ('+' Factor)* | Factor
   Factor: Factor ('*' Atom)* | Atom
   Atom:   <identifier> | <number>

So it can handle things like 2+3*4. It parses this into a concrete syntax tree (which is to say, if we were handling parentheses, they would actually be represented in the tree rather than abstracted away right away).

There is just one SyntaxNode class, though I imagine we'd probably want a hierarchy of classes if we got more serious about it. Each node corresponds to a coding token, pretty much, as well as a series of noncoding tokens (i.e. whitespace) that may follow it. For inspection purposes, these nodes can print themselves out with or without the noncoding tokens, or can try to reconstruct the original source.

This all seems to be working fine, except for any leading whitespace -- I haven't found a good place to put that. (I also didn't have much luck trying to capture a trailing comment; I think to handle both of these, we need a StatementNode subclass that can keep track of leading and trailing noncoding tokens separately.)

Big-picture-wise, I think we should take a simple subset of the grammar like this, and see how far we can take it as far as error handling, relating any node in the tree to a range of text in the original input, dealing with code fragments, etc. Actually, the current grammar is probably TOO simple; we should add line continuation, a couple more operators, and maybe function calls too. Then maybe we take that grammar subset, and try it out several different ways if necessary to convince ourselves that we've found the best approach for our purposes -- and then push on to implement the full language grammar.

Any comments?

Best,
- Joe
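For comparison, the whole toy grammar fits in a few mutually recursive methods when written by hand; PeekToken, ReadToken and NewNode are assumed helpers here, not ParserProto's actual code:

    Function ParseTerm() As SyntaxNode
      // Term: one Factor, then any number of ('+' Factor)
      Dim node As SyntaxNode = ParseFactor()
      While PeekToken() = "+"
        Dim op As String = ReadToken()           // consume the '+'
        node = NewNode(op, node, ParseFactor())  // left-associative
      Wend
      Return node
    End Function

    Function ParseFactor() As SyntaxNode
      // Factor: one Atom, then any number of ('*' Atom)
      Dim node As SyntaxNode = ParseAtom()
      While PeekToken() = "*"
        Dim op As String = ReadToken()
        node = NewNode(op, node, ParseAtom())
      Wend
      Return node
    End Function

    // ParseAtom returns a leaf node for an <identifier> or <number>;
    // "2+3*4" then parses as (+ 2 (* 3 4)).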
From: Joe S. <jo...@st...> - 2008-02-28 16:36:58
Hi gang,

Seth recently checked in an update to BKeeney's "OpenProjectReader" project, in the prototypes section of our repository. It eliminates the dependency on plugins and also fixes up the Unicode support in the lexer.

So today, I integrated that lexer into the main project, and added unit tests as well as an interactive test window. You should give it a try -- it's fun! Just click the "Test Lexer" button on the main window, then type whatever source code you like. I encourage you to see if you can come up with some weird (yet valid) RB source that the lexer fails to handle properly.

If I may be permitted to gush a bit, Seth's code was awesome. Clean, well-organized, interfaces well designed, private parts all properly hidden, no unnecessary dependencies... everything about it was a pleasure to work with. About the only thing I changed was to rename the StartPos and EndPos properties in Token to StartPosB and EndPosB, since it turns out that these are binary positions rather than character positions.

For examples of how to use the lexer classes, see the "RunUnitTests" shared method that each now sports, or look at the lexer test window. It's really quite easy. I'd say we have a great lexer, and it's time to start thinking seriously about parsing. (Thank you, Seth and Bob!)

Best,
- Joe
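A hypothetical usage sketch; the authoritative examples are the RunUnitTests methods mentioned above, and only StartPosB/EndPosB are confirmed names here -- the Lexer and Token members are guessed for illustration:

    Dim lex As New Lexer("Dim x As Integer = 42")
    Dim tok As Token = lex.NextToken()
    While tok <> Nil
      // StartPosB/EndPosB are byte positions, per the rename
      System.DebugLog(tok.Text + " [" + Str(tok.StartPosB) _
        + ".." + Str(tok.EndPosB) + "]")
      tok = lex.NextToken()
    Wend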
From: Bob K. <bo...@bk...> - 2008-02-21 23:55:25
On Feb 21, 2008, at 5:47 PM, Thomas Tempelmann wrote:

> On 22.02.2008 0:17 Uhr, "Joe Strout" <jo...@st...> wrote:
>
>> Under prototype, I've checked in
>> Seth's work for BKeeney Software.
>
> Could it be that it needs some plugins? It's missing "TreeView".
> Smells like Einhugur.

Yes, it currently needs the Einhugur TreeView plugin. Obviously that's something we'll have to take care of in the near future to make it compatible with the RB listbox, as not everyone uses the Einhugur controls.

Bob Keeney
BKeeney Software Inc.
From: Thomas T. <tt...@te...> - 2008-02-21 23:47:36
On 22.02.2008 0:17 Uhr, "Joe Strout" <jo...@st...> wrote:

> Under prototype, I've checked in
> Seth's work for BKeeney Software.

Could it be that it needs some plugins? It's missing "TreeView". Smells like Einhugur.
From: Joe S. <jo...@st...> - 2008-02-21 23:18:05
I've created a simple folder hierarchy in subversion, with "main" and "prototype" at the root level. Under prototype, I've checked in Seth's work for BKeeney Software. I'll check in more relevant stuff as we get a clear license from their copyright holders, but this is a good start.

Under main, I've added a starter project that we can use to collect and verify the pieces we draw from the prototypes. Lexer is first up, probably taking from Seth's and extending it in the directions we discussed this morning (plus shoring up its Unicode support).

Our Subversion home page is here:

<http://sourceforge.net/svn/?group_id=218185>

That includes instructions on checking out the code using command-line svn, as well as a link to browse the repository on the web.

Best,
- Joe
From: Joe S. <jo...@st...> - 2008-02-21 18:57:00
On Feb 21, 2008, at 11:36 AM, Thomas Tempelmann wrote:

> What does the MIT License say about re-using the code in a
> closed-source app?

It's fine with it.

> E.g., while I like to contribute most of my work to others for free,
> I am also working on a "pro" app which I'd like to sell (it shall be
> an RB source code editor with features for project comparison,
> mainly).

I see no problem with that.

> I mean, as much as I like the open source idea in general, in this
> case, I'd prefer a solution where only those parts that we exchange
> have to be kept open, not the entire projects that include some of
> the open parts.

Right, me too.

Best,
- Joe
From: Thomas T. <tt...@te...> - 2008-02-21 18:36:54
What does the MIT License say about re-using the code in a closed-source app?

E.g., while I like to contribute most of my work to others for free, I am also working on a "pro" app which I'd like to sell (it shall be an RB source code editor with features for project comparison, mainly). Must I keep my fingers off your contributions for that? Or can we handle it so that the components, should I improve them, are the only thing I have to give back?

I mean, as much as I like the open source idea in general, in this case, I'd prefer a solution where only those parts that we exchange have to be kept open, not the entire projects that include some of the open parts.

Thomas
From: Seth V. <se...@bk...> - 2008-02-21 16:56:08
>> It seems like having this functionality in the lexer would limit
>> the implementation options. Unless you're just talking about
>> syntactic sugar for skipping tokens until an end-of-line is reached,
>> this seems to imply a line-oriented buffering strategy.
>
> No, I don't think it implies anything about the buffering. In UTF-8,
> an end-of-line character can't occur as part of any other character.
> So if you have one big continuous buffer, then the skip-to-next-line
> method can simply scan ahead in the buffer looking for the next EOL.
> This would be quite a bit faster than grabbing and discarding tokens
> along the way.

Excellent point.

>>> - its position and extent in the source chunk (in bytes or
>>> characters?)
>>
>> Deciding this in advance also seems unnecessarily limiting.
>
> But this is absolutely necessary, since it's part of the lexer
> interface. We should be able to swap out one lexer for another (e.g.
> because we decide to move lexing to a plugin someday) without
> breaking the rest of the system. But that means that, if we provide
> positions at all, we have to specify in advance whether those are
> characters or bytes.
>
> ...But, on the other hand, there's no telling which the caller is
> going to need. It depends entirely on what they're going to do with
> that information. It's always possible to convert from one to the
> other, but this conversion can be expensive (at least when going from
> bytes to chars).

I was thinking more along the lines of treating the position as an abstract type. If you have routines to extract a substring and replace the text between two positions, what difference does it make if it's bytes or characters? That works well for the uses I've got in mind, but other people might have different ideas :) The caveat would be that people would have to resist the urge to assume that it's one or the other.

> Another thing to think about: should the token positions be absolute
> positions relative to the entire source chunk the lexer was given --
> or should they be relative to the start of the line they're on?
> Positions within the line are usually more useful, since almost
> anything you would do with a lexer (syntax coloring, code formatting,
> even compiling) tends to work on either the line or the statement
> level. But at the moment, I'm inclined to feel that the lexer
> shouldn't be worrying about this -- it's easy enough for the caller
> to keep track of line or statement positions itself (especially if
> the lexer returns a start-of-line or end-of-line token).

I prefer to have the positions be absolute. It seems like it will be easier that way down the line, because the positions can be propagated to the AST and then used to do substitutions back into the original code based on modifications to the tree. The position within the line is better for error messages and such, but that can be handled pretty easily, as you pointed out.
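The skip-to-next-line point, sketched with RB's byte-oriented string functions; a hypothetical method, not taken from the actual lexer:

    Function SkipToNextLine(buffer As String, posB As Integer) As Integer
      // In UTF-8, a linefeed byte can never occur inside a multibyte
      // character, so a plain byte scan is safe and fast.
      Dim eol As Integer = InStrB(posB, buffer, ChrB(10))
      If eol = 0 Then Return LenB(buffer) + 1  // no further line
      Return eol + 1  // byte position just past the EOL
    End Function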
From: Bob K. <bo...@bk...> - 2008-02-21 16:19:46
Yes, BKeeney Software is releasing the code under the MIT license.

Bob Keeney
BKeeney Software Inc.

On Feb 21, 2008, at 9:52 AM, Joe Strout wrote:

> I'll try to get that subversion area set up today. Seth, are you OK
> releasing the code that you sent me earlier under the MIT license?