Thread: [RBMetacode] Lexer thoughts
Status: Planning
From: Joe S. <jo...@st...> - 2008-02-21 04:03:39
Getting down to business now: let's start with the lexer. We have several good lexers on hand already, so this should be mostly a matter of checking with the copyright holders, and adapting one or more of these to a common interface (plus fixing any errors the unit tests may uncover).

We do need to figure out what that interface should be, though, by considering first the uses to which our lexer may be put:

 - Syntax coloring/styling
 - Feeding tokens to a parser (which may or may not need noncoding
   tokens, e.g. whitespace and comments)
 - Edit-time error reporting (i.e. notifying the user of malformed
   tokens right away)
 - Other fancy editor features (code folding, paren matching, etc.)
 - Declaration mining (i.e. finding declared variables, classes,
   methods, etc.)

So we want a lexer that is fast and correct, but also capable of supporting all of the above functions. The typical lexer paradigm is to give it a chunk of source, and then ask it for a token at a time; and if the "chunk" includes multiple lines, it should be able to skip to the next line (so that if you're looking for an "End Method" you can skip any lines that don't start with that).

The information we need about any token would be:

 - its type (identifier, operator, string literal, color literal, etc.)
 - its value (which operator, what string, etc.) -- or maybe just its
   raw text?
 - its position and extent in the source chunk (in bytes or characters?)
 - whether it is malformed (e.g. &h3G, or 123.345.7)

I don't think we need the lexer itself to do any actual error reporting; it should probably just return the next token, whatever it is, and note (via that last flag in the list above) if the token appears invalid.

What do y'all think? Are we missing anything?

Best,
- Joe
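To make the interface concrete, here is a minimal RB sketch of the token record Joe describes. Every name below (the Token class, the kToken* constants) is hypothetical, invented for discussion rather than taken from any of the existing lexers, and in a real project the class and constants would of course live in the IDE rather than as flat text:

  Class Token
    TokenType As Integer     // one of the kToken* constants below
    Text As String           // raw source text, e.g. "&h3G" or "Val"
    Position As Integer      // offset of the token in the source chunk
    Length As Integer        // extent of the token
    IsMalformed As Boolean   // true for things like &h3G or 123.345.7
  End Class

  // candidate token types, as module constants
  Const kTokenIdentifier = 0
  Const kTokenKeyword = 1
  Const kTokenOperator = 2
  Const kTokenStringLiteral = 3
  Const kTokenColorLiteral = 4
  Const kTokenNumericLiteral = 5
  Const kTokenComment = 6
  Const kTokenWhitespace = 7
  Const kTokenEOL = 8
  Const kTokenLineContinuation = 9

The open questions in the message (decoded value vs. raw text, bytes vs. characters) map directly onto the Text and Position fields, which is why they are worth settling first.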
From: Thomas T. <tt...@te...> - 2008-02-21 10:00:34
On 21.02.2008 5:03 Uhr, "Joe Strout" <jo...@st...> wrote:

> We have several good lexers on hand already, so this should be mostly
> a matter of checking with the copyright holders

FWIW, you're welcome to use my Lexer (TTsRBLexer, inside "RB Parser.rb", which I sent to Joe). I release it to the public domain, so you are not bound by any rules.

I optimized it to be fast. I challenge you to find a faster one (and if you do, let me know, I like to learn too :) The one I wrote was based on my knowledge of compiler design (I maintained a Modula-2 compiler 20 years ago, and also took part in its international standards committee), so I hope it's doing a good job.

Actually, I suggest you adapt several lexers to a common API, such as the one Morphe2 uses, and then test them all against each other, for speed and reliability.

Thomas
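Racing several lexers against each other presupposes the common API Thomas mentions. Here is a minimal sketch of what that shared surface could look like in RB -- the interface name and methods are assumptions for discussion, not the Morphe2 API:

  Interface RBLexer
    Sub SetSource(source As String)   // hand the lexer a chunk of code
    Function NextToken() As Token     // Nil once the chunk is exhausted
    Sub SkipToNextLine()              // fast-forward past the current line
  End Interface

Each candidate (TTsRBLexer, Seth's lexer, etc.) would implement this, so a single benchmark and unit-test harness could drive them all interchangeably.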
From: Thomas T. <tt...@te...> - 2008-02-21 10:06:45
I'd also like to point out that there's already code that reformats source code with proper indentation. We created and tested that code two years ago, and I'm using it in the editor I am working on now. It does not use a lexer but simpler algorithms, and while it's still possible to fool it with some very unusual constructs, it does a pretty good job in general.

It also has several functions that let you keep a cursor position while the source gets reformatted, which makes it suitable for use while someone edits the code in an EditField.

The source comes as a module, and can be found in the RBPrjTools on my website: http://www.tempel.org/rb/#prjtool

Thomas
From: Seth V. <se...@bk...> - 2008-02-21 14:35:13
It might be easier to decide some of these questions if we can see what's in progress. Do we want to exchange early prototype code on this list, or set up an area in Subversion for that? I've got a lexer that I think addresses most of the areas below and shouldn't be difficult to extend; it's also pretty quick, I think, but if there's a faster way I'm open to suggestions.

On Feb 20, 2008, at 10:03 PM, Joe Strout wrote:

> - Syntax coloring/styling
> - Feeding tokens to a parser (which may or may not need noncoding
>   tokens, e.g. whitespace and comments)
> - Edit-time error reporting (i.e. notifying the user of malformed
>   tokens right away)
> - Other fancy editor features (code folding, paren matching, etc.)
> - Declaration mining (i.e. finding declared variables, classes,
>   methods, etc.)
>
> and if the "chunk" includes multiple lines, it should be able to
> skip to the next line (so that if you're looking for an "End Method"
> you can skip any lines that don't start with that).

It seems like having this functionality in the lexer would limit the implementation options. Unless you're just talking about syntactic sugar for skipping tokens until an end-of-line is reached, this seems to imply a line-oriented buffering strategy. That may be a good way of buffering input, but it doesn't seem like we need to settle on that in advance.

> The information we need about any token would be:
> - its type (identifier, operator, string literal, color literal, etc.)
> - its value (which operator, what string, etc.) -- or maybe just its
>   raw text?

Currently I'm just storing raw text, although it wouldn't be very difficult to store the decoded value. For what we're doing, it seems like the raw text is enough. Since we're not actually compiling the code, it isn't guaranteed that we'll even need the value in any other form.

> - its position and extent in the source chunk (in bytes or characters?)

Deciding this in advance also seems unnecessarily limiting. If the lexer uses RB's built-in string handling, then characters are going to be the natural choice. If the lexer uses a MemoryBlock, then bytes are going to be the natural choice.

> - whether it is malformed (e.g. &h3G, or 123.345.7)

Hmm, I can see that a unit testing facility is going to come in handy. My code will translate those as a hexLiteral followed by an identifier, and as two floating-point numbers. They would be caught as invalid syntax by the parser, though.

> I don't think we need the lexer itself to do any actual error
> reporting; it should probably just return the next token, whatever it
> is, but note (via that last flag in the list above) if the token
> appears invalid.

I agree. I started out throwing exceptions on illegal input, but that's not really going to play nice with the parser.
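Those two malformed examples are exactly what a unit-test suite should pin down early. A sketch of what such a test might assert, given the maximal-munch behavior Seth describes -- lex and AssertEqual are hypothetical stand-ins, building on the Token/RBLexer sketches above:

  // "&h3G"      -> the hex scanner stops at the first non-hex digit:
  //                &h3 (numeric literal), then G (identifier)
  // "123.345.7" -> two floating-point literals: 123.345 and .7
  Dim t As Token
  lex.SetSource("&h3G")
  t = lex.NextToken()
  AssertEqual(kTokenNumericLiteral, t.TokenType)   // "&h3"
  t = lex.NextToken()
  AssertEqual(kTokenIdentifier, t.TokenType)       // "G"

Whether that split, or a single token flagged as malformed, is the right answer is what the rest of the thread works out.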
From: Thomas T. <tt...@te...> - 2008-02-21 14:49:08
On 21.02.2008 15:35 Uhr, "Seth Verrinder" <se...@bk...> wrote:

>> I don't think we need the lexer itself to do any actual error
>> reporting; it should probably just return the next token, whatever it
>> is, but note (via that last flag in the list above) if the token
>> appears invalid.
>
> I agree. I started out throwing exceptions on illegal input, but
> that's not really going to play nice with the parser.

Yes, exceptions are not smart in a lexer that has to deal with input that is likely to have errors in it.

Here's how my lexer solves it: when it encounters an error, e.g. a string literal with a missing quote sign at the end, it returns a special "error" token. The parser can then simply skip to the next line to recover. This works well because RB is a line-oriented language (as opposed to C, for instance).

Care has to be taken, though, that skipping to the next line also skips past lines ending with "_" (line continuations).

Thomas
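The "_" wrinkle is easy to get wrong, so here is a token-level sketch of a continuation-aware skip, assuming (as in the sketches above) that the lexer emits a kTokenEOL token at each line end and classifies a trailing "_" as kTokenLineContinuation. The raw-buffer scan Joe suggests later in the thread would be the faster path, but it needs the same continuation check:

  // Skip to the next *logical* line: keep consuming physical lines
  // for as long as the line just consumed ended in a "_" continuation.
  Sub SkipToNextLogicalLine(lex As RBLexer)
    Dim t As Token
    Dim continued As Boolean
    Do
      continued = False
      t = lex.NextToken()
      While t <> Nil And t.TokenType <> kTokenEOL
        If t.TokenType = kTokenLineContinuation Then continued = True
        t = lex.NextToken()
      Wend
    Loop Until t = Nil Or Not continued
  End Sub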
From: Joe S. <jo...@st...> - 2008-02-21 15:18:10
On Feb 21, 2008, at 7:48 AM, Thomas Tempelmann wrote:

> Yes, exceptions are not smart in a lexer that has to deal with input
> that is likely to have errors in it.
>
> Here's how my lexer solves it: when it encounters an error, e.g. a
> string literal with a missing quote sign at the end, it returns a
> special "error" token. The parser can then simply skip to the next
> line to recover.

Yes, I've used that approach in the past too, but found that it tends to cause complications further down with subsequent tools.

Say, for example, that you are using the lexer for syntax coloring. Just because somebody has typed

  x = &h3G04 + Val(foo)

and &h3G04 is a malformed token, doesn't mean that we don't want to still correctly color Val, foo, and the operator and parens. In fact, I'd prefer to see &h3G04 correctly colored like other hex literals too, just with a red dashed underline like a misspelled word in Mail or TextEdit.

As another example, suppose we're feeding the above into a parser to generate an AST, which we're then going to use to standardize the spacing or whatever. If the lexer returns &h3G04 as a hex literal with a "malformed" flag set, the parser (which looks mainly at the token type) will continue to work, and we'll get a Statement node containing an Assignment with an Expression on the right-hand side, as usual. But if the lexer returns &h3G04 as an error token, then our parser will break, unless our grammar is riddled with special cases to handle an error at almost any point. We won't get back a proper AST node, and won't be able to format the line.

Of course the case of a missing close quote is different, since that really does screw things up all the way to the end of the line. But there are other malformed tokens that are pretty easy to continue past, and it'd be nice to treat them in a uniform manner.

> This works well because RB is a line-oriented language (as opposed to
> C, for instance)

Yes, this is a real advantage for us -- it makes it easy to search for things that you know have to be at the start of a line, like block openers/closers.

> Care has to be taken, though, that skipping to the next line also
> skips past lines ending with "_" (line continuations).

Yes again, that's a wrinkle to be constantly aware of. Even RB doesn't always handle this feature very well (e.g., breakpoints on continued lines).

Best,
- Joe
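To illustrate the payoff of a "malformed" flag over an error token, here is what a coloring pass might look like -- ApplyStyle, UnderlineRange, and ColorForType are hypothetical stand-ins for whatever the editor actually provides, on top of the Token sketch above:

  // Color every token by its type; malformed tokens keep their type's
  // color and merely pick up the red dashed error underline.
  Dim t As Token
  t = lex.NextToken()
  While t <> Nil
    ApplyStyle(t.Position, t.Length, ColorForType(t.TokenType))
    If t.IsMalformed Then
      UnderlineRange(t.Position, t.Length)   // like a misspelled word
    End If
    t = lex.NextToken()
  Wend

With an error-token design, &h3G04 would instead collapse into an undifferentiated "error" token and lose its identity as a hex literal, even though the rest of the line could still be colored.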
From: Thomas T. <tt...@te...> - 2008-02-21 15:20:49
On 21.02.2008 16:17 Uhr, "Joe Strout" <jo...@st...> wrote:

> But if the lexer returns &h3G04 as an error token, then our parser
> will break, unless our grammar is riddled with special cases to
> handle an error at almost any point.

Good points. Maybe the way ANTLR works with "channels" could help here?
From: Joe S. <jo...@st...> - 2008-02-21 15:30:43
On Feb 21, 2008, at 8:20 AM, Thomas Tempelmann wrote:

>> But if the lexer returns &h3G04 as an error token, then our parser
>> will break, unless our grammar is riddled with special cases to
>> handle an error at almost any point.
>
> Good points. Maybe the way ANTLR works with "channels" could help here?

Could be -- that is a pretty clever system. At the lexer level, though, I figure it will suffice to have a couple of mode flags:

 - should the lexer skip noncoding tokens
   (whitespace/comments/line-continuation)?
 - should the lexer skip the rest of the line after detecting an error?

And actually, maybe we don't need the second one, since whatever's invoking the lexer could easily watch for an error flag itself, and call the lexer's SkipToNextLine (or whatever) method if that's what it wants to do.

Best,
- Joe
From: Seth V. <se...@bk...> - 2008-02-21 15:49:02
On Feb 21, 2008, at 9:30 AM, Joe Strout wrote:

> - should the lexer skip the rest of the line after detecting an error?
>
> And actually, maybe we don't need the second one, since whatever's
> invoking the lexer could easily watch for an error flag itself, and
> call the lexer's SkipToNextLine (or whatever) method if that's what
> it wants to do.

My vote's on leaving it up to the calling code to take whatever action makes sense. It may be different depending on what the lexer is being used for, so why clutter the interface?

Seth
From: Joe S. <jo...@st...> - 2008-02-21 15:55:00
On Feb 21, 2008, at 8:48 AM, Seth Verrinder wrote:

>> - should the lexer skip the rest of the line after detecting an
>>   error?
>>
>> And actually, maybe we don't need the second one, since whatever's
>> invoking the lexer could easily watch for an error flag itself, and
>> call the lexer's SkipToNextLine (or whatever) method if that's what
>> it wants to do.
>
> My vote's on leaving it up to the calling code to take whatever
> action makes sense. It may be different depending on what the lexer
> is being used for, so why clutter the interface?

Good call. I agree. And maybe we should apply the same logic to skipping noncoding tokens. It's standard in CS lexers to do this, i.e. for lexers to scan right past whitespace (and often comments). But that's because they're not going to do anything with the tokens except compile them. We're going to be doing other things, so maybe we should just always return everything you would need to rebuild the original code; if the caller doesn't need those tokens, it can discard them based on their type.

Best,
- Joe
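Under that design the lexer needs no mode flags at all, and filtering collapses to a few lines in the caller. A sketch, again using the hypothetical names from earlier in the thread:

  // Parser-facing wrapper: returns the next coding token, silently
  // discarding whitespace, comments, and line continuations.
  Function NextCodingToken(lex As RBLexer) As Token
    Dim t As Token
    t = lex.NextToken()
    While t <> Nil
      Select Case t.TokenType
      Case kTokenWhitespace, kTokenComment, kTokenLineContinuation
        t = lex.NextToken()   // noncoding -- skip and keep going
      Case Else
        Return t
      End Select
    Wend
    Return Nil
  End Function

A syntax colorer would call NextToken directly and see everything; a parser would call this wrapper and see only coding tokens.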
From: Joe S. <jo...@st...> - 2008-02-21 15:52:44
On Feb 21, 2008, at 7:35 AM, Seth Verrinder wrote:

> It might be easier to decide some of these questions if we can see
> what's in progress. Do we want to exchange early prototype code on
> this list, or set up an area in Subversion for that?

We should use Subversion. I like your idea of an early prototype area, and then perhaps a main development area where we integrate things.

> I've got a lexer that I think addresses most of the areas below and
> shouldn't be difficult to extend; it's also pretty quick, I think,
> but if there's a faster way I'm open to suggestions.

From what I've seen, I agree: yours does 95% of what we need, and very fast too. However, at the design stage I prefer to focus on what we want, rather than what we have, to avoid settling for what we have rather than what we really want. :)

I'll try to get that Subversion area set up today. Seth, are you OK releasing the code that you sent me earlier under the MIT license?

>> and if the "chunk" includes multiple lines, it should be able to
>> skip to the next line (so that if you're looking for an "End Method"
>> you can skip any lines that don't start with that).
>
> It seems like having this functionality in the lexer would limit the
> implementation options. Unless you're just talking about syntactic
> sugar for skipping tokens until an end-of-line is reached, this seems
> to imply a line-oriented buffering strategy.

No, I don't think it implies anything about the buffering. In UTF-8, an end-of-line character can't occur as part of any other character. So if you have one big continuous buffer, the skip-to-next-line method can simply scan ahead in the buffer looking for the next EOL. This would be quite a bit faster than grabbing and discarding tokens along the way.

> That may be a good way of buffering input, but it doesn't seem like
> we need to settle on that in advance.

Agreed; implementation details should be hidden, and mostly irrelevant, from the outside.

>> The information we need about any token would be:
>> - its type (identifier, operator, string literal, color literal, etc.)
>> - its value (which operator, what string, etc.) -- or maybe just its
>>   raw text?
>
> Currently I'm just storing raw text, although it wouldn't be very
> difficult to store the decoded value. For what we're doing, it seems
> like the raw text is enough. Since we're not actually compiling the
> code, it isn't guaranteed that we'll even need the value in any other
> form.

Good point. I'm convinced.

>> - its position and extent in the source chunk (in bytes or
>>   characters?)
>
> Deciding this in advance also seems unnecessarily limiting.

But this is absolutely necessary, since it's part of the lexer interface. We should be able to swap out one lexer for another (e.g. because we decide to move lexing to a plugin someday) without breaking the rest of the system. But that means that, if we provide positions at all, we have to specify in advance whether those are characters or bytes.

...But, on the other hand, there's no telling which the caller is going to need. It depends entirely on what they're going to do with that information. It's always possible to convert from one to the other, but this conversion can be expensive (at least when going from bytes to chars). I'm starting to wonder if maybe the lexer should keep track of both; it's iterating over both bytes and characters already, so this wouldn't be a lot of extra work for it. Then the caller can use whichever it needs, and we wash ourselves of that headache early on.

Another thing to think about: should the token positions be absolute positions relative to the entire source chunk the lexer was given, or should they be relative to the start of the line they're on? Positions within the line are usually more useful, since almost anything you would do with a lexer (syntax coloring, code formatting, even compiling) tends to work on either the line or the statement level. But at the moment, I'm inclined to feel that the lexer shouldn't be worrying about this -- it's easy enough for the caller to keep track of line or statement positions itself (especially if the lexer returns a start-of-line or end-of-line token).

Best,
- Joe
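Both points here (scanning raw bytes for EOL, and tracking byte and character positions together) rest on the same UTF-8 properties: bytes below &h80 never occur inside a multi-byte character, and every character contributes exactly one byte outside the continuation range &h80-&hBF. A sketch of a scan loop exploiting this -- the mSource/mBytePos/mCharPos members are hypothetical lexer internals, with the source held in a MemoryBlock:

  // Advance to just past the next EOL, keeping the byte and character
  // counts in sync along the way.
  Sub SkipToNextLine()
    Dim b As Integer
    While mBytePos < mSource.Size
      b = mSource.Byte(mBytePos)
      mBytePos = mBytePos + 1
      If b < &h80 Or b > &hBF Then
        mCharPos = mCharPos + 1   // non-continuation byte starts a character
      End If
      If b = 10 Then Return   // LF found (CR/CRLF handling omitted)
    Wend
  End Sub

Because LF (10) is below &h80, the raw byte scan can never mistake part of a multi-byte character for an end of line; that is the guarantee that makes this safe. (The "_" continuation check from earlier in the thread would layer on top.)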
From: Bob K. <bo...@bk...> - 2008-02-21 16:19:46
Yes, BKeeney Software is releasing the code under the MIT license.

Bob Keeney
BKeeney Software Inc.

On Feb 21, 2008, at 9:52 AM, Joe Strout wrote:

> I'll try to get that Subversion area set up today. Seth, are you OK
> releasing the code that you sent me earlier under the MIT license?
From: Thomas T. <tt...@te...> - 2008-02-21 18:36:54
What does the MIT license say about re-using the code in a closed-source app?

E.g., while I like to contribute most of my work to others for free, I am also working on a "pro" app which I'd like to sell (it shall be an RB source code editor with features for project comparison, mainly). Must I keep my fingers off your contributions for that? Or can we handle it so that the components themselves, should I improve them, are the only thing I have to give back?

I mean, as much as I like the open source idea in general, in this case I'd prefer a solution where only those parts that we exchange have to be kept open, not the entire projects that include some of the open parts.

Thomas
From: Joe S. <jo...@st...> - 2008-02-21 18:57:00
On Feb 21, 2008, at 11:36 AM, Thomas Tempelmann wrote:

> What does the MIT license say about re-using the code in a
> closed-source app?

It's fine with it.

> E.g., while I like to contribute most of my work to others for free,
> I am also working on a "pro" app which I'd like to sell (it shall be
> an RB source code editor with features for project comparison,
> mainly).

I see no problem with that.

> I mean, as much as I like the open source idea in general, in this
> case I'd prefer a solution where only those parts that we exchange
> have to be kept open, not the entire projects that include some of
> the open parts.

Right, me too.

Best,
- Joe
From: Seth V. <se...@bk...> - 2008-02-21 16:56:08
>> It seems like having this functionality in the lexer would limit the
>> implementation options. Unless you're just talking about syntactic
>> sugar for skipping tokens until an end-of-line is reached, this seems
>> to imply a line-oriented buffering strategy.
>
> No, I don't think it implies anything about the buffering. In UTF-8,
> an end-of-line character can't occur as part of any other character.
> So if you have one big continuous buffer, the skip-to-next-line
> method can simply scan ahead in the buffer looking for the next EOL.
> This would be quite a bit faster than grabbing and discarding tokens
> along the way.

Excellent point.

>>> - its position and extent in the source chunk (in bytes or
>>>   characters?)
>>
>> Deciding this in advance also seems unnecessarily limiting.
>
> But this is absolutely necessary, since it's part of the lexer
> interface. We should be able to swap out one lexer for another (e.g.
> because we decide to move lexing to a plugin someday) without
> breaking the rest of the system. But that means that, if we provide
> positions at all, we have to specify in advance whether those are
> characters or bytes.
>
> ...But, on the other hand, there's no telling which the caller is
> going to need. It depends entirely on what they're going to do with
> that information. It's always possible to convert from one to the
> other, but this conversion can be expensive (at least when going from
> bytes to chars).

I was thinking more along the lines of treating the position as an abstract type. If you have routines to extract a substring and replace the text between two positions, what difference does it make whether it's bytes or characters? That works well for the uses I've got in mind, but other people might have different ideas :) The caveat would be that people would have to resist the urge to assume that it's one or the other.

> Another thing to think about: should the token positions be absolute
> positions relative to the entire source chunk the lexer was given,
> or should they be relative to the start of the line they're on?
> Positions within the line are usually more useful, since almost
> anything you would do with a lexer (syntax coloring, code formatting,
> even compiling) tends to work on either the line or the statement
> level. But at the moment, I'm inclined to feel that the lexer
> shouldn't be worrying about this -- it's easy enough for the caller
> to keep track of line or statement positions itself (especially if
> the lexer returns a start-of-line or end-of-line token).

I prefer to have the positions be absolute. It seems like it will be easier that way down the line, because the positions can be propagated to the AST and then used to do substitutions back into the original code based on modifications to the tree. The position within the line is better for error messages and such, but that can be handled pretty easily, as you pointed out.
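For what it's worth, Seth's abstract-position idea might look like this in RB -- an opaque class plus the two operations he names. All names are hypothetical, and the representation (here, both offsets, per Joe's earlier suggestion) stays hidden behind the operations:

  Class SourcePosition
    ByteOffset As Integer   // representation detail; callers should
    CharOffset As Integer   // treat SourcePosition as opaque
  End Class

  // Extract the text between two positions (MidB is 1-based; LeftB
  // takes a byte count -- both are RB's byte-oriented string functions).
  Function Substring(src As String, startPos As SourcePosition, endPos As SourcePosition) As String
    Return MidB(src, startPos.ByteOffset + 1, endPos.ByteOffset - startPos.ByteOffset)
  End Function

  // Return a copy of src with the text between two positions replaced.
  Function ReplaceRange(src As String, startPos As SourcePosition, endPos As SourcePosition, replacement As String) As String
    Return LeftB(src, startPos.ByteOffset) + replacement + MidB(src, endPos.ByteOffset + 1)
  End Function

Carrying both offsets inside the class also answers the "resist the urge to assume" caveat: with nothing exposed, there is nothing to assume.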