Re: [RBMetacode] Lexer thoughts
From: Joe S. <jo...@st...> - 2008-02-21 15:52:44
On Feb 21, 2008, at 7:35 AM, Seth Verrinder wrote:

> It might be easier to decide some of these questions if we can see
> what's in progress. Do we want to exchange early prototype code on
> this list or set up an area in subversion for that?

We should use subversion. I like your idea of an early prototype area, and then perhaps a main development area where we integrate things.

> I've got a lexer that I think addresses most of the areas below and
> shouldn't be difficult to extend, it's also pretty quick, I think,
> but if there's a faster way I'm open to suggestions.

From what I've seen, I agree, yours does 95% of what we need, and very fast too. However, at the design stage I prefer to focus on what we want, rather than what we have, to avoid settling for what we have rather than what we really want. :)

I'll try to get that subversion area set up today. Seth, are you OK releasing the code that you sent me earlier under the MIT license?

>> and if the "chunk" includes multiple lines, it
>> should be able to skip to the next line (so that if you're looking
>> for an "End Method" you can skip any lines that don't start with
>> that).
>
> It seems like having this functionality in the lexer would limit the
> implementation options. Unless you're just talking about syntactic
> sugar for skipping tokens until an endofline is reached, this seems
> to imply a line oriented buffering strategy.

No, I don't think it implies anything about the buffering. In UTF-8, an end of line character can't occur as part of any other character. So if you have one big continuous buffer, then the skip-to-next-line method can simply scan ahead in the buffer looking for the next EOL. This would be quite a bit faster than grabbing and discarding tokens along the way.

> That may be a good way of buffering input, but it doesn't seem like
> we need to settle on that in advance.

Agreed, implementation details should be hidden and mostly irrelevant from the outside.
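
To make the skip-to-next-line idea concrete, here is a rough sketch in Python (used purely for illustration; nothing here is taken from Seth's lexer). The names skip_to_next_line and find_line_starting_with are hypothetical, and the sketch assumes the source is held as one contiguous UTF-8 byte buffer with CR, LF, or CRLF line endings. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so the bytes 0x0A and 0x0D can only ever be real line endings.

```python
def skip_to_next_line(buf: bytes, pos: int) -> int:
    """Return the offset of the first byte after the next EOL at or after pos.

    Scans raw bytes only; no tokens are produced or discarded.
    Returns len(buf) if no further EOL exists.
    """
    n = len(buf)
    while pos < n:
        b = buf[pos]
        if b == 0x0A:                      # LF
            return pos + 1
        if b == 0x0D:                      # CR, possibly followed by LF
            if pos + 1 < n and buf[pos + 1] == 0x0A:
                return pos + 2
            return pos + 1
        pos += 1
    return n


def find_line_starting_with(buf: bytes, prefix: bytes, pos: int = 0) -> int:
    """Return the offset of the first line at or after pos that begins with
    prefix, or -1 if there is none. Assumes pos is already at a line start
    and that leading whitespace has not been stripped (a simplification)."""
    while pos < len(buf):
        if buf.startswith(prefix, pos):
            return pos
        pos = skip_to_next_line(buf, pos)
    return -1


# Usage: jump straight to the "End Method" line without lexing the body.
source = 'Sub Foo()\r\n  s = "héllo"\rEnd Method\n'.encode("utf-8")
end_pos = find_line_starting_with(source, b"End Method")
```

Note that this says nothing about how the buffer is filled; a line-oriented or chunked buffer could offer the same two calls behind the same interface.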

>> The information we need about any token would be:
>> - its type (identifier, operator, string literal, color literal,
>> etc.)
>> - its value (which operator, what string, etc.) -- or maybe just
>> its raw text?
>
> Currently I'm just storing raw text, although it wouldn't be very
> difficult to store the decoded value. For what we're doing, it seems
> like the raw text is enough. Since we're not actually compiling the
> code, it isn't guaranteed that we'll even need the value in any other
> form.

Good point. I'm convinced.

>> - its position and extent in the source chunk (in bytes or
>> characters?)
>
> Deciding this in advance also seems unnecessarily limiting.

But this is absolutely necessary, since it's part of the lexer interface. We should be able to swap out one lexer for another (e.g. because we decide to move lexing to a plugin someday) without breaking the rest of the system. But that means that, if we provide positions at all, we have to specify in advance whether those are characters or bytes.

...But, on the other hand, there's no telling which the caller is going to need. It depends entirely on what they're going to do with that information. It's always possible to convert from one to the other, but this conversion can be expensive (at least when going from bytes to chars). I'm starting to wonder if maybe the lexer should keep track of both; it's iterating over both bytes and characters already, so this wouldn't be a lot of extra work for it. Then the caller can use whichever it needs, and we wash ourselves of that headache early on.

Another thing to think about: should the token positions be absolute positions relative to the entire source chunk the lexer was given -- or should they be relative to the start of the line they're on? Positions within the line are usually more useful, since almost anything you would do with a lexer (syntax coloring, code formatting, even compiling) tends to work on either the line or the statement level.
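
As a rough illustration of keeping both byte and character positions, here is a toy sketch in Python (for illustration only; it is not the project's lexer). The Token fields and the tokenize function are hypothetical names, the "tokens" are simply whitespace-separated runs, and "character" is assumed to mean a Unicode code point. The point is only that a single pass over the UTF-8 bytes can maintain both counters, by counting every byte that is not a continuation byte (bit pattern 10xxxxxx) as the start of a new character.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str         # raw text of the token
    byte_start: int   # offset in bytes from the start of the chunk
    byte_len: int
    char_start: int   # offset in characters (code points) from the start
    char_len: int


WHITESPACE = b" \t\r\n"


def tokenize(buf: bytes) -> list:
    """Split a UTF-8 buffer into whitespace-separated tokens, recording
    both byte and character positions for each one."""
    tokens = []
    byte_pos = 0
    char_pos = 0
    n = len(buf)
    while byte_pos < n:
        if buf[byte_pos] in WHITESPACE:
            # ASCII whitespace is always one byte and one character.
            byte_pos += 1
            char_pos += 1
            continue
        start_b, start_c = byte_pos, char_pos
        while byte_pos < n and buf[byte_pos] not in WHITESPACE:
            # Continuation bytes (10xxxxxx) extend the current character;
            # any other byte begins a new one.
            if buf[byte_pos] & 0xC0 != 0x80:
                char_pos += 1
            byte_pos += 1
        tokens.append(Token(
            text=buf[start_b:byte_pos].decode("utf-8"),
            byte_start=start_b,
            byte_len=byte_pos - start_b,
            char_start=start_c,
            char_len=char_pos - start_c,
        ))
    return tokens


# Usage: each token carries both kinds of position, so the caller picks one.
for tok in tokenize('é = "ñ"'.encode("utf-8")):
    print(tok)
```

The extra cost per token is two integers and one bit test per byte, which is the kind of overhead the paragraph above guesses would be small.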
But at the moment, I'm inclined to feel that the lexer shouldn't be worrying about this -- it's easy enough for the caller to keep track of line or statement positions itself (especially if the lexer returns a start-of-line or end-of-line token).

Best,
- Joe