On Feb 21, 2008, at 7:35 AM, Seth Verrinder wrote:
> It might be easier to decide some of these questions if we can see
> what's in progress. Do we want to exchange early prototype code on
> this list or set up an area in subversion for that?
We should use Subversion. I like your idea of an early prototype
area, and then perhaps a main development area where we integrate
> I've got a lexer that I think addresses most of the areas below and
> shouldn't be difficult to extend, it's also pretty quick, I think,
> but if there's a faster way I'm open to suggestions.
From what I've seen, I agree: yours does 95% of what we need, and
it's very fast too. However, at the design stage I prefer to focus
on what we want rather than what we have, so that we don't settle
for less than what we really want. :)
I'll try to get that Subversion area set up today. Seth, are you OK
releasing the code that you sent me earlier under the MIT license?
>> and if the "chunk" includes multiple lines, it
>> should be able to skip to the next line (so that if you're looking
>> for an "End Method" you can skip any lines that don't start with
> It seems like having this functionality in the lexer would limit the
> implementation options. Unless you're just talking about syntactic
> sugar for skipping tokens until an endofline is reached, this seems
> to imply a line oriented buffering strategy.
No, I don't think it implies anything about the buffering. In UTF-8,
the byte sequence for an end-of-line character can't occur as part of
any other character's encoding. So if you have one big contiguous
buffer, the skip-to-next-line method can simply scan ahead in the
buffer looking for the next EOL, without decoding anything. This
would be quite a bit faster than grabbing and discarding tokens
along the way.
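
To make that concrete, here is a rough sketch in Python (used purely
for illustration, not a proposal for the implementation language).
The function name and the choice of CR, LF and CRLF as the
recognized line endings are my assumptions:

```python
def skip_to_next_line(buf: bytes, pos: int) -> int:
    """Return the byte offset just past the next EOL at or after pos.

    Hypothetical helper: assumes buf is UTF-8 and that line endings are
    CR, LF, or CRLF. Returns len(buf) if no EOL is found.
    """
    n = len(buf)
    while pos < n:
        b = buf[pos]
        if b == 0x0A:  # LF
            return pos + 1
        if b == 0x0D:  # CR, possibly followed by LF
            if pos + 1 < n and buf[pos + 1] == 0x0A:
                return pos + 2
            return pos + 1
        # Any other byte, including the bytes of a multi-byte UTF-8
        # character, can never be mistaken for CR or LF.
        pos += 1
    return n
```

Note that it never decodes a character or builds a token; it only
compares raw bytes.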
> That may be a good way of buffering input, but it doesn't seem like
> we need to settle on
> that in advance.
Agreed; implementation details should be hidden and mostly irrelevant
to callers.
>> The information we need about any token would be:
>> - its type (identifier, operator, string literal, color literal,
>> - its value (which operator, what string, etc.) -- or maybe just
>> its raw text?
> Currently I'm just storing raw text, although it wouldn't be very
> difficult to store the decoded value. For what we're doing, it seems
> like the raw text is enough. Since we're not actually compiling the
> code, it isn't guaranteed that we'll even need the value in any other
Good point. I'm convinced.
>> - its position and extent in the source chunk (in bytes or
> Deciding this in advance also seems unnecessarily limiting.
But this is absolutely necessary, since it's part of the lexer
interface. We should be able to swap out one lexer for another (e.g.
because we decide to move lexing to a plugin someday) without
breaking the rest of the system. But that means that, if we provide
positions at all, we have to specify in advance whether those are
characters or bytes.
...But, on the other hand, there's no telling which the caller is
going to need. It depends entirely on what they're going to do with
that information. It's always possible to convert from one to the
other, but this conversion can be expensive (at least when going from
bytes to chars).
I'm starting to wonder if maybe the lexer should keep track of both;
it's iterating over both bytes and characters already, so this
wouldn't be a lot of extra work for it. Then the caller can use
whichever it needs, and we wash our hands of that headache early on.
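
Here is roughly what I have in mind, as a Python sketch (again just
for illustration). The Token fields and the lex_words function are
hypothetical names, "character" here means a Unicode code point, and
the tokenizing itself is deliberately trivial (it only splits on
ASCII whitespace) so that the bookkeeping is easy to see:

```python
from dataclasses import dataclass
from typing import List

WHITESPACE = b" \t\r\n"

@dataclass
class Token:
    text: str          # raw text of the token
    byte_start: int    # offset in bytes from the start of the chunk
    byte_length: int
    char_start: int    # offset in code points from the start of the chunk
    char_length: int

def lex_words(source: bytes) -> List[Token]:
    """Split UTF-8 source on ASCII whitespace, tracking both offsets."""
    tokens = []
    byte_pos = 0
    char_pos = 0
    n = len(source)
    while byte_pos < n:
        if source[byte_pos] in WHITESPACE:
            byte_pos += 1
            char_pos += 1  # ASCII whitespace is one byte, one character
            continue
        start_b, start_c = byte_pos, char_pos
        while byte_pos < n and source[byte_pos] not in WHITESPACE:
            # Count a character only at a lead byte; continuation bytes
            # have the bit pattern 10xxxxxx.
            if (source[byte_pos] & 0xC0) != 0x80:
                char_pos += 1
            byte_pos += 1
        tokens.append(Token(source[start_b:byte_pos].decode("utf-8"),
                            start_b, byte_pos - start_b,
                            start_c, char_pos - start_c))
    return tokens
```

The extra cost is one mask-and-compare per byte, which is why I don't
think tracking both would slow the lexer down much.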
Another thing to think about: should the token positions be absolute
offsets from the start of the entire source chunk the lexer was
given -- or should they be relative to the start of the line they're on?
Positions within the line are usually more useful, since almost
anything you would do with a lexer (syntax coloring, code formatting,
even compiling) tends to work on either the line or the statement
level. But at the moment, I'm inclined to think that the lexer
shouldn't be worrying about this -- it's easy enough for the caller
to keep track of line or statement positions itself (especially if
the lexer returns a start-of-line or end-of-line token).
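
For example, assuming the lexer hands back absolute positions plus an
end-of-line token, the caller-side bookkeeping could be as small as
this Python sketch. The (kind, start, length) tuple shape and the
"eol" kind name are made up for illustration:

```python
def line_relative(tokens):
    """Convert absolute token positions to (line, column) positions.

    tokens: iterable of (kind, absolute_start, length) tuples, where an
    "eol" token marks the end of each line (a hypothetical token shape).
    Yields (kind, line_number, column, length), both zero-based.
    """
    line_no = 0
    line_start = 0
    for kind, start, length in tokens:
        yield (kind, line_no, start - line_start, length)
        if kind == "eol":
            # The next line begins right after this EOL token.
            line_no += 1
            line_start = start + length
```

The same loop works whether the positions are in bytes or in
characters, as long as start and length use the same unit.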