Re: [RBMetacode] Lexer thoughts

On Feb 21, 2008, at 7:35 AM, Seth Verrinder wrote:

> It might be easier to decide some of these questions if we can see
> what's in progress. Do we want to exchange early prototype code on
> this list or set up an area in subversion for that?

We should use subversion.  I like your idea of an early prototype  
area, and then perhaps a main development area where we integrate  
things.

> I've got a lexer that I think addresses most of the areas below and
> shouldn't be difficult to extend, it's also pretty quick, I think,
> but if there's a faster way I'm open to suggestions.

From what I've seen, I agree: yours does 95% of what we need, and does
it very fast too.  However, at the design stage I prefer to focus on
what we want rather than what we have, so that we don't end up
settling for less than we really want.  :)

I'll try to get that subversion area set up today.  Seth, are you OK  
releasing the code that you sent me earlier under the MIT license?

>>  and if the "chunk" includes multiple lines, it
>> should be able to skip to the next line (so that if you're looking
>> for an "End Method" you can skip any lines that don't start with
>> that).
>
> It seems like having this functionality in the lexer would limit the
> implementation options. Unless you're just talking about syntactic
> sugar for skipping tokens until an endofline is reached, this seems
> to imply a line oriented buffering strategy.

No, I don't think it implies anything about the buffering.  In UTF-8,
the byte for an end-of-line character (CR or LF) can't occur as part
of any other character, because every byte of a multi-byte sequence
has its high bit set.  So if you have one big continuous buffer, then
the skip-to-next-line method can simply scan ahead in the buffer
looking for the next EOL.  This would be quite a bit faster than
grabbing and discarding tokens along the way.
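
To make the scan-ahead idea concrete, here is a minimal sketch in Python (chosen only for illustration; it is not our implementation language).  The names skip_to_next_line and find_line_starting_with are hypothetical, and I'm assuming the buffer is UTF-8 bytes and that CR, LF, and CRLF all count as line endings.

```python
def skip_to_next_line(buf: bytes, pos: int) -> int:
    """Return the byte offset just past the next EOL at or after pos.

    Returns len(buf) if no further EOL exists. Scanning raw bytes is safe
    in UTF-8 because CR (0x0D) and LF (0x0A) never appear inside a
    multi-byte sequence.
    """
    n = len(buf)
    lf = buf.find(b"\n", pos)
    cr = buf.find(b"\r", pos)
    if lf == -1 and cr == -1:
        return n
    if cr != -1 and (lf == -1 or cr < lf):
        # A CR comes first: treat CRLF as a single line ending.
        if cr + 1 < n and buf[cr + 1] == 0x0A:
            return cr + 2
        return cr + 1
    return lf + 1


def find_line_starting_with(buf: bytes, prefix: bytes, pos: int = 0) -> int:
    """Return the offset of the first line (from pos) that starts with prefix, or -1.

    Assumes pos is at the start of a line; leading whitespace is not skipped.
    """
    while pos < len(buf):
        if buf.startswith(prefix, pos):
            return pos
        pos = skip_to_next_line(buf, pos)
    return -1
```

A caller looking for "End Method" would call find_line_starting_with(buf, b"End Method") and never tokenize the lines in between.  A real version would probably also want to skip leading whitespace and compare case-insensitively.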

> That may be a good way of buffering input, but it doesn't seem like
> we need to settle on that in advance.

Agreed; implementation details should be hidden from, and mostly
irrelevant to, the outside.

>> The information we need about any token would be:
>>    - its type (identifier, operator, string literal, color literal,
>> etc.)
>>    - its value (which operator, what string, etc.) -- or maybe just
>> its raw text?
>
> Currently I'm just storing raw text, although it wouldn't be very
> difficult to store the decoded value. For what we're doing, it seems
> like the raw text is enough. Since we're not actually compiling the
> code, it isn't guaranteed that we'll even need the value in any other
> form.

Good point.  I'm convinced.

>>    - its position and extent in the source chunk (in bytes or
>> characters?)
>
> Deciding this in advance also seems unnecessarily limiting.

But this is absolutely necessary, since it's part of the lexer  
interface.  We should be able to swap out one lexer for another (e.g.  
because we decide to move lexing to a plugin someday) without  
breaking the rest of the system.  But that means that, if we provide  
positions at all, we have to specify in advance whether those are  
characters or bytes.

...But, on the other hand, there's no telling which the caller is  
going to need.  It depends entirely on what they're going to do with  
that information.  It's always possible to convert from one to the  
other, but this conversion can be expensive (at least when going from  
bytes to chars).

I'm starting to wonder if maybe the lexer should keep track of both;
it's iterating over both bytes and characters already, so this
wouldn't be a lot of extra work for it.  Then the caller can use
whichever it needs, and we wash our hands of that headache early on.
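
As a rough illustration of carrying both offsets, here is a sketch in Python (again just for illustration).  The Token fields, the token kinds, and the toy grammar in _TOKEN_RE are all assumptions of mine, not a proposal for the real token set; the point is only that the lexer can advance a character counter and a byte counter side by side.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    kind: str       # e.g. "identifier", "operator", "string", "eol"
    text: str       # raw source text of the token
    char_pos: int   # offset in characters from the start of the chunk
    char_len: int
    byte_pos: int   # offset in UTF-8 bytes from the start of the chunk
    byte_len: int


# Toy grammar, for illustration only.
_TOKEN_RE = re.compile(r'''
    (?P<string>"(?:[^"\r\n]|"")*")
  | (?P<identifier>[^\W\d]\w*)
  | (?P<number>\d+)
  | (?P<eol>\r\n|\r|\n)
  | (?P<space>[ \t]+)
  | (?P<operator>.)
''', re.VERBOSE)


def tokenize(source: str):
    """Yield Tokens carrying both character and UTF-8 byte positions."""
    char_pos = 0
    byte_pos = 0
    for m in _TOKEN_RE.finditer(source):
        text = m.group()
        byte_len = len(text.encode("utf-8"))
        if m.lastgroup != "space":  # whitespace is counted but not emitted
            yield Token(m.lastgroup, text, char_pos, len(text),
                        byte_pos, byte_len)
        char_pos += len(text)
        byte_pos += byte_len
```

For ASCII-only source the two offsets are identical; they drift apart only after a non-ASCII character, typically inside a string literal or comment.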

Another thing to think about: should the token positions be absolute  
positions relative to the entire source chunk the lexer was given --  
or should they be relative to the start of the line they're on?   
Positions within the line are usually more useful, since almost  
anything you would do with a lexer (syntax coloring, code formatting,  
even compiling) tends to work on either the line or the statement  
level.  But at the moment, I'm inclined to feel that the lexer  
shouldn't be worrying about this -- it's easy enough for the caller  
to keep track of line or statement positions itself (especially if  
the lexer returns a start-of-line or end-of-line token).
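
Here is a small sketch of what that caller-side bookkeeping might look like, in Python for illustration.  I'm assuming a hypothetical token shape of (kind, text, pos), where pos is an absolute offset and the lexer emits an "eol" token for each line ending; with_line_positions is a made-up name.

```python
def with_line_positions(tokens):
    """Convert absolute token positions to (line number, column) pairs.

    tokens: iterable of (kind, text, pos) tuples with absolute pos.
    Yields (kind, text, line_no, column), both zero-based.
    pos and len(text) must use the same unit (characters here).
    """
    line_no = 0
    line_start = 0
    for kind, text, pos in tokens:
        yield kind, text, line_no, pos - line_start
        if kind == "eol":
            # The next line begins right after this line ending.
            line_no += 1
            line_start = pos + len(text)
```

The lexer stays ignorant of lines entirely; the only thing it has to provide is the end-of-line token.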

Best,
- Joe