Re: [RBMetacode] Lexer thoughts


>>
>> It seems like having this functionality in the lexer would limit the
>> implementation options. Unless you're just talking about syntactic
>> sugar for skipping tokens until an end-of-line is reached, this seems
>> to imply a line-oriented buffering strategy.
>
> No, I don't think it implies anything about the buffering.  In UTF-8,
> the byte for an end-of-line character (CR or LF) can't occur as part
> of the encoding of any other character.  So if you have one big
> contiguous buffer, then the skip-to-next-line method can simply scan
> ahead in the buffer looking for the next EOL.  This would be quite a
> bit faster than grabbing and discarding tokens along the way.

Excellent point.
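
To make sure I follow, here is a rough sketch of that scan, written in Python purely for illustration. The function name and the choice to treat LF, CR, and CRLF as line endings are my assumptions, not anything we've agreed on:

```python
def skip_to_next_line(buf: bytes, pos: int) -> int:
    """Return the byte offset just past the next EOL at or after pos.

    Hypothetical helper: scans raw UTF-8 bytes without decoding.
    This is safe because bytes 0x0A (LF) and 0x0D (CR) never appear
    inside a multi-byte UTF-8 sequence. Returns len(buf) if no EOL remains.
    """
    n = len(buf)
    i = pos
    while i < n:
        b = buf[i]
        if b == 0x0A:                      # LF
            return i + 1
        if b == 0x0D:                      # CR, possibly followed by LF
            if i + 1 < n and buf[i + 1] == 0x0A:
                return i + 2
            return i + 1
        i += 1
    return n
```

No tokens are produced along the way, so the cost is a plain byte scan regardless of what the skipped text contains.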

>
>>>    - its position and extent in the source chunk (in bytes or
>>> characters?)
>>
>> Deciding this in advance also seems unnecessarily limiting.
>
> But this is absolutely necessary, since it's part of the lexer
> interface.  We should be able to swap out one lexer for another (e.g.
> because we decide to move lexing to a plugin someday) without
> breaking the rest of the system.  But that means that, if we provide
> positions at all, we have to specify in advance whether those are
> characters or bytes.
>
> ...But, on the other hand, there's no telling which the caller is
> going to need.  It depends entirely on what they're going to do with
> that information.  It's always possible to convert from one to the
> other, but this conversion can be expensive (at least when going from
> bytes to chars).

I was thinking more along the lines of treating the position as an
abstract type. If you have routines to extract a substring and to
replace the text between two positions, what difference does it make
whether a position counts bytes or characters? That works well for the
uses I've got in mind, but other people might have different ideas :)

The caveat would be that people would have to resist the urge to  
assume that it's one or the other.
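
Something like the following is what I have in mind. It's only a sketch in Python, and every name here (Position, SourceChunk, find, substring, replace) is made up for illustration. This version happens to store byte offsets internally, but callers never see that:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True, order=True)
class Position:
    """Opaque position. Callers may compare positions but should not
    read _offset; whether it counts bytes or chars is private."""
    _offset: int


class SourceChunk:
    """Hypothetical source holder that hands out opaque positions."""

    def __init__(self, text: str):
        self._data = text.encode("utf-8")   # assumption: UTF-8 storage

    def text(self) -> str:
        return self._data.decode("utf-8")

    def start(self) -> Position:
        return Position(0)

    def end(self) -> Position:
        return Position(len(self._data))

    def find(self, needle: str,
             start: Position) -> Optional[Tuple[Position, Position]]:
        """Return the (begin, end) positions of needle, or None."""
        raw = needle.encode("utf-8")
        idx = self._data.find(raw, start._offset)
        if idx < 0:
            return None
        return Position(idx), Position(idx + len(raw))

    def substring(self, a: Position, b: Position) -> str:
        return self._data[a._offset:b._offset].decode("utf-8")

    def replace(self, a: Position, b: Position, new_text: str) -> "SourceChunk":
        """Return a new chunk with the text between a and b replaced."""
        data = (self._data[:a._offset]
                + new_text.encode("utf-8")
                + self._data[b._offset:])
        return SourceChunk(data.decode("utf-8"))
```

Swapping the internals over to character offsets would not change any caller that sticks to these routines.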

>
> Another thing to think about: should the token positions be absolute
> positions relative to the entire source chunk the lexer was given --
> or should they be relative to the start of the line they're on?
> Positions within the line are usually more useful, since almost
> anything you would do with a lexer (syntax coloring, code formatting,
> even compiling) tends to work on either the line or the statement
> level.  But at the moment, I'm inclined to feel that the lexer
> shouldn't be worrying about this -- it's easy enough for the caller
> to keep track of line or statement positions itself (especially if
> the lexer returns a start-of-line or end-of-line token).

I prefer to have the positions be absolute. It seems like that will be
easier later on, because the positions can be propagated to the AST
and then used to do substitutions back into the original code based on
modifications to the tree. The position within the line is better for
error messages and such, but that can be handled pretty easily by the
caller, as you pointed out.
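
For instance, the caller could recover line-relative positions from absolute ones with a table of line starts. This is a sketch only, in Python, assuming absolute positions are byte offsets; both function names are hypothetical:

```python
from bisect import bisect_right
from typing import List, Tuple


def build_line_starts(buf: bytes) -> List[int]:
    """Return the byte offset at which each line begins (LF, CR, or CRLF)."""
    starts = [0]
    i, n = 0, len(buf)
    while i < n:
        if buf[i] == 0x0D:                          # CR or CRLF
            if i + 1 < n and buf[i + 1] == 0x0A:
                i += 1                              # consume the LF too
            starts.append(i + 1)
        elif buf[i] == 0x0A:                        # bare LF
            starts.append(i + 1)
        i += 1
    return starts


def line_and_column(starts: List[int], offset: int) -> Tuple[int, int]:
    """Map an absolute offset to (1-based line, 0-based byte column)."""
    line = bisect_right(starts, offset) - 1         # last start <= offset
    return line + 1, offset - starts[line]
```

The table is built once per source chunk, and each lookup afterwards is a binary search, so error messages can still report line and column while the tokens and AST nodes carry only absolute positions.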

>
> Best,
> - Joe