Re: [RBMetacode] Lexer thoughts
From: Seth V. <se...@bk...> - 2008-02-21 14:35:13
It might be easier to decide some of these questions if we can see what's in progress. Do we want to exchange early prototype code on this list, or set up an area in Subversion for that? I've got a lexer that I think addresses most of the areas below and shouldn't be difficult to extend. It's also pretty quick, I think, but if there's a faster way I'm open to suggestions.

On Feb 20, 2008, at 10:03 PM, Joe Strout wrote:

> - Syntax coloring/styling
> - Feeding tokens to a parser
> (which may or may not need noncoding tokens, e.g. whitespace and
> comments)
> - Edit-time error reporting (i.e. notifying the user of malformed
> tokens right away)
> - Other fancy editor features (code folding, paren matching, etc.)
> - Declaration mining (i.e. finding declared variables, classes,
> methods, etc.)
> and if the "chunk" includes multiple lines, it
> should be able to skip to the next line (so that if you're looking
> for an "End Method" you can skip any lines that don't start with
> that).

It seems like having this functionality in the lexer would limit the implementation options. Unless you're just talking about syntactic sugar for skipping tokens until an end-of-line token is reached, this seems to imply a line-oriented buffering strategy. That may be a good way of buffering input, but it doesn't seem like we need to settle on it in advance.

> The information we need about any token would be:
> - its type (identifier, operator, string literal, color literal,
> etc.)
> - its value (which operator, what string, etc.) -- or maybe just
> its raw text?

Currently I'm just storing raw text, although it wouldn't be very difficult to store the decoded value as well. For what we're doing, the raw text seems to be enough; since we're not actually compiling the code, there's no guarantee we'll even need the value in any other form.

> - its position and extent in the source chunk (in bytes or
> characters?)

Deciding this in advance also seems unnecessarily limiting. If the lexer uses RB's built-in string handling, then characters are the natural choice; if it uses a MemoryBlock, then bytes are.

> - whether it is malformed (e.g. &h3G, or 123.345.7)

Hmm, I can see that a unit-testing facility is going to come in handy. My code will translate those as a hexLiteral followed by an identifier, and as two floating-point numbers; they would be caught as invalid syntax by the parser, though.

> I don't think we need the lexer itself to do any actual error
> reporting; it should probably just return the next token, whatever it
> is, but note (via that last flag in the list above) if the token
> appears invalid.

I agree. I started out throwing exceptions on illegal input, but that's not really going to play nicely with the parser.

> What do y'all think? Are we missing anything?
>
> Best,
> - Joe
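
For illustration only, here is a minimal sketch of the token record and lexing behavior being discussed. It is written in Python rather than RB, and every name in it (Token, tokenize, the token-type strings) is hypothetical, not taken from the actual RBMetacode lexer. It shows the four pieces of per-token information from Joe's list and how inputs like &h3G and 123.345.7 can come out as well-formed tokens that only the parser later rejects:

# Illustrative sketch only -- not the actual RBMetacode lexer.
# Token fields mirror the list in Joe's message: type, raw text,
# position and extent, and a "malformed" flag.
import re
from dataclasses import dataclass

@dataclass
class Token:
    type: str        # e.g. "hexLiteral", "number", "identifier", "operator"
    text: str        # raw source text; decoding the value is left to callers
    start: int       # offset into the source chunk (characters, in this sketch)
    length: int
    malformed: bool = False   # the lexer itself does no error reporting

TOKEN_RE = re.compile(r"""
      (?P<hexLiteral>&[hH][0-9A-Fa-f]+)
    | (?P<number>\d+(?:\.\d+)?|\.\d+)
    | (?P<identifier>[A-Za-z_]\w*)
    | (?P<endOfLine>\r\n|\r|\n)
    | (?P<whitespace>[ \t]+)
    | (?P<operator>[^\w\s])
""", re.VERBOSE)

def tokenize(source):
    """Return every token (including whitespace and endOfLine) in order."""
    return [Token(m.lastgroup, m.group(), m.start(), m.end() - m.start())
            for m in TOKEN_RE.finditer(source)]

# "&h3G" lexes as a hexLiteral ("&h3") followed by an identifier ("G"),
# and "123.345.7" as two floating-point numbers ("123.345" and ".7") --
# as described above, it is the parser that flags these as invalid syntax.
if __name__ == "__main__":
    for t in tokenize("&h3G 123.345.7"):
        print(t.type, repr(t.text))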
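And a similarly hypothetical sketch of the "syntactic sugar" reading of skip-to-next-line: a helper that discards tokens up to and including the next endOfLine token from the sketch above, with no line-oriented buffering assumed:

# Hypothetical helper, building on the tokenize() sketch above.
def skip_to_next_line(tokens, i):
    """Return the index just past the next endOfLine token (or len(tokens))."""
    while i < len(tokens):
        is_eol = tokens[i].type == "endOfLine"
        i += 1
        if is_eol:
            break
    return i

# Example: when scanning for "End Method", a caller can look at the first
# non-whitespace token on a line and, if it isn't "End", jump ahead:
#     i = skip_to_next_line(tokens, i)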