Thread: [RBMetacode] Lexer thoughts
Status: Planning
From: Joe S. <jo...@st...> - 2008-02-21 04:03:39
Getting down to business now: let's start with the lexer. We have several good lexers on hand already, so this should be mostly a matter of checking with the copyright holders, and adapting one or more of these to a common interface (plus fixing any errors the unit tests may uncover).

We do need to figure out what that interface should be, though, by considering first the uses to which our lexer may be put:

 - Syntax coloring/styling
 - Feeding tokens to a parser (which may or may not need noncoding
   tokens, e.g. whitespace and comments)
 - Edit-time error reporting (i.e. notifying the user of malformed
   tokens right away)
 - Other fancy editor features (code folding, paren matching, etc.)
 - Declaration mining (i.e. finding declared variables, classes,
   methods, etc.)

So we want a lexer that is fast and correct, but also capable of supporting all of the above functions. The typical lexer paradigm is to give it a chunk of source, and then ask it for a token at a time; and if the "chunk" includes multiple lines, it should be able to skip to the next line (so that if you're looking for an "End Method" you can skip any lines that don't start with that).

The information we need about any token would be:

 - its type (identifier, operator, string literal, color literal, etc.)
 - its value (which operator, what string, etc.) -- or maybe just its
   raw text?
 - its position and extent in the source chunk (in bytes or characters?)
 - whether it is malformed (e.g. &h3G, or 123.345.7)

I don't think we need the lexer itself to do any actual error reporting; it should probably just return the next token, whatever it is, and note (via that last flag in the list above) if the token appears invalid.

What do y'all think? Are we missing anything?

Best,
- Joe
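To make the interface concrete, here is a minimal RB sketch of the token record Joe describes. Every name below (the Token class, the kToken* constants) is hypothetical, invented for discussion rather than taken from any of the existing lexers, and in a real project the class and constants would of course live in the IDE rather than as flat text:

  Class Token
    TokenType As Integer     // one of the kToken* constants below
    Text As String           // raw source text, e.g. "&h3G" or "Val"
    Position As Integer      // offset of the token in the source chunk
    Length As Integer        // extent of the token
    IsMalformed As Boolean   // true for things like &h3G or 123.345.7
  End Class

  // candidate token types, as module constants
  Const kTokenIdentifier = 0
  Const kTokenKeyword = 1
  Const kTokenOperator = 2
  Const kTokenStringLiteral = 3
  Const kTokenColorLiteral = 4
  Const kTokenNumericLiteral = 5
  Const kTokenComment = 6
  Const kTokenWhitespace = 7
  Const kTokenEOL = 8
  Const kTokenLineContinuation = 9

The open questions in the message (decoded value vs. raw text, bytes vs. characters) map directly onto the Text and Position fields, which is why they are worth settling first.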
From: Thomas T. <tt...@te...> - 2008-02-21 10:00:34
On 21.02.2008 5:03 Uhr, "Joe Strout" <jo...@st...> wrote:

> We have several good lexers on hand already, so this should be mostly
> a matter of checking with the copyright holders

FWIW, you're welcome to use my Lexer (TTsRBLexer, inside "RB Parser.rb", which I sent to Joe). I release it to the public domain, so you are not bound by any rules.

I optimized it to be fast. I challenge you to find a faster one (and if you do, let me know, I like to learn too :) The one I wrote was based on my knowledge of compiler design (I maintained a Modula-2 compiler 20 years ago, and also took part in its international standards committee), so I hope it's doing a good job.

Actually, I suggest you adapt several lexers to a common API, such as the one Morphe2 uses, and then test them all against each other, for speed and reliability.

Thomas
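Racing several lexers against each other presupposes the common API Thomas mentions. Here is a minimal sketch of what that shared surface could look like in RB -- the interface name and methods are assumptions for discussion, not the Morphe2 API:

  Interface RBLexer
    Sub SetSource(source As String)   // hand the lexer a chunk of code
    Function NextToken() As Token     // Nil once the chunk is exhausted
    Sub SkipToNextLine()              // fast-forward past the current line
  End Interface

Each candidate (TTsRBLexer, Seth's lexer, etc.) would implement this, so a single benchmark and unit-test harness could drive them all interchangeably.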
From: Thomas T. <tt...@te...> - 2008-02-21 10:06:45
I'd also like to point out that there's already code that reformats source code with proper indentation. We created and tested that code two years ago, and I'm using it in the editor I am working on now. It does not use a lexer but simpler algorithms, and while it's still possible to fool it with some very unusual constructs, it does a pretty good job in general.

It also has several functions that let you keep a cursor position while the source gets reformatted, which makes it suitable for use while someone edits the code in an EditField.

The source comes as a module, and can be found in the RBPrjTools on my website: http://www.tempel.org/rb/#prjtool

Thomas
From: Seth V. <se...@bk...> - 2008-02-21 14:35:13
It might be easier to decide some of these questions if we can see what's in progress. Do we want to exchange early prototype code on this list, or set up an area in Subversion for that? I've got a lexer that I think addresses most of the areas below and shouldn't be difficult to extend; it's also pretty quick, I think, but if there's a faster way I'm open to suggestions.

On Feb 20, 2008, at 10:03 PM, Joe Strout wrote:

> - Syntax coloring/styling
> - Feeding tokens to a parser (which may or may not need noncoding
>   tokens, e.g. whitespace and comments)
> - Edit-time error reporting (i.e. notifying the user of malformed
>   tokens right away)
> - Other fancy editor features (code folding, paren matching, etc.)
> - Declaration mining (i.e. finding declared variables, classes,
>   methods, etc.)
>
> and if the "chunk" includes multiple lines, it should be able to
> skip to the next line (so that if you're looking for an "End Method"
> you can skip any lines that don't start with that).

It seems like having this functionality in the lexer would limit the implementation options. Unless you're just talking about syntactic sugar for skipping tokens until an end-of-line is reached, this seems to imply a line-oriented buffering strategy. That may be a good way of buffering input, but it doesn't seem like we need to settle on that in advance.

> The information we need about any token would be:
> - its type (identifier, operator, string literal, color literal, etc.)
> - its value (which operator, what string, etc.) -- or maybe just its
>   raw text?

Currently I'm just storing raw text, although it wouldn't be very difficult to store the decoded value. For what we're doing, it seems like the raw text is enough. Since we're not actually compiling the code, it isn't guaranteed that we'll even need the value in any other form.

> - its position and extent in the source chunk (in bytes or characters?)

Deciding this in advance also seems unnecessarily limiting. If the lexer uses RB's built-in string handling, then characters are going to be the natural choice. If the lexer uses a MemoryBlock, then bytes are going to be the natural choice.

> - whether it is malformed (e.g. &h3G, or 123.345.7)

Hmm, I can see that a unit testing facility is going to come in handy. My code will translate those as a hexLiteral followed by an identifier, and as two floating-point numbers. They would be caught as invalid syntax by the parser, though.

> I don't think we need the lexer itself to do any actual error
> reporting; it should probably just return the next token, whatever it
> is, but note (via that last flag in the list above) if the token
> appears invalid.

I agree. I started out throwing exceptions on illegal input, but that's not really going to play nice with the parser.
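Those two malformed examples are exactly what a unit-test suite should pin down early. A sketch of what such a test might assert, given the maximal-munch behavior Seth describes -- lex and AssertEqual are hypothetical stand-ins, building on the Token/RBLexer sketches above:

  // "&h3G"      -> the hex scanner stops at the first non-hex digit:
  //                &h3 (numeric literal), then G (identifier)
  // "123.345.7" -> two floating-point literals: 123.345 and .7
  Dim t As Token
  lex.SetSource("&h3G")
  t = lex.NextToken()
  AssertEqual(kTokenNumericLiteral, t.TokenType)   // "&h3"
  t = lex.NextToken()
  AssertEqual(kTokenIdentifier, t.TokenType)       // "G"

Whether that split, or a single token flagged as malformed, is the right answer is what the rest of the thread works out.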
From: Thomas T. <tt...@te...> - 2008-02-21 14:49:08
On 21.02.2008 15:35 Uhr, "Seth Verrinder" <se...@bk...> wrote:

>> I don't think we need the lexer itself to do any actual error
>> reporting; it should probably just return the next token, whatever it
>> is, but note (via that last flag in the list above) if the token
>> appears invalid.
>
> I agree. I started out throwing exceptions on illegal input, but
> that's not really going to play nice with the parser.

Yes, exceptions are not smart in a lexer that has to deal with input that is likely to have errors in it.

Here's how my lexer solves it: when it encounters an error, e.g. a string literal with a missing quote sign at the end, it returns a special "error" token. The parser can then simply skip to the next line to recover. This works well because RB is a line-oriented language (as opposed to C, for instance).

Care has to be taken, though, that skipping to the next line also skips past lines ending with "_" (line continuations).

Thomas
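The "_" wrinkle is easy to get wrong, so here is a token-level sketch of a continuation-aware skip, assuming (as in the sketches above) that the lexer emits a kTokenEOL token at each line end and classifies a trailing "_" as kTokenLineContinuation. The raw-buffer scan Joe suggests later in the thread would be the faster path, but it needs the same continuation check:

  // Skip to the next *logical* line: keep consuming physical lines
  // for as long as the line just consumed ended in a "_" continuation.
  Sub SkipToNextLogicalLine(lex As RBLexer)
    Dim t As Token
    Dim continued As Boolean
    Do
      continued = False
      t = lex.NextToken()
      While t <> Nil And t.TokenType <> kTokenEOL
        If t.TokenType = kTokenLineContinuation Then continued = True
        t = lex.NextToken()
      Wend
    Loop Until t = Nil Or Not continued
  End Sub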
From: Joe S. <jo...@st...> - 2008-02-21 15:18:10
On Feb 21, 2008, at 7:48 AM, Thomas Tempelmann wrote:

> Yes, exceptions are not smart in a lexer that has to deal with input
> that is likely to have errors in it.
>
> Here's how my lexer solves it: when it encounters an error, e.g. a
> string literal with a missing quote sign at the end, it returns a
> special "error" token. The parser can then simply skip to the next
> line to recover.

Yes, I've used that approach in the past too, but found that it tends to cause complications further down with subsequent tools.

Say, for example, that you are using the lexer for syntax coloring. Just because somebody has typed

  x = &h3G04 + Val(foo)

and &h3G04 is a malformed token, doesn't mean that we don't want to still correctly color Val, foo, and the operator and parens. In fact, I'd prefer to see &h3G04 correctly colored like other hex literals too, just with a red dashed underline like a misspelled word in Mail or TextEdit.

As another example, suppose we're feeding the above into a parser to generate an AST, which we're then going to use to standardize the spacing or whatever. If the lexer returns &h3G04 as a hex literal with a "malformed" flag set, the parser (which looks mainly at the token type) will continue to work, and we'll get a Statement node containing an Assignment with an Expression on the right-hand side, as usual. But if the lexer returns &h3G04 as an error token, then our parser will break, unless our grammar is riddled with special cases to handle an error at almost any point. We won't get back a proper AST node, and won't be able to format the line.

Of course the case of a missing close quote is different, since that really does screw things up all the way to the end of the line. But there are other malformed tokens that are pretty easy to continue past, and it'd be nice to treat them in a uniform manner.

> This works well because RB is a line-oriented language (as opposed to
> C, for instance)

Yes, this is a real advantage for us -- it makes it easy to search for things that you know have to be at the start of a line, like block openers/closers.

> Care has to be taken, though, that skipping to the next line also
> skips past lines ending with "_" (line continuations).

Yes again, that's a wrinkle to be constantly aware of. Even RB doesn't always handle this feature very well (e.g., breakpoints on continued lines).

Best,
- Joe
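To illustrate the payoff of a "malformed" flag over an error token, here is what a coloring pass might look like -- ApplyStyle, UnderlineRange, and ColorForType are hypothetical stand-ins for whatever the editor actually provides, on top of the Token sketch above:

  // Color every token by its type; malformed tokens keep their type's
  // color and merely pick up the red dashed error underline.
  Dim t As Token
  t = lex.NextToken()
  While t <> Nil
    ApplyStyle(t.Position, t.Length, ColorForType(t.TokenType))
    If t.IsMalformed Then
      UnderlineRange(t.Position, t.Length)   // like a misspelled word
    End If
    t = lex.NextToken()
  Wend

With an error-token design, &h3G04 would instead collapse into an undifferentiated "error" token and lose its identity as a hex literal, even though the rest of the line could still be colored.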
From: Thomas T. <tt...@te...> - 2008-02-21 15:20:49
On 21.02.2008 16:17 Uhr, "Joe Strout" <jo...@st...> wrote:

> But if the lexer returns &h3G04 as an error token, then our parser
> will break, unless our grammar is riddled with special cases to
> handle an error at almost any point.

Good points. Maybe the way ANTLR works with "channels" could help here?
From: Joe S. <jo...@st...> - 2008-02-21 15:30:43
On Feb 21, 2008, at 8:20 AM, Thomas Tempelmann wrote:

>> But if the lexer returns &h3G04 as an error token, then our parser
>> will break, unless our grammar is riddled with special cases to
>> handle an error at almost any point.
>
> Good points. Maybe the way ANTLR works with "channels" could help here?

Could be -- that is a pretty clever system. At the lexer level, though, I figure it will suffice to have a couple of mode flags:

 - should the lexer skip noncoding tokens
   (whitespace/comments/line-continuation)?
 - should the lexer skip the rest of the line after detecting an error?

And actually, maybe we don't need the second one, since whatever's invoking the lexer could easily watch for an error flag itself, and call the lexer's SkipToNextLine (or whatever) method if that's what it wants to do.

Best,
- Joe
From: Seth V. <se...@bk...> - 2008-02-21 15:49:02
On Feb 21, 2008, at 9:30 AM, Joe Strout wrote:

> - should the lexer skip the rest of the line after detecting an error?
>
> And actually, maybe we don't need the second one, since whatever's
> invoking the lexer could easily watch for an error flag itself, and
> call the lexer's SkipToNextLine (or whatever) method if that's what
> it wants to do.

My vote's on leaving it up to the calling code to take whatever action makes sense. It may be different depending on what the lexer is being used for, so why clutter the interface?

Seth
From: Joe S. <jo...@st...> - 2008-02-21 15:55:00
On Feb 21, 2008, at 8:48 AM, Seth Verrinder wrote:

>> - should the lexer skip the rest of the line after detecting an
>>   error?
>>
>> And actually, maybe we don't need the second one, since whatever's
>> invoking the lexer could easily watch for an error flag itself, and
>> call the lexer's SkipToNextLine (or whatever) method if that's what
>> it wants to do.
>
> My vote's on leaving it up to the calling code to take whatever
> action makes sense. It may be different depending on what the lexer
> is being used for, so why clutter the interface?

Good call. I agree. And maybe we should apply the same logic to skipping noncoding tokens. It's standard in CS lexers to do this, i.e. for lexers to scan right past whitespace (and often comments). But that's because they're not going to do anything with the tokens except compile them. We're going to be doing other things, so maybe we should just always return everything you would need to rebuild the original code; if the caller doesn't need those tokens, it can discard them based on their type.

Best,
- Joe
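Under that design the lexer needs no mode flags at all, and filtering collapses to a few lines in the caller. A sketch, again using the hypothetical names from earlier in the thread:

  // Parser-facing wrapper: returns the next coding token, silently
  // discarding whitespace, comments, and line continuations.
  Function NextCodingToken(lex As RBLexer) As Token
    Dim t As Token
    t = lex.NextToken()
    While t <> Nil
      Select Case t.TokenType
      Case kTokenWhitespace, kTokenComment, kTokenLineContinuation
        t = lex.NextToken()   // noncoding -- skip and keep going
      Case Else
        Return t
      End Select
    Wend
    Return Nil
  End Function

A syntax colorer would call NextToken directly and see everything; a parser would call this wrapper and see only coding tokens.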
From: Joe S. <jo...@st...> - 2008-02-21 15:52:44
On Feb 21, 2008, at 7:35 AM, Seth Verrinder wrote:

> It might be easier to decide some of these questions if we can see
> what's in progress. Do we want to exchange early prototype code on
> this list, or set up an area in Subversion for that?

We should use Subversion. I like your idea of an early prototype area, and then perhaps a main development area where we integrate things.

> I've got a lexer that I think addresses most of the areas below and
> shouldn't be difficult to extend; it's also pretty quick, I think,
> but if there's a faster way I'm open to suggestions.

From what I've seen, I agree: yours does 95% of what we need, and very fast too. However, at the design stage I prefer to focus on what we want, rather than what we have, to avoid settling for what we have rather than what we really want. :)

I'll try to get that Subversion area set up today. Seth, are you OK releasing the code that you sent me earlier under the MIT license?

>> and if the "chunk" includes multiple lines, it should be able to
>> skip to the next line (so that if you're looking for an "End Method"
>> you can skip any lines that don't start with that).
>
> It seems like having this functionality in the lexer would limit the
> implementation options. Unless you're just talking about syntactic
> sugar for skipping tokens until an end-of-line is reached, this seems
> to imply a line-oriented buffering strategy.

No, I don't think it implies anything about the buffering. In UTF-8, an end-of-line character can't occur as part of any other character. So if you have one big continuous buffer, the skip-to-next-line method can simply scan ahead in the buffer looking for the next EOL. This would be quite a bit faster than grabbing and discarding tokens along the way.

> That may be a good way of buffering input, but it doesn't seem like
> we need to settle on that in advance.

Agreed; implementation details should be hidden, and mostly irrelevant, from the outside.

>> The information we need about any token would be:
>> - its type (identifier, operator, string literal, color literal, etc.)
>> - its value (which operator, what string, etc.) -- or maybe just its
>>   raw text?
>
> Currently I'm just storing raw text, although it wouldn't be very
> difficult to store the decoded value. For what we're doing, it seems
> like the raw text is enough. Since we're not actually compiling the
> code, it isn't guaranteed that we'll even need the value in any other
> form.

Good point. I'm convinced.

>> - its position and extent in the source chunk (in bytes or
>>   characters?)
>
> Deciding this in advance also seems unnecessarily limiting.

But this is absolutely necessary, since it's part of the lexer interface. We should be able to swap out one lexer for another (e.g. because we decide to move lexing to a plugin someday) without breaking the rest of the system. But that means that, if we provide positions at all, we have to specify in advance whether those are characters or bytes.

...But, on the other hand, there's no telling which the caller is going to need. It depends entirely on what they're going to do with that information. It's always possible to convert from one to the other, but this conversion can be expensive (at least when going from bytes to chars). I'm starting to wonder if maybe the lexer should keep track of both; it's iterating over both bytes and characters already, so this wouldn't be a lot of extra work for it. Then the caller can use whichever it needs, and we wash ourselves of that headache early on.

Another thing to think about: should the token positions be absolute positions relative to the entire source chunk the lexer was given, or should they be relative to the start of the line they're on? Positions within the line are usually more useful, since almost anything you would do with a lexer (syntax coloring, code formatting, even compiling) tends to work on either the line or the statement level. But at the moment, I'm inclined to feel that the lexer shouldn't be worrying about this -- it's easy enough for the caller to keep track of line or statement positions itself (especially if the lexer returns a start-of-line or end-of-line token).

Best,
- Joe
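Both points here (scanning raw bytes for EOL, and tracking byte and character positions together) rest on the same UTF-8 properties: bytes below &h80 never occur inside a multi-byte character, and every character contributes exactly one byte outside the continuation range &h80-&hBF. A sketch of a scan loop exploiting this -- the mSource/mBytePos/mCharPos members are hypothetical lexer internals, with the source held in a MemoryBlock:

  // Advance to just past the next EOL, keeping the byte and character
  // counts in sync along the way.
  Sub SkipToNextLine()
    Dim b As Integer
    While mBytePos < mSource.Size
      b = mSource.Byte(mBytePos)
      mBytePos = mBytePos + 1
      If b < &h80 Or b > &hBF Then
        mCharPos = mCharPos + 1   // non-continuation byte starts a character
      End If
      If b = 10 Then Return   // LF found (CR/CRLF handling omitted)
    Wend
  End Sub

Because LF (10) is below &h80, the raw byte scan can never mistake part of a multi-byte character for an end of line; that is the guarantee that makes this safe. (The "_" continuation check from earlier in the thread would layer on top.)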
From: Bob K. <bo...@bk...> - 2008-02-21 16:19:46
Yes, BKeeney Software is releasing the code under the MIT license.

Bob Keeney
BKeeney Software Inc.

On Feb 21, 2008, at 9:52 AM, Joe Strout wrote:

> I'll try to get that Subversion area set up today. Seth, are you OK
> releasing the code that you sent me earlier under the MIT license?
From: Thomas T. <tt...@te...> - 2008-02-21 18:36:54
What does the MIT license say about re-using the code in a closed-source app?

E.g., while I like to contribute most of my work to others for free, I am also working on a "pro" app which I'd like to sell (it shall be an RB source code editor with features for project comparison, mainly). Must I keep my fingers off your contributions for that? Or can we handle it so that the components themselves, should I improve them, are the only thing I have to give back?

I mean, as much as I like the open source idea in general, in this case I'd prefer a solution where only those parts that we exchange have to be kept open, not the entire projects that include some of the open parts.

Thomas
From: Joe S. <jo...@st...> - 2008-02-21 18:57:00
On Feb 21, 2008, at 11:36 AM, Thomas Tempelmann wrote:

> What does the MIT license say about re-using the code in a
> closed-source app?

It's fine with it.

> E.g., while I like to contribute most of my work to others for free,
> I am also working on a "pro" app which I'd like to sell (it shall be
> an RB source code editor with features for project comparison,
> mainly).

I see no problem with that.

> I mean, as much as I like the open source idea in general, in this
> case I'd prefer a solution where only those parts that we exchange
> have to be kept open, not the entire projects that include some of
> the open parts.

Right, me too.

Best,
- Joe
From: Seth V. <se...@bk...> - 2008-02-21 16:56:08
>> It seems like having this functionality in the lexer would limit the
>> implementation options. Unless you're just talking about syntactic
>> sugar for skipping tokens until an end-of-line is reached, this seems
>> to imply a line-oriented buffering strategy.
>
> No, I don't think it implies anything about the buffering. In UTF-8,
> an end-of-line character can't occur as part of any other character.
> So if you have one big continuous buffer, the skip-to-next-line
> method can simply scan ahead in the buffer looking for the next EOL.
> This would be quite a bit faster than grabbing and discarding tokens
> along the way.

Excellent point.

>>> - its position and extent in the source chunk (in bytes or
>>>   characters?)
>>
>> Deciding this in advance also seems unnecessarily limiting.
>
> But this is absolutely necessary, since it's part of the lexer
> interface. We should be able to swap out one lexer for another (e.g.
> because we decide to move lexing to a plugin someday) without
> breaking the rest of the system. But that means that, if we provide
> positions at all, we have to specify in advance whether those are
> characters or bytes.
>
> ...But, on the other hand, there's no telling which the caller is
> going to need. It depends entirely on what they're going to do with
> that information. It's always possible to convert from one to the
> other, but this conversion can be expensive (at least when going from
> bytes to chars).

I was thinking more along the lines of treating the position as an abstract type. If you have routines to extract a substring and replace the text between two positions, what difference does it make whether it's bytes or characters? That works well for the uses I've got in mind, but other people might have different ideas :) The caveat would be that people would have to resist the urge to assume that it's one or the other.

> Another thing to think about: should the token positions be absolute
> positions relative to the entire source chunk the lexer was given,
> or should they be relative to the start of the line they're on?
> Positions within the line are usually more useful, since almost
> anything you would do with a lexer (syntax coloring, code formatting,
> even compiling) tends to work on either the line or the statement
> level. But at the moment, I'm inclined to feel that the lexer
> shouldn't be worrying about this -- it's easy enough for the caller
> to keep track of line or statement positions itself (especially if
> the lexer returns a start-of-line or end-of-line token).

I prefer to have the positions be absolute. It seems like it will be easier that way down the line, because the positions can be propagated to the AST and then used to do substitutions back into the original code based on modifications to the tree. The position within the line is better for error messages and such, but that can be handled pretty easily, as you pointed out.
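For what it's worth, Seth's abstract-position idea might look like this in RB -- an opaque class plus the two operations he names. All names are hypothetical, and the representation (here, both offsets, per Joe's earlier suggestion) stays hidden behind the operations:

  Class SourcePosition
    ByteOffset As Integer   // representation detail; callers should
    CharOffset As Integer   // treat SourcePosition as opaque
  End Class

  // Extract the text between two positions (MidB is 1-based; LeftB
  // takes a byte count -- both are RB's byte-oriented string functions).
  Function Substring(src As String, startPos As SourcePosition, endPos As SourcePosition) As String
    Return MidB(src, startPos.ByteOffset + 1, endPos.ByteOffset - startPos.ByteOffset)
  End Function

  // Return a copy of src with the text between two positions replaced.
  Function ReplaceRange(src As String, startPos As SourcePosition, endPos As SourcePosition, replacement As String) As String
    Return LeftB(src, startPos.ByteOffset) + replacement + MidB(src, endPos.ByteOffset + 1)
  End Function

Carrying both offsets inside the class also answers the "resist the urge to assume" caveat: with nothing exposed, there is nothing to assume.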