Re: [RBMetacode] Lexer thoughts

On Feb 21, 2008, at 7:48 AM, Thomas Tempelmann wrote:

> Yes, exceptions are not smart in a lexer that has to deal with input
> that is likely to have errors in it.
>
> Here's how my Lexer solves it: When it encounters an error, e.g. a
> string literal with a missing quote sign at the end, it returns a
> special "error" token. The parser can then simply skip to the next
> line to recover from it.

Yes, I've used that approach in the past too, but found that it tends
to cause complications for downstream tools.  Say, for example, that
you are using the lexer for syntax coloring.  Just because somebody
has typed

  x = &h3G04 + Val(foo)

and &h3G04 is a malformed token doesn't mean that we no longer want
to color Val, foo, and the operator and parens correctly.  In fact,
I'd prefer to see &h3G04 colored like other hex literals too, just
with a red dashed underline like a misspelled word in Mail or
TextEdit.
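
To make that concrete, here is a rough sketch in Python (not RB, and
not anyone's actual lexer) of a tokenizer that keeps &h3G04 as a hex
token and merely flags it.  The Token class, the token rules, and the
names lex and HEX_OK are all made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    kind: str                # "hex", "number", "ident", "op" or "space"
    text: str
    malformed: bool = False  # set when the text is not a valid token of its kind


# Hypothetical token rules. The hex rule deliberately swallows any
# trailing letters or digits so that "&h3G04" stays a single token.
TOKEN_RE = re.compile(r"""
    (?P<hex>&[hH][0-9A-Za-z]*)
  | (?P<number>\d+)
  | (?P<ident>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<space>\s+)
  | (?P<op>.)
""", re.VERBOSE)

# A well-formed hex literal: "&h" followed by one or more hex digits.
HEX_OK = re.compile(r"&[hH][0-9A-Fa-f]+\Z")


def lex(line):
    """Split one source line into tokens, flagging bad hex literals."""
    tokens = []
    for m in TOKEN_RE.finditer(line):
        kind, text = m.lastgroup, m.group()
        bad = kind == "hex" and not HEX_OK.match(text)
        tokens.append(Token(kind, text, bad))
    return tokens


for tok in lex("x = &h3G04 + Val(foo)"):
    if tok.kind != "space":
        print(tok.kind, tok.text, "(malformed)" if tok.malformed else "")
```

A syntax colorer could then pick the color from tok.kind as usual and
add the underline only where tok.malformed is set.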

As another example, suppose we're feeding the above into a parser to
generate an AST, which we'll then use to standardize the spacing or
do some other reformatting.  If the lexer returns &h3G04 as a hex literal
with a "malformed" flag set, the parser (which looks mainly at the  
token type) will continue to work, and we'll get a Statement node  
containing an Assignment with an Expression on the right-hand side,  
as usual.  But if the lexer returns &h3G04 as an error token, then  
our parser will break, unless our grammar is riddled with special  
cases to handle an error at almost any point.  We won't get back a  
proper AST node, and won't be able to format the line.
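
A toy version of that difference, again in Python and purely as a
sketch: the tiny grammar (assignment, binary operators, one-argument
calls) and the names Parser, ParseError, and the tuple-shaped AST nodes
are assumptions for illustration, not a real RB grammar.

```python
from collections import namedtuple

Token = namedtuple("Token", "kind text malformed")


class ParseError(Exception):
    pass


class Parser:
    """Recursive-descent parser that dispatches only on token kind."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, kind, text=None):
        tok = self.peek()
        if tok is None or tok.kind != kind or (text is not None and tok.text != text):
            raise ParseError(f"expected {text or kind} at token {self.pos}")
        self.pos += 1
        return tok

    def statement(self):
        target = self.take("ident")
        self.take("op", "=")
        node = ("Assignment", target.text, self.expression())
        if self.peek() is not None:
            raise ParseError(f"unexpected token {self.peek().text!r}")
        return node

    def expression(self):
        left = self.term()
        while (tok := self.peek()) and tok.kind == "op" and tok.text in ("+", "-", "*", "/"):
            self.pos += 1
            left = ("BinaryOp", tok.text, left, self.term())
        return left

    def term(self):
        tok = self.peek()
        if tok is None:
            raise ParseError("unexpected end of line")
        if tok.kind == "hex":
            # The malformed flag rides along; the grammar never looks at it.
            self.pos += 1
            return ("HexLiteral", tok.text, tok.malformed)
        if tok.kind == "ident":
            self.pos += 1
            nxt = self.peek()
            if nxt and nxt.kind == "op" and nxt.text == "(":
                self.pos += 1
                arg = self.expression()
                self.take("op", ")")
                return ("Call", tok.text, arg)
            return ("Name", tok.text)
        raise ParseError(f"unexpected {tok.kind} token {tok.text!r}")


def T(kind, text, malformed=False):
    return Token(kind, text, malformed)


# Tokens for:  x = &h3G04 + Val(foo)   (whitespace already dropped)
flagged = [T("ident", "x"), T("op", "="), T("hex", "&h3G04", True), T("op", "+"),
           T("ident", "Val"), T("op", "("), T("ident", "foo"), T("op", ")")]
print(Parser(flagged).statement())
```

Swap the third token for T("error", "&h3G04") and term() raises
ParseError, because nothing in this grammar knows what to do with an
"error" token at that position.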

Of course the case of a missing close quote is different, since that  
really does screw things up all the way to the end of the line.  But  
there are other malformed tokens that are pretty easy to continue  
past, and it'd be nice to treat them in a uniform manner.
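
Even the missing-quote case can be reported the same way, as a string
token that runs to the end of the line with the flag set.  A minimal
Python sketch, where lex_string is a hypothetical helper and the
doubled-quote escape is my assumption about RB string literals:

```python
def lex_string(line, start):
    """Lex a string literal that begins at line[start] == '"'.

    Returns (text, malformed, next_pos). A missing close quote yields a
    token that extends to the end of the line with malformed=True.
    """
    i = start + 1
    while i < len(line):
        if line[i] == '"':
            # Assumption: a doubled quote inside a literal is an escaped quote.
            if i + 1 < len(line) and line[i + 1] == '"':
                i += 2
                continue
            return line[start:i + 1], False, i + 1
        i += 1
    return line[start:], True, len(line)


print(lex_string('s = "abc" + t', 4))  # closed literal
print(lex_string('s = "abc + t', 4))   # missing close quote
```

The caller still gets a "string" token either way, so a colorer or
parser that only looks at the token type keeps working; it is just
that nothing else on that line can be lexed after the bad literal.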

> This works well because RB is a line-oriented language (as opposed
> to C, for instance)

Yes, this is a real advantage for us -- it makes it easy to search  
for things that you know have to be at the start of a line, like  
block openers/closers.

> Only care has to be taken that skipping to the next line means that
> lines with "_" at their end need to be skipped as well.

Yes again, that's a wrinkle to be constantly aware of.  Even RB  
doesn't always handle this feature very well (e.g., breakpoints on  
continued lines).
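
For what it's worth, the skip itself is short.  A Python sketch with
hypothetical helper names, assuming that any physical line whose last
non-blank character is "_" is continued (it ignores comments and
identifiers that happen to end in an underscore):

```python
def is_continued(line):
    """True if the line ends with the "_" continuation marker."""
    return line.rstrip().endswith("_")


def next_logical_line(lines, index):
    """Return the index of the first physical line after the logical
    line that contains lines[index], skipping over continuations."""
    while index < len(lines) and is_continued(lines[index]):
        index += 1
    return min(index + 1, len(lines))


source = [
    "x = &h3G04 + _",
    "    Val(foo)",
    "y = 1",
]
print(next_logical_line(source, 0))  # error on line 0: resume at index 2
```

Error recovery that jumps with next_logical_line rather than index + 1
resumes at "y = 1" instead of in the middle of the broken statement.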

Best,
- Joe