Re: [RBMetacode] Lexer thoughts
From: Joe S. <jo...@st...> - 2008-02-21 15:18:10
On Feb 21, 2008, at 7:48 AM, Thomas Tempelmann wrote:

> Yes, exceptions are not smart in a lexer that has to deal with input that is
> likely to have errors in it.
>
> Here's how my Lexer solves it: When it encounters an error, e.g. a string
> literal with a missing quote sign at the end, it returns a special "error"
> token. The parser can then skip simply to the next line to recover from it.

Yes, I've used that approach in the past too, but found that it tends to cause complications further down the line, in subsequent tools.

Say, for example, that you are using the lexer for syntax coloring. Just because somebody has typed

    x = &h3G04 + Val(foo)

and &h3G04 is a malformed token, that doesn't mean we don't still want to correctly color Val, foo, the operator, and the parens. In fact, I'd prefer to see &h3G04 colored like other hex literals too, just with a red dashed underline like a misspelled word in Mail or TextEdit.

As another example, suppose we're feeding the above into a parser to generate an AST, which we're then going to use to standardize the spacing or whatever. If the lexer returns &h3G04 as a hex literal with a "malformed" flag set, the parser (which looks mainly at the token type) will continue to work, and we'll get a Statement node containing an Assignment with an Expression on the right-hand side, as usual. But if the lexer returns &h3G04 as an error token, then our parser will break, unless our grammar is riddled with special cases to handle an error at almost any point. We won't get back a proper AST node, and won't be able to format the line.

Of course the case of a missing close quote is different, since that really does screw things up all the way to the end of the line. But there are other malformed tokens that are pretty easy to continue past, and it'd be nice to treat them in a uniform manner.
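To make the "malformed flag" idea concrete, here is a minimal sketch, written in Python rather than RB purely for brevity. Everything in it (the Token class, the pattern table, the lex function) is a hypothetical name made up for illustration, not part of any existing lexer, and the token set is far smaller than what RB would actually need. The point is only that the hex pattern is deliberately greedy, and validity is checked afterwards, so a bad literal still comes back as a HEX token.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    kind: str                # token type, e.g. "HEX", "IDENT", "OP"
    text: str                # the exact source text of the token
    malformed: bool = False  # True when text is not a valid instance of kind


# Hypothetical, heavily simplified token patterns (assumption: not a full RB grammar).
TOKEN_RE = re.compile(r"""
    (?P<HEX>&h[0-9A-Za-z]*)      # grab every alphanumeric after &h, validate later
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<NUMBER>\d+)
  | (?P<OP>[=+\-*/(),])
  | (?P<WS>\s+)
""", re.VERBOSE | re.IGNORECASE)

# A well-formed hex literal: &h followed by one or more hex digits, nothing else.
VALID_HEX = re.compile(r"&h[0-9a-f]+\Z", re.IGNORECASE)


def lex(line):
    """Tokenize one line; never raises, flags bad tokens instead."""
    tokens = []
    pos = 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            # Unrecognized character: emit it as a flagged token and keep going.
            tokens.append(Token("UNKNOWN", line[pos], malformed=True))
            pos += 1
            continue
        pos = m.end()
        kind = m.lastgroup
        if kind == "WS":
            continue
        text = m.group()
        bad = kind == "HEX" and VALID_HEX.match(text) is None
        tokens.append(Token(kind, text, malformed=bad))
    return tokens


for tok in lex("x = &h3G04 + Val(foo)"):
    print(tok)
```

A syntax colorer can then pick the color from tok.kind and add the red underline when tok.malformed is set, while a parser that dispatches on tok.kind alone sees an ordinary hex literal.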
> This works well because RB is a line-oriented language (as opposed to C, for
> instance)

Yes, this is a real advantage for us -- it makes it easy to search for things that you know have to be at the start of a line, like block openers/closers.

> Only care has to be taken that skipping to next line means that lines with
> "_" at their end needs to be skipped as well.

Yes again, that's a wrinkle to be constantly aware of. Even RB doesn't always handle this feature very well (e.g., breakpoints on continued lines).

Best,
- Joe
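As a footnote on the "_" continuation wrinkle, here is a rough sketch of what "skip to the next line" has to mean once continued lines are taken into account. It is in Python rather than RB for brevity, and next_statement_start is a hypothetical helper name. It also makes a naive assumption: any line whose last non-blank character is "_" is treated as continued, which ignores cases such as a trailing comment or an identifier that happens to end in an underscore.

```python
def next_statement_start(lines, i):
    """Return the index of the first physical line after the logical line at i.

    Assumption: a physical line is continued when its last non-blank
    character is "_". A real lexer would decide this from tokens instead.
    """
    # Walk past every physical line that continues onto the next one.
    while i < len(lines) and lines[i].rstrip().endswith("_"):
        i += 1
    # i now points at the final physical line of the statement; step past it,
    # without running off the end if the source stops on a dangling "_".
    return min(i + 1, len(lines))


source = [
    "x = &h3G04 + _",
    "    Val(foo) + _",
    "    1",
    "y = 2",
]
print(next_statement_start(source, 0))
```

With this, error recovery resumes at "y = 2" rather than in the middle of the continued statement.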