Re: [RBMetacode] Lexer thoughts
From: Seth V. <se...@bk...> - 2008-02-21 14:35:13
It might be easier to decide some of these questions if we can see what's in progress. Do we want to exchange early prototype code on this list, or set up an area in Subversion for that? I've got a lexer that I think addresses most of the areas below and shouldn't be difficult to extend. It's also pretty quick, I think, but if there's a faster way I'm open to suggestions.

On Feb 20, 2008, at 10:03 PM, Joe Strout wrote:

> - Syntax coloring/styling
> - Feeding tokens to a parser
> (which may or may not need noncoding tokens, e.g. whitespace and
> comments)
> - Edit-time error reporting (i.e. notifying the user of malformed
> tokens right away)
> - Other fancy editor features (code folding, paren matching, etc.)
> - Declaration mining (i.e. finding declared variables, classes,
> methods, etc.)
> and if the "chunk" includes multiple lines, it
> should be able to skip to the next line (so that if you're looking
> for an "End Method" you can skip any lines that don't start with
> that).

It seems like having this functionality in the lexer would limit the implementation options. Unless you're just talking about syntactic sugar for skipping tokens until an end-of-line token is reached, this seems to imply a line-oriented buffering strategy. That may be a good way of buffering input, but it doesn't seem like we need to settle on it in advance.

> The information we need about any token would be:
> - its type (identifier, operator, string literal, color literal,
> etc.)
> - its value (which operator, what string, etc.) -- or maybe just
> its raw text?

Currently I'm just storing raw text, although it wouldn't be very difficult to store the decoded value as well. For what we're doing, the raw text seems to be enough; since we're not actually compiling the code, there's no guarantee we'll even need the value in any other form.

> - its position and extent in the source chunk (in bytes or
> characters?)

Deciding this in advance also seems unnecessarily limiting. If the lexer uses RB's built-in string handling, then characters are the natural choice; if it uses a MemoryBlock, then bytes are.

> - whether it is malformed (e.g. &h3G, or 123.345.7)

Hmm, I can see that a unit-testing facility is going to come in handy. My code will translate those as a hexLiteral followed by an identifier, and as two floating-point numbers; they would be caught as invalid syntax by the parser, though.

> I don't think we need the lexer itself to do any actual error
> reporting; it should probably just return the next token, whatever it
> is, but note (via that last flag in the list above) if the token
> appears invalid.

I agree. I started out throwing exceptions on illegal input, but that's not really going to play nicely with the parser.

> What do y'all think? Are we missing anything?
>
> Best,
> - Joe
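
For illustration only, here is a minimal sketch of the token record and lexing behavior being discussed. It is written in Python rather than RB, and every name in it (Token, tokenize, the token-type strings) is hypothetical, not taken from the actual RBMetacode lexer. It shows the four pieces of per-token information from Joe's list and how inputs like &h3G and 123.345.7 can come out as well-formed tokens that only the parser later rejects:

# Illustrative sketch only -- not the actual RBMetacode lexer.
# Token fields mirror the list in Joe's message: type, raw text,
# position and extent, and a "malformed" flag.
import re
from dataclasses import dataclass

@dataclass
class Token:
    type: str        # e.g. "hexLiteral", "number", "identifier", "operator"
    text: str        # raw source text; decoding the value is left to callers
    start: int       # offset into the source chunk (characters, in this sketch)
    length: int
    malformed: bool = False   # the lexer itself does no error reporting

TOKEN_RE = re.compile(r"""
      (?P<hexLiteral>&[hH][0-9A-Fa-f]+)
    | (?P<number>\d+(?:\.\d+)?|\.\d+)
    | (?P<identifier>[A-Za-z_]\w*)
    | (?P<endOfLine>\r\n|\r|\n)
    | (?P<whitespace>[ \t]+)
    | (?P<operator>[^\w\s])
""", re.VERBOSE)

def tokenize(source):
    """Return every token (including whitespace and endOfLine) in order."""
    return [Token(m.lastgroup, m.group(), m.start(), m.end() - m.start())
            for m in TOKEN_RE.finditer(source)]

# "&h3G" lexes as a hexLiteral ("&h3") followed by an identifier ("G"),
# and "123.345.7" as two floating-point numbers ("123.345" and ".7") --
# as described above, it is the parser that flags these as invalid syntax.
if __name__ == "__main__":
    for t in tokenize("&h3G 123.345.7"):
        print(t.type, repr(t.text))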
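And a similarly hypothetical sketch of the "syntactic sugar" reading of skip-to-next-line: a helper that discards tokens up to and including the next endOfLine token from the sketch above, with no line-oriented buffering assumed:

# Hypothetical helper, building on the tokenize() sketch above.
def skip_to_next_line(tokens, i):
    """Return the index just past the next endOfLine token (or len(tokens))."""
    while i < len(tokens):
        is_eol = tokens[i].type == "endOfLine"
        i += 1
        if is_eol:
            break
    return i

# Example: when scanning for "End Method", a caller can look at the first
# non-whitespace token on a line and, if it isn't "End", jump ahead:
#     i = skip_to_next_line(tokens, i)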