Re: [RBMetacode] Lexer thoughts
From: Seth V. <se...@bk...> - 2008-02-21 16:56:08
>> It seems like having this functionality in the lexer would limit the
>> implementation options. Unless you're just talking about syntactic
>> sugar for skipping tokens until an endofline is reached, this seems
>> to imply a line oriented buffering strategy.
>
> No, I don't think it implies anything about the buffering. In UTF-8,
> an end of line character can't occur as part of any other character.
> So if you have one big continuous buffer, then the skip-to-next-line
> method can simply scan ahead in the buffer looking for the next EOL.
> This would be quite a bit faster than grabbing and discarding tokens
> along the way.

Excellent point.

>>> - its position and extent in the source chunk (in bytes or
>>> characters?)
>>
>> Deciding this in advance also seems unnecessarily limiting.
>
> But this is absolutely necessary, since it's part of the lexer
> interface. We should be able to swap out one lexer for another (e.g.
> because we decide to move lexing to a plugin someday) without
> breaking the rest of the system. But that means that, if we provide
> positions at all, we have to specify in advance whether those are
> characters or bytes.
>
> ...But, on the other hand, there's no telling which the caller is
> going to need. It depends entirely on what they're going to do with
> that information. It's always possible to convert from one to the
> other, but this conversion can be expensive (at least when going from
> bytes to chars).

I was thinking more along the lines of treating the position as an
abstract type. If you have routines to extract a substring and replace
the text between two positions, what difference does it make if it's
bytes or characters? That works well for the uses I've got in mind, but
other people might have different ideas :) The caveat would be that
people would have to resist the urge to assume that it's one or the
other.
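To make the abstract-position idea a bit more concrete, here is a rough sketch in Python (not REALbasic, and not anything that exists in RBMetacode). The names `Position`, `SourceChunk`, `skip_to_next_line`, `substring` and `replace` are all hypothetical, made up for illustration. The position happens to wrap a byte offset, but callers only ever hand positions back to the chunk, so it could just as well be a character index. The skip-to-next-line method also shows Joe's point: since the bytes 0x0A and 0x0D never occur inside a multi-byte UTF-8 sequence, it can scan the raw buffer without decoding anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Position:
    """Opaque position in a source chunk (hypothetical type).

    Callers should compare positions and pass them back to SourceChunk,
    but never assume whether _offset counts bytes or characters.
    """
    _offset: int  # a byte offset in this sketch; an implementation detail


class SourceChunk:
    """One continuous UTF-8 buffer holding the source text (hypothetical)."""

    def __init__(self, text: str):
        self._buf = text.encode("utf-8")

    def start(self) -> Position:
        return Position(0)

    def end(self) -> Position:
        return Position(len(self._buf))

    def skip_to_next_line(self, pos: Position) -> Position:
        """Return the position just past the next EOL at or after pos.

        Scans raw bytes: CR and LF never appear as part of another
        character in UTF-8, so no decoding is needed.
        """
        buf, i, n = self._buf, pos._offset, len(self._buf)
        while i < n and buf[i] not in (0x0A, 0x0D):
            i += 1
        if i == n:
            return Position(n)  # no EOL left; stop at end of chunk
        if buf[i] == 0x0D and i + 1 < n and buf[i + 1] == 0x0A:
            return Position(i + 2)  # treat CR LF as a single EOL
        return Position(i + 1)

    def substring(self, a: Position, b: Position) -> str:
        """Text between two positions."""
        return self._buf[a._offset:b._offset].decode("utf-8")

    def replace(self, a: Position, b: Position, text: str) -> "SourceChunk":
        """New chunk with the text between a and b replaced."""
        new = SourceChunk("")
        new._buf = self._buf[:a._offset] + text.encode("utf-8") + self._buf[b._offset:]
        return new
```

With something like this, a caller can walk the chunk line by line with `skip_to_next_line` and pull out or replace text with `substring` and `replace`, without ever knowing what the positions count. The catch is the one mentioned above: nothing stops someone from peeking at the offset and assuming it is one or the other.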
> Another thing to think about: should the token positions be absolute
> positions relative to the entire source chunk the lexer was given --
> or should they be relative to the start of the line they're on?
> Positions within the line are usually more useful, since almost
> anything you would do with a lexer (syntax coloring, code formatting,
> even compiling) tends to work on either the line or the statement
> level. But at the moment, I'm inclined to feel that the lexer
> shouldn't be worrying about this -- it's easy enough for the caller
> to keep track of line or statement positions itself (especially if
> the lexer returns a start-of-line or end-of-line token).

I prefer to have the positions be absolute. It seems like it will be
easier that way down the line, because the positions can be propagated
to the AST and then used to do substitutions back into the original
code based on modifications to the tree. The position within the line
is better for error messages and such, but that can be handled pretty
easily as you pointed out.

> Best,
> - Joe
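A small sketch of what working with absolute positions might look like, again in Python and only as an illustration. The helper names `line_starts`, `to_line_col` and `apply_edits` are hypothetical, positions are assumed to be character offsets into a Python string, and line endings are assumed to be "\n". It shows the two uses mentioned above: turning an absolute position into a line and column for an error message, and substituting text back into the original source from a list of (start, end, replacement) edits such as a modified tree might produce.

```python
import bisect


def line_starts(source):
    """Absolute offsets at which each line begins (assumes '\n' endings)."""
    starts = [0]
    for i, ch in enumerate(source):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def to_line_col(starts, offset):
    """Map an absolute offset to a 1-based (line, column) pair."""
    line = bisect.bisect_right(starts, offset) - 1
    return line + 1, offset - starts[line] + 1


def apply_edits(source, edits):
    """Apply non-overlapping (start, end, replacement) edits.

    Offsets are absolute positions in the ORIGINAL source. Edits are
    applied from last to first so that earlier offsets stay valid.
    """
    for start, end, text in sorted(edits, reverse=True):
        source = source[:start] + text + source[end:]
    return source
```

For example, with `source = "x = 1\ny = foo(2)\n"`, the absolute offset of `foo` is 10, `to_line_col(line_starts(source), 10)` gives `(2, 5)`, and `apply_edits(source, [(0, 1, "total"), (10, 13, "bar")])` gives `"total = 1\ny = bar(2)\n"`. The line table is built once by the caller, so the lexer itself never has to track lines.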