[CEDET-devel] Re: [cedet-semantic] Newbie adventures
From: David P. <dav...@wa...> - 2003-12-03 10:41:48
Eric,

(I moved this thread to cedet-devel which seems more appropriate ;-)

[...]

>> 2. The LALR parser is entered; it calls wisent-lex each time it
>>    needs a lexical token.
>
> I would not be opposed to making this type of functionality available
> from the core lex support code. The fact that the lexical step
> analyzes the entire stream at once is a mechanism layered on the core
> analyzer, which creates one token at a time.

I think the current design is good for speed. Entering semantic-lex to
obtain each lexical element would noticeably slow down lexical
analysis. This is less critical for wisent-lex, whose code is very
simple and fast:

(define-wisent-lexer wisent-lex
  "Return the next available lexical token in Wisent's form.
The variable `wisent-lex-istream' contains the list of lexical tokens
produced by `semantic-lex'.  Pop the next token available and convert
it to a form suitable for the Wisent's parser."
  (let* ((tk (car wisent-lex-istream)))
    ;; Eat input stream
    (setq wisent-lex-istream (cdr wisent-lex-istream))
    (cons (semantic-lex-token-class tk)
          (cons (semantic-lex-token-text tk)
                (semantic-lex-token-bounds tk)))))

>> 3. Each time wisent-lex is called, it pops a semantic lexical token
>>    from the stream obtained in step 1 above. It translates it into a
>>    form understandable by wisent, and returns that form. More
>>    precisely:
>>
>>    semantic-lex form            wisent-lex form
>>    -------------------------    -------------------------------------
>>    (TOKEN-CLASS START . END) -> (TOKEN-CLASS TOKEN-VALUE START . END)
>
> [ ... ]
>
> Is TOKEN-VALUE different (as in a string or a number) for different
> values of TOKEN-CLASS? Should each analyzer be responsible for also
> providing a value?

TOKEN-VALUE is different for each token. The analyzer is just
responsible for providing the TOKEN-CLASS and the bounds of the
TOKEN-VALUE.

> It could be useful to also have the default output of semantic-lex
> match what you are using in wisent.
> All features of the token (class, start, end and value) already have
> accessor functions, so it should have no effect on token stream
> consumers.
>
> The reason I did not put a TOKEN-VALUE into the original lexical
> token (semantic v 0.1) was because some lexical entities, such as
> comments, strings, and lists would have very large values of very
> small worth (meaning they were seldom queried.)  Avoiding that made
> it faster (on a 486 50MHz.)  You may find that solving that type of
> problem (if it exists in wisent) could make wisent's lexical step a
> bit speedier.

The main difference between semantic-lex and a wisent lexer is that
the former is buffer oriented, whereas the latter is completely
independent of the lexical source. For example, wisent can be used to
parse a string (this is what wisent-expr does). This is why token
bounds are optional in a wisent token, whereas token values are
mandatory. Also, token values are pushed onto and retrieved from the
parser stack in order to be passed to semantic actions as $n values.
IMO, having token values as (start . end) elements would make wisent
dependent on a buffer as input stream, and would probably increase the
complexity of semantic actions (and slow down parsing?), which would
have to extract token values from buffer bounds.

For example, the LL parser itself handles the extraction of token
values (in `semantic-bovinate-stream'), before calling semantic
actions. IMO it is a better design to only have the lexical analyzer
depend on the nature of the input source.

Finally, the fact that the wisent lexer is only called when the parser
needs a new token guarantees that token values will be obtained only
once, when necessary.

In conclusion, I am not convinced that the current design of
semantic-lex should be changed. That would neither simplify nor likely
speed up LALR parsing. On the contrary, I would change the design of
the LL parser to be closer to wisent's.
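To make the "on demand" idea concrete, here is a minimal sketch (with
hypothetical names and a hypothetical end-of-input marker, not actual
CEDET or Wisent code) of a pull lexer that is independent of the input
source and hands out each token exactly once:

```elisp
;; -*- lexical-binding: t -*-
;; Sketch only: an "on demand" lexer built as a closure over a token
;; list.  The token list could come from `semantic-lex' on a buffer,
;; or from any other source, e.g. a string already cut into tokens.
(defun my-make-pull-lexer (tokens)
  "Return a zero-argument lexer closing over TOKENS.
Each call pops one token in Wisent's (CLASS VALUE START . END) form."
  (lambda ()
    (if tokens
        (pop tokens)
      '(end-of-input))))        ; hypothetical end-of-input marker

;; The parser calls the closure each time it needs a token:
(let ((lex (my-make-pull-lexer '((NUMBER "1" 1 . 2)
                                 (PUNCTUATION "+" 2 . 3)
                                 (NUMBER "2" 3 . 4)))))
  (funcall lex))  ; => (NUMBER "1" 1 . 2)
```

Since the closure owns the remaining token list, no caller ever sees a
token twice, which matches the "obtained only once" property of
wisent-lex.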
The goal would be to have a common layer for buffer oriented lexical
analysis (semantic-lex), then an "on demand" lexer layer (a la
wisent-lex) that would handle token conversion from a buffer
representation into any form suitable for the target parser. IMO,
that wouldn't have a big impact on performance, and it would make it
possible to use the bovinator to parse other kinds of input sources.

David