Thread: Re: [CEDET-devel] generic lexical analyzers
From: David P. <dav...@wa...> - 2003-12-10 11:10:56
Hi Eric,

[...]

> That's a pretty interesting idea. Your extraction and use of the
> existing analyzer is quite clever. I had asked about the API layers
> in a previous email. It seems that the derived lexical analyzer is
> still a part of the core lexical API as opposed to some intermediate
> layer. That's probably fine. There seem to be a lot of lexically
> generated tables and code already.
>
> In your code:
>
>> ;; Search for a matching lexical token
>> (while (and ,lst (not ,elt))
>>   (setq ,elt (and (string-match (cdar ,lst) ,val) (caar ,lst))
>>         ,lst (cdr ,lst)))
>
> would an obarray or hash table be better? The keyword table is
> quite successful. I know that in your sample you are trying to match
> "^$" as VAR. That feature is important, but I think that explicit
> string matches are more common and could be made faster for the
> punctuation types. Something separate for symbols and lists may be
> in order.

You're right. That's funny, because I already implemented a similar
solution in the old `wisent-flex' lexer. Perhaps we could use the
same approach here.

To distinguish between string and regexp matches, `wisent-flex' used
properties of symbols in the token table (which is an obarray of the
token type symbols). By default certain token types, like
punctuation, were set up to use string matches (this is the purpose
of `wisent-lex-make-token-table' compared to the stock
`semantic-lex-make-type-table'; but it would be easy to do that in
`semantic-lex-make-type-table' and remove
`wisent-lex-make-token-table').

The advantage of that design is its simplicity, and especially that
it allows customization using grammar %PUT statements. For example
you could have:

%token <punctuation> COMMA ","
%token <punctuation> EQ "="

By default it is assumed that there is an implicit

%put punctuation string t

which, for speed, indicates to recognize punctuation using string
matches (a la `semantic-lex-punctuation-type').
But you could also have something like this:

%token <punctuation> COMPARATOR "[<>][=]?"
%put punctuation string nil

which indicates to use regexp matches to recognize punctuation.
Depending on the `string' property of the token type symbol, it
should be easy for `define-derived-lex-type-analyzer' to generate the
ad-hoc match algorithm.

> Question: Couldn't there now be several default analyzers for each
> of the major lexical types, like punctuation? It appears that when
> the macro-generated analyzer runs, it is important for the lexical
> table of types to be active. Thus, a macro generated with
> `define-derived-lex-type-analyzer' could be used in any language.

That's my goal ;-) And that's why I call such analyzers "generic":
they don't depend on the language, but on what the current syntax and
token tables provide.

> Also, it appears this would not work for compound tokens like "=>",
> as this analyzer would only work in character groups defined by the
> originating analyzer. Is this assumption true?

I don't think so. The "string matches" algorithm used in
`semantic-lex-punctuation-type' is particularly well adapted to
matching compound punctuation ;-)

> Anyway, I think it looks good. Please check it in. Thanks.

OK, I just prefer to wait a little for your feedback on my proposal
above about the use of properties to drive the matching method.

David
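[Editor's note] The property-driven dispatch David describes — an implicit `%put punctuation string t' selecting fast exact-string matches, and `string nil' switching a type over to regexp matches — can be sketched outside Emacs Lisp. The following Python sketch is purely illustrative: the table layout and all names are assumptions for exposition, not the actual Semantic API.

```python
import re

# Hypothetical type table: type name -> (use-string-matches?, alist of
# (TOKEN, match-value)), mirroring properties set via %put statements.
TYPE_TABLE = {
    "punctuation": (True, [("COMMA", ","), ("EQ", "=")]),   # %put punctuation string t
    "comparator": (False, [("COMPARATOR", r"[<>][=]?")]),   # %put ... string nil
}

def match_type(type_name, text):
    """Return the specific token for TEXT, or None if nothing matches."""
    string_mode, alist = TYPE_TABLE[type_name]
    for token, value in alist:
        if string_mode:
            # String mode: exact comparison, fast for punctuation.
            if text == value:
                return token
        # Regexp mode: needed for open-ended classes like comparators.
        elif re.search(value, text):
            return token
    return None
```

For instance, `match_type("punctuation", ",")` hits the exact-string branch, while `match_type("comparator", "<=")` falls through to the regexp branch — the same split the `string' property would drive.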
From: David P. <dav...@wa...> - 2003-12-11 11:09:36
Hi Eric,

[...]

> That seems like a really good idea. Changing properties of lexical
> symbols is what the %put command is all about.

[...]

>> %put punctuation string nil
>
> Perhaps you mean:
>
> %put COMPARATOR string nil
>
> ?

I really meant "%put punctuation ..."

In fact, what I call the lexical token table is actually a lexical
type table, that is, an obarray of lexical type symbols which have
token definitions as values. For example:

%token <punctuation> COMMA ","
%token <punctuation> SEMI ";"

declare the `punctuation' symbol in the table of token types, and
give it the value:

(nil (COMMA . ",") (SEMI . ";"))

The first element represents a default token value returned when a
punctuation doesn't match any of the values supplied in the alist
part. It is declared like this:

%token <punctuation> OTHERPUNCT ;; Notice there is no value

and the `punctuation' symbol value then will be:

(OTHERPUNCT (COMMA . ",") (SEMI . ";"))

When a default value is not specified (nil), the analyzer should
return the lexical type symbol (in this example: `punctuation') as
the default token.

From a certain point of view, the table of lexical types can be
viewed as an extension of Emacs syntax tables, where lexical types
match Emacs syntax classes (punctuation, symbol, semantic-list,
comment, string, etc.).

IMO, for general purpose lexical tokens, that organization is more
flexible than the one used for lexical keywords, which are unique by
definition (a direct match between a value and a token). For general
purpose tokens obtained through regexp/string matching, a full
"hash-table" approach does not seem practical.

Also, the notion of semantic lexical types matches well with the
notion of token data types used by bison. That could facilitate
things ;-)

[...]

> %put THING string nil
>
> does not say "regexp" to me. Perhaps something like this:
>
> %put THING lexicalcomparetype string
>
> or
>
> %put THING matchdatatype regexp
>
> would be better?

I agree that `string' is not a very good name.
In my idea it designated the data type of token values (`string'
versus `regexp'). I prefer `matchdatatype', which is closer to my
initial idea. What do you think of:

%put lexical-type :matchdatatype string
%put lexical-type :matchdatatype regexp   (would be the default)

Prefixing the property with a colon would be nice for syntax
highlighting ;-) (It also needs a minor fix in semantic-grammar.el
that I will check in soon.)

[...]

> I thought the entire raw lexical stream was compounded by the
> wisent-lex layer. If you use the default punctuation analyzer, it
> will only ever match a single character. You would need to extend
> a different punctuation system that knows to combine => but not
> other symbols that make no sense, like >=.

Very good point. Probably, to get `define-derived-lex-type-analyzer'
to work with punctuation, we would need an alternate syntax analyzer
that grabs a succession of punctuation characters. Something like:

(define-lex-simple-regex-analyzer semantic-lex-compound-punctuation
  "Detect and create compound punctuation tokens."
  "\\(\\s.\\|\\s$\\|\\s'\\)+" 'punctuation)

Otherwise it remains possible to directly use
`semantic-lex-punctuation-type', which is fine at handling compound
punctuation.

> I like the direction your proposed function is going. Very nice.

Thanks!

David
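[Editor's note] The type-table shape David describes — a default token followed by an alist of specific values, with the lexical type symbol itself as last resort — reduces to a small lookup. A hedged Python sketch follows; the data layout mirrors the Lisp value `(OTHERPUNCT (COMMA . ",") (SEMI . ";"))', and the function name is illustrative.

```python
# Mirrors the Lisp value (OTHERPUNCT (COMMA . ",") (SEMI . ";")):
# first element is the default token (or None), the rest is the alist.
PUNCTUATION = ("OTHERPUNCT", [("COMMA", ","), ("SEMI", ";")])

def resolve(type_name, entry, text):
    """Resolve TEXT to a specific token, the declared default token,
    or, failing both, the lexical type symbol itself."""
    default, alist = entry
    for token, value in alist:
        if text == value:
            return token
    # No specific match: declared default, else the type symbol.
    return default if default is not None else type_name
```

With no OTHERPUNCT declared the entry would be `(None, [...])`, and `resolve` returns "punctuation" itself — the fallback behavior described in the message above.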
From: Eric M. L. <er...@si...> - 2003-12-11 14:20:57
>>> David PONCE <dav...@wa...> seems to think that:
> Hi Eric,
>
> [...]
>> That seems like a really good idea. Changing properties of lexical
>> symbols is what the %put command is all about.
> [...]
>>> %put punctuation string nil
>>
>> Perhaps you mean:
>>
>> %put COMPARATOR string nil
>>
>> ?
>
> I really meant "%put punctuation ..."
>
> In fact what I call the lexical token table is actually a lexical
> type table, that is an obarray of lexical type symbols which have
> token definitions as values.
>
> For example:
>
> %token <punctuation> COMMA ","
> %token <punctuation> SEMI ";"
>
> declare the `punctuation' symbol in the table of token types, and
> give it the value:
>
> (nil (COMMA . ",") (SEMI . ";"))
>
> The first element represents a default token value returned when a
> punctuation doesn't match any of the values supplied in the alist
> part. It is declared like this:
>
> %token <punctuation> OTHERPUNCT ;; Notice there is no value
>
> and the `punctuation' symbol value then will be:
>
> (OTHERPUNCT (COMMA . ",") (SEMI . ";"))
>
> When a default value is not specified (nil), the analyzer should
> return the lexical type symbol (in this example: `punctuation') as
> the default token.

Aha. It seemed a bit odd to me that you would be %PUTing something
onto a symbol that was not declared with %token, but that makes
sense. Explaining in the doc that the 'matchdatatype' property only
affects this special token (which is often implied) as a means for
identifying all other tokens in that class seems a bit convoluted.
Your explanation here makes sense to me, but I was confused at first.

%put is the right way to do it IMHO, but perhaps there is a way that
is more consistent.

> From a certain point of view the table of lexical types can be
> viewed as an extension of Emacs syntax tables, where lexical types
> match Emacs syntax classes (punctuation, symbol, semantic-list,
> comment, string, etc.).
Indeed, I've often thought that a C-level lexical analyzer, built
into Emacs as a command, would be very nice. As it stands, our
lexical analyzer is pretty fast though.

> IMO, for general purpose lexical tokens, that organization is more
> flexible than the one used for lexical keywords, which are unique
> by definition (a direct match between a value and a token). For
> general purpose tokens obtained through regexp/string matching, a
> full "hash-table" approach does not seem practical.
>
> Also, the notion of semantic lexical types matches well with the
> notion of token data types used by bison. That could facilitate
> things ;-)

I agree. Consistency with bison helps with the learning of the new
system.

> [...]
>> %put THING string nil
>>
>> does not say "regexp" to me. Perhaps something like this:
>>
>> %put THING lexicalcomparetype string
>>
>> or
>>
>> %put THING matchdatatype regexp
>>
>> would be better?
>
> I agree that `string' is not a very good name. In my idea it
> designated the data type of token values (`string' versus
> `regexp'). I prefer `matchdatatype', which is closer to my initial
> idea. What do you think of:
>
> %put lexical-type :matchdatatype string
> %put lexical-type :matchdatatype regexp (would be the default)

"matchdatatype" seems like a good word to me, unless compiler manuals
use some other term for when text in a stream is matched lexically,
or some other term for the type or style of the match. I do not
recall any such term from my days as a compiler writer.

> Prefixing the property with a colon would be nice for syntax
> highlighting ;-) (It also needs a minor fix in semantic-grammar.el
> that I will check in soon.)

The colon is a rather important operator / syntax element in the
metagrammar. It seems a bit odd to use it as a symbol constituent.
Of course, we already have a dual-language grammar mixing Emacs Lisp
and LALR, so doing so depends on which language this element belongs
to.
If the property in question is used mostly as a "slot" or field in
Emacs Lisp, the colon is standard. If it is a variable that is
relevant to the grammar itself, then $ seems like a more reasonable
prefix. Another symbol prefix we've used is %, but that's for grammar
declaration functions. Lastly, a naked symbol for other things.

Hmmm, perhaps I convinced myself that : is a good prefix.

> [...]
>> I thought the entire raw lexical stream was compounded by the
>> wisent-lex layer. If you use the default punctuation analyzer, it
>> will only ever match a single character. You would need to extend
>> a different punctuation system that knows to combine => but not
>> other symbols that make no sense, like >=.
>
> Very good point. Probably, to get
> `define-derived-lex-type-analyzer' to work with punctuation, we
> would need an alternate syntax analyzer that grabs a succession of
> punctuation characters. Something like:
>
> (define-lex-simple-regex-analyzer semantic-lex-compound-punctuation
>   "Detect and create compound punctuation tokens."
>   "\\(\\s.\\|\\s$\\|\\s'\\)+" 'punctuation)
>
> Otherwise it remains possible to directly use
> `semantic-lex-punctuation-type', which is fine at handling compound
> punctuation.

It seems reasonable to me to have our default lexical analyzer match
a sequence of punctuation, and call `semantic-lex-push-token'
multiple times as needed. Perhaps that would even be faster than the
current one.

Eric

--
Eric Ludlam: za...@gn..., er...@si...
Home: http://www.ludlam.net            Siege: www.siege-engine.com
Emacs: http://cedet.sourceforge.net    GNU: www.gnu.org
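[Editor's note] Eric's closing suggestion — match a whole run of punctuation, then push several tokens so that `=>' stays compound while an undeclared `>=' splits apart — amounts to a greedy longest-match split. The sketch below is illustrative only: the character set stands in for the Emacs punctuation syntax classes, and `DECLARED` for values coming from %token declarations.

```python
PUNCT_CHARS = set(".,;:!?<>=+-*/&|^%~")   # stand-in for Emacs `\s.'-style classes
DECLARED = {"=>", "=", ">", "+", "+="}    # stand-in for %token-declared values
MAXLEN = max(len(v) for v in DECLARED)

def scan_punctuation(buf, pos):
    """Return the punctuation run starting at POS, or '' if none."""
    end = pos
    while end < len(buf) and buf[end] in PUNCT_CHARS:
        end += 1
    return buf[pos:end]

def split_run(run):
    """Split RUN into declared tokens, longest match first, so '=>'
    stays compound while '>=' (undeclared) splits into '>' '='."""
    tokens, i = [], 0
    while i < len(run):
        for size in range(min(MAXLEN, len(run) - i), 0, -1):
            if run[i:i + size] in DECLARED:
                tokens.append(run[i:i + size])
                i += size
                break
        else:
            tokens.append(run[i])  # unknown character: emit it alone
            i += 1
    return tokens
```

Each element of the result would correspond to one `semantic-lex-push-token' call in Eric's proposal.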
From: David P. <dav...@wa...> - 2003-12-12 08:29:47
Hi Eric,

[...]

> Explaining in the doc that the 'matchdatatype' property only affects
> this special token (which is often implied) as a means for
> identifying all other tokens in that class seems a bit convoluted.
> Your explanation here makes sense to me, but I was confused at
> first.
>
> %put is the right way to do it IMHO, but perhaps there is a way
> that is more consistent.

I recognize there is an ambiguity in the behavior of the %put
statement, which differs between keywords and tokens. IMO, that
reflects another ambiguity related to the use of %token to declare
both keywords and general purpose tokens:

%token IF                  -> token
%token <symbol> IF "if"    -> token
%token IF "if"             -> keyword!

Bison grammars, for example, don't suffer from such an ambiguity,
because there is no difference between keywords and other tokens.
Only the lexer knows the difference, and it has its own input
grammar. I don't think Semantic would benefit from separating the
lexical grammar from the syntactic one. What I propose is to
introduce (and encourage the use of) a new `%keyword' statement to
declare language keywords:

%keyword IF "if"

It would be a simple alias of the form %token IF "if" (for
compatibility), but it would be far less ambiguous. And the
semantics of the %put statement would be clearer:

%keyword IF "if"
%put IF property value

%token <symbol> ID "[a-zA-Z0-9]+"
%put symbol property value

By tweaking the metagrammar a little, it should even be possible to
allow less ambiguous forms, like:

%put <symbol> matchdatatype regexp
%put { <punctuation> <open-paren> } matchdatatype string

> "matchdatatype" seems like a good word to me, unless compiler
> manuals use some other term for when text in a stream is matched
> lexically, or some other term for the type or style of the match.
>
> I do not recall any such term from my days as a compiler writer.
I do not recall either, so I adopt "matchdatatype" ;-)

> The colon is a rather important operator / syntax element in the
> metagrammar. It seems a bit odd to use it as a symbol constituent.
>
> Of course, we already have a dual-language grammar mixing Emacs
> Lisp and LALR, so doing so depends on which language this element
> belongs to.
>
> If the property in question is used mostly as a "slot" or field in
> Emacs Lisp, the colon is standard. If it is a variable that is
> relevant to the grammar itself, then $ seems like a more reasonable
> prefix. Another symbol prefix we've used is %, but that's for
> grammar declaration functions. Lastly, a naked symbol for other
> things.
>
> Hmmm, perhaps I convinced myself that : is a good prefix.

I just used the : prefix because we already use it for built-in
attributes/properties in tags, and because keywords prefixed with :
are nicely highlighted. However, I don't have any problem with using
`matchdatatype' if you prefer ;-)

FYI, the lexer can easily differentiate the colon used as a symbol
prefix from the colon used as punctuation (I already made the
necessary small change in semantic-grammar.el). And after all, Emacs
itself has the notion of colon-prefixed keywords ;-)

[...]

> It seems reasonable to me to have our default lexical analyzer
> match a sequence of punctuation, and call `semantic-lex-push-token'
> multiple times as needed. Perhaps that would even be faster than
> the current one.

Even if we do what you propose, I think a simple compound analyzer
that aggregates punctuation would be simpler to handle for
`define-derived-lex-type-analyzer', which then wouldn't have to
re-aggregate punctuation characters from the token stream before
trying to match them against specific lexical token values.

Thanks!

David
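[Editor's note] The %token/%keyword ambiguity laid out above is mechanical enough to express as a tiny classifier. A sketch under stated assumptions — the tuple encoding of a declaration is invented for illustration, not part of any grammar tool:

```python
def classify(type_, name, value):
    """Classify a declaration as 'keyword' or 'token', following the
    rule David describes: a %token with a value but no <type> part is
    really a keyword.

      %token IF               -> token
      %token <symbol> IF "if" -> token
      %token IF "if"          -> keyword!
    """
    if type_ is None and value is not None:
        return "keyword"
    return "token"
```

With an explicit %keyword statement, the third case would no longer need this special-casing: `%keyword IF "if"` says what it means.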
From: Eric M. L. <er...@si...> - 2003-12-14 15:00:05
>>> David PONCE <dav...@wa...> seems to think that:
> Hi Eric,
>
> [...]
>> Explaining in the doc that the 'matchdatatype' property only
>> affects this special token (which is often implied) as a means for
>> identifying all other tokens in that class seems a bit convoluted.
>> Your explanation here makes sense to me, but I was confused at
>> first.
>>
>> %put is the right way to do it IMHO, but perhaps there is a way
>> that is more consistent.
>
> I recognize there is an ambiguity in the behavior of the %put
> statement, which differs between keywords and tokens.
>
> IMO, that reflects another ambiguity related to the use of %token
> to declare both keywords and general purpose tokens:
>
> %token IF                  -> token
> %token <symbol> IF "if"    -> token
> %token IF "if"             -> keyword!

Heh, yes, that is why I hope to avoid adding more ambiguities. ;)

> Bison grammars, for example, don't suffer from such an ambiguity,
> because there is no difference between keywords and other tokens.
> Only the lexer knows the difference, and it has its own input
> grammar.
>
> I don't think Semantic would benefit from separating the lexical
> grammar from the syntactic one. What I propose is to introduce (and
> encourage the use of) a new `%keyword' statement to declare
> language keywords:
>
> %keyword IF "if"

What a grand idea!

> It would be a simple alias of the form %token IF "if" (for
> compatibility), but it would be far less ambiguous. And the
> semantics of the %put statement would be clearer:
>
> %keyword IF "if"
> %put IF property value
>
> %token <symbol> ID "[a-zA-Z0-9]+"
> %put symbol property value
>
> By tweaking the metagrammar a little, it should even be possible to
> allow less ambiguous forms, like:
>
> %put <symbol> matchdatatype regexp
> %put { <punctuation> <open-paren> } matchdatatype string

Even now I scratch my head. I feel the need to attempt a
nomenclature clarification.
token - Something produced by the lexer.

metatoken - Something produced by compounding the output of the
lexer, but not produced by the grammar.

keyword - A token made of symbol characters that represents an exact
textual match.

syntaxclass - A token produced by the lexer that represents a syntax
class, such as <punctuation>.

metakeyword - A token made of characters from a syntaxclass that is
not a keyword, but is more specific than a syntax class.

matchdatatype - A description of how a syntaxclass is matched against
the raw data to produce a keyword or metakeyword.

property - A named value associated with a lexical token.

Ahh, that's better. Since keyword, syntaxclass, and metakeyword are
all tokens, it is OK to %put properties on them. I guess it is OK to
use the %token command to declare them as well. Some better names
might be in order though.

[ ... ]

>> The colon is a rather important operator / syntax element in the
>> metagrammar. It seems a bit odd to use it as a symbol constituent.
>>
>> Of course, we already have a dual-language grammar mixing Emacs
>> Lisp and LALR, so doing so depends on which language this element
>> belongs to.
>>
>> If the property in question is used mostly as a "slot" or field in
>> Emacs Lisp, the colon is standard. If it is a variable that is
>> relevant to the grammar itself, then $ seems like a more
>> reasonable prefix. Another symbol prefix we've used is %, but
>> that's for grammar declaration functions. Lastly, a naked symbol
>> for other things.
>>
>> Hmmm, perhaps I convinced myself that : is a good prefix.
>
> I just used the : prefix because we already use it for built-in
> attributes/properties in tags, and because keywords prefixed with :
> are nicely highlighted. However, I don't have any problem with
> using `matchdatatype' if you prefer ;-)
>
> FYI, the lexer can easily differentiate the colon used as a symbol
> prefix from the colon used as punctuation (I already made the
> necessary small change in semantic-grammar.el).
> And after all, Emacs itself has the notion of colon-prefixed
> keywords ;-)

It just looked a bit odd at first, but now I agree that the :colon
based property names are fine. It may be worth changing the summary
property to :summary too, unless we want to differentiate parser
functionality properties from application properties.

> [...]
>> It seems reasonable to me to have our default lexical analyzer
>> match a sequence of punctuation, and call
>> `semantic-lex-push-token' multiple times as needed. Perhaps that
>> would even be faster than the current one.
>
> Even if we do what you propose, I think a simple compound analyzer
> that aggregates punctuation would be simpler to handle for
> `define-derived-lex-type-analyzer', which then wouldn't have to
> re-aggregate punctuation characters from the token stream before
> trying to match them against specific lexical token values.

[ ... ]

We should do whatever makes it easiest for a new person to make their
grammar work. I suspect new grammar writers are more interested in
their grammar than in their lexical analyzer. ;)

>>> David PONCE <dav...@wa...> seems to think that:
> Eric,
>
> Here is the new implementation (not yet tested) of
> `define-derived-lex-type-analyzer' that takes into account the
> `matchdatatype' property of the token lexical type.
>
> Of course, I would appreciate your feedback very much ;-)

[ ... ]

> (defun semantic--lex-type-refinement-form ()
>   "Return a form to refine the type of the last token found.
> At this point, the last token found is on top of the lexical stream.
>
> Refinement is based on more specific token definitions provided in
> the current lexical token table for the refined type.
>
> If the value of the refined token matches any of the more specific
> values, the corresponding specific token replaces the initial one
> on top of the lexical stream.
>
> When the `matchdatatype' property of the refined type is the symbol
> `string', the refined token value is compared with `equal' to each
> specific token value.
> Otherwise `string-match' is used."
>   (let* ((tok (make-symbol "tok"))
>          (typ (make-symbol "typ"))
>          (val (make-symbol "val"))
>          (lst (make-symbol "lst"))
>          (def (make-symbol "def"))
>          (elt (make-symbol "elt"))
>          (pos (make-symbol "pos"))
>          (end (make-symbol "end"))
>          (len (make-symbol "len")))
>     `(let* ((,tok (car semantic-lex-token-stream))
>             (,typ (semantic-lex-token-class ,tok))
>             (,val (semantic-lex-token-text ,tok))
>             (,lst (semantic-lex-type-value (symbol-name ,typ) t))
>             (,def (car ,lst)) ;; default lexical token or nil
>             (,lst (cdr ,lst)) ;; alist of (TOKEN . MATCH-STRING)
>             ,elt)
>        (when ,lst
>          ;; Search for a matching lexical token

[ ... ]

If I understand this code correctly, the goal is to take a token
stream such as (in simplified form): ("=" "+" ...) and convert it
into (PLUSEQUAL ...) or some such? Perhaps the code should be
organized as such:

(let ((alltokensofsameclass (fancy code)))
  (when (> (length alltokensofsameclass) 1)
    ;; do stuff
    ))

to cut back on the amount of functional execution done before
deciding that, nope, there is nothing to do here. It could simplify
the inner loops as well.

> (defmacro define-derived-lex-type-analyzer (name analyzer &optional doc)
>   "Define a generic type analyzer with NAME, derived from ANALYZER.
> ANALYZER must be the name of a previously defined lexical analyzer.
> Optional argument DOC is the new analyzer doc string.
>
> The generic type analyzer NAME will filter tokens produced by
> ANALYZER, based on values found in the current table of lexical
> tokens for the type of tokens returned by ANALYZER, to return a
> more specific lexical token.
> For example, to detect the lexical tokens corresponding to these
> grammar declarations of keywords and symbols:
>
>   %token IF \"if\"             ; keyword 'if'
>   %token THEN \"then\"         ; keyword 'then'
>   %token <symbol> ID           ; default lexical symbol
>   %token <symbol> VAR \"^[$]\" ; variable names start with $
>
> Define a generic type analyzer derived from the basic analyzer
> `semantic-lex-symbol-or-keyword':
>
>   (define-derived-lex-type-analyzer semantic-lex-keyword-or-symbol-type
>     semantic-lex-symbol-or-keyword)
>
> From this sample input stream:
>
>   if $val then result = $val

Perhaps your example could also be:

  if $val then result += $val

as a way of adding a compound punctuation to the mix?

> It will automatically detect and return the following lexical
> tokens:
>
>   (IF 1 . 3)     ; the keyword IF
>   (VAR 4 . 8)    ; a dollar variable
>   (THEN 9 . 13)  ; the keyword THEN
>   (ID 14 . 20)   ; a generic identifier

Are you missing the (EQUAL 21 . 22) here?

>   (VAR 23 . 27)  ; a dollar variable"
>   (let ((code (symbol-value analyzer)))
>     `(define-lex-analyzer ,name
>        ,doc
>        ,(car code)
>        ,@(cdr code)
>        ,(semantic--lex-type-refinement-form)
>        )))

It appears that the refinement form runs after every token of a given
syntax class is found. I suspect that nearly all analyzers will
eventually do this, except perhaps whitespace and comments. Do you
think it would make sense to have the refinement occur at the end of
every pass through the lexical analyzer? Positioning it as such could
allow for some good heuristics for not running the refinement step.

I may be off a bit; I'm not sure I have a complete understanding yet.

Thanks!
Eric

--
Eric Ludlam: za...@gn..., er...@si...
Home: http://www.ludlam.net            Siege: www.siege-engine.com
Emacs: http://cedet.sourceforge.net    GNU: www.gnu.org
From: David P. <dav...@wa...> - 2003-12-12 10:24:36
Eric,

Here is the new implementation (not yet tested) of
`define-derived-lex-type-analyzer' that takes into account the
`matchdatatype' property of the token lexical type.

Of course, I would appreciate your feedback very much ;-)

David

(defun semantic--lex-type-refinement-form ()
  "Return a form to refine the type of the last token found.
At this point, the last token found is on top of the lexical stream.

Refinement is based on more specific token definitions provided in the
current lexical token table for the refined type.

If the value of the refined token matches any of the more specific
values, the corresponding specific token replaces the initial one
on top of the lexical stream.

When the `matchdatatype' property of the refined type is the symbol
`string', the refined token value is compared with `equal' to each
specific token value.  Otherwise `string-match' is used."
  (let* ((tok (make-symbol "tok"))
         (typ (make-symbol "typ"))
         (val (make-symbol "val"))
         (lst (make-symbol "lst"))
         (def (make-symbol "def"))
         (elt (make-symbol "elt"))
         (pos (make-symbol "pos"))
         (end (make-symbol "end"))
         (len (make-symbol "len")))
    `(let* ((,tok (car semantic-lex-token-stream))
            (,typ (semantic-lex-token-class ,tok))
            (,val (semantic-lex-token-text ,tok))
            (,lst (semantic-lex-type-value (symbol-name ,typ) t))
            (,def (car ,lst)) ;; default lexical token or nil
            (,lst (cdr ,lst)) ;; alist of (TOKEN . MATCH-STRING)
            ,elt)
       (when ,lst
         ;; Search for a matching lexical token
         (if (eq 'string (semantic-lex-type-get ,typ 'matchdatatype t))
             ;; Use string comparisons
             (let* ((,pos (semantic-lex-token-start ,tok))
                    (,end (semantic-lex-token-end ,tok))
                    (,len (- ,end ,pos)))
               ;; Starting with the longest one, search if a lexical
               ;; value matches a token defined for this language.
               (while (and (> ,len 0)
                           (not (setq ,elt (car (rassoc ,val ,lst)))))
                 (setq ,len (1- ,len)
                       ,val (substring ,val 0 ,len)))
               (when ,elt
                 ;; Adjust the stream and token end position
                 (setq semantic-lex-end-point (+ ,pos ,len))
                 ;;;; Probably it would be better to have an API to
                 ;;;; modify a lexical token by side effect.
                 (setcdr (semantic-lex-token-bounds ,tok)
                         semantic-lex-end-point)))
           ;; Use regexp match
           (while (and ,lst (not ,elt))
             (setq ,elt (and (string-match (cdar ,lst) ,val) (caar ,lst))
                   ,lst (cdr ,lst)))))
       ;; If not found, use a default lexical token if
       ;; provided, or the initial token type otherwise.
       ;;;; Probably it would be better to have an API to
       ;;;; modify a lexical token by side effect.
       (setcar ,tok (or ,elt ,def ,typ)))))

(defmacro define-derived-lex-type-analyzer (name analyzer &optional doc)
  "Define a generic type analyzer with NAME, derived from ANALYZER.
ANALYZER must be the name of a previously defined lexical analyzer.
Optional argument DOC is the new analyzer doc string.

The generic type analyzer NAME will filter tokens produced by
ANALYZER, based on values found in the current table of lexical tokens
for the type of tokens returned by ANALYZER, to return a more specific
lexical token.

For example, to detect the lexical tokens corresponding to these
grammar declarations of keywords and symbols:

  %token IF \"if\"             ; keyword 'if'
  %token THEN \"then\"         ; keyword 'then'
  %token <symbol> ID           ; default lexical symbol
  %token <symbol> VAR \"^[$]\" ; variable names start with $

Define a generic type analyzer derived from the basic analyzer
`semantic-lex-symbol-or-keyword':

  (define-derived-lex-type-analyzer semantic-lex-keyword-or-symbol-type
    semantic-lex-symbol-or-keyword)

From this sample input stream:

  if $val then result = $val

It will automatically detect and return the following lexical tokens:

  (IF 1 . 3)     ; the keyword IF
  (VAR 4 . 8)    ; a dollar variable
  (THEN 9 . 13)  ; the keyword THEN
  (ID 14 . 20)   ; a generic identifier
  (VAR 23 . 27)  ; a dollar variable"
  (let ((code (symbol-value analyzer)))
    `(define-lex-analyzer ,name
       ,doc
       ,(car code)
       ,@(cdr code)
       ,(semantic--lex-type-refinement-form)
       )))
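[Editor's note] The refinement step in the code above boils down to: take the last token, fetch its type's (default, alist) entry, then either shrink the lexeme until an exact string match is found (adjusting the token's end position) or try each regexp; finally fall back to the default token or the type itself. A Python sketch of that logic, with an invented token/table representation for illustration:

```python
import re

def refine(token, type_table):
    """Refine TOKEN (a list [type, text, start, end]) in place,
    mirroring semantic--lex-type-refinement-form: try the type's
    specific values, else the declared default, else the type itself.
    TYPE_TABLE maps a type name to (matchdatatype, default, alist)."""
    type_, text, start, end = token
    matchtype, default, alist = type_table[type_]
    found = None
    if matchtype == "string":
        # Starting with the longest prefix, look for an exact match,
        # shrinking the lexeme and adjusting the end position.
        length = end - start
        while length > 0 and found is None:
            for tok, value in alist:
                if text[:length] == value:
                    found = tok
                    break
            if found is None:
                length -= 1
        if found is not None:
            end = start + length
            text = text[:length]
    else:
        # Regexp mode: first pattern that matches wins.
        for tok, value in alist:
            if re.search(value, text):
                found = tok
                break
    token[:] = [found or default or type_, text, start, end]
    return token
```

Note how the string branch can shorten the token ("+=;" refines to PLUSEQ over just "+="), which is the role of the `setcdr' on the token bounds in the Lisp version.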
From: David P. <dav...@wa...> - 2003-12-15 11:31:37
Hi Eric,

[...]

>> What I propose is to introduce (and encourage the use of) a new
>> `%keyword' statement to declare language keywords:
>>
>> %keyword IF "if"
>
> What a grand idea!
>
>> It would be a simple alias of the form %token IF "if" (for
>> compatibility), but it would be far less ambiguous. And the
>> semantics of the %put statement would be clearer:
>>
>> %keyword IF "if"
>> %put IF property value
>>
>> %token <symbol> ID "[a-zA-Z0-9]+"
>> %put symbol property value
>>
>> By tweaking the metagrammar a little, it should even be possible
>> to allow less ambiguous forms, like:
>>
>> %put <symbol> matchdatatype regexp
>> %put { <punctuation> <open-paren> } matchdatatype string

I updated semantic-grammar.wy (and regenerated semantic-grammar-wy.el)
to introduce the new %keyword statement and allow <type> forms in %put
statements. If you have no objection, I will check in the changes.
Here is the change log:

2003-12-12  David Ponce  <da...@dp...>

	(NOT YET COMMITTED)

	* cedet/semantic/semantic-grammar.el
	(semantic-grammar-lex-symbol): Accept colon as symbol prefix.
	(semantic-grammar-anchored-indentation): Take care of colon used
	as symbol prefix.

	* cedet/semantic/semantic-grammar.wy
	Introduce new %keyword statement to declare keywords.
	Allow use of <token-type> forms in %put statements.
	(use_names): Add to start symbols.
	(KEYWORD): New keyword.
	(DEFAULT-PREC, NO-DEFAULT-PREC, KEYWORD, LANGUAGEMODE)
	(LEFT, NONASSOC, PACKAGE, PREC, PUT, QUOTEMODE, RIGHT)
	(SCOPESTART, START, TOKEN, USE-MACROS): Declare with %keyword
	instead of %token.
	(KEYWORDTABLE, OUTPUTFILE, PARSETABLE, SETUPFUNCTION)
	(TOKENTABLE): Remove.  Obsolete.
	(decl): Add keyword_decl rule.  Remove obsolete rules.
	(put_decl): Use put_name rule instead of SYMBOL.
	(put_names): Likewise.
	(put_name, keyword_decl, use_names): New rules.
	(use_name_list): New rule.
	(use_macros_decl): Use it.
	(keywordtable_decl, outputfile_decl, parsetable_decl)
	(setupfunction_decl, tokentable_decl): Remove.  Obsolete.
	* cedet/semantic/semantic-grammar-wy.el
	Re-generated.

> Even now I scratch my head. I feel the need to attempt a
> nomenclature clarification.

Good idea!

> token - Something produced by the lexer.

OK.

> metatoken - Something produced by compounding the output of the
> lexer, but not produced by the grammar.

`define-derived-lex-type-analyzer' doesn't really compound the
output of the lexer. It introduces a more subtle matching algorithm
(based on information provided in the grammar), to derive ONE
(probably syntax-class-oriented) token into another token. The true
added value of `define-derived-lex-type-analyzer' is to take
advantage of lexical declarations provided by the grammar.

For example, `semantic-lex-symbol-or-keyword' produces `symbol'
tokens from a stream of characters which are symbol constituents
(Emacs syntax classes \sw and \s_). It doesn't use grammar
declarations. Using `define-derived-lex-type-analyzer' brings
grammar information to `semantic-lex-symbol-or-keyword', so it can
analyze symbols in a more subtle manner. For example, it can
distinguish between general purpose identifiers and special ones
with a dollar prefix, based on the grammar:

%token <symbol> DOLLARID "^[$]"
%token <symbol> IDENTIFIER

In all cases, it remains possible to achieve the same result with
hand-made analyzers (this is what is done for now). The issue is that
the developer is then responsible (and so can fail) for keeping those
hand-made analyzers consistent with the grammar.

> keyword - A token made of symbol characters that represents an
> exact textual match.

OK.

> syntaxclass - A token produced by the lexer that represents a
> syntax class, such as <punctuation>.

IMO, this notion is more generally a token type or token category,
close to Bison's notion of a token data type. In certain cases token
types correspond to Emacs syntax classes (like punctuation, or
open/close-paren). But this is not required.
For example, the `semantic-list' type is a convenient token type that
gives a high level view of data between matching open/close-paren
characters.  There is no real correspondence between the
`semantic-list' type and an Emacs syntax class.

Maybe we could introduce the term `meta-class' to designate token
classes from which other token classes are derived.  For example,
`punctuation' is a token class (accessible via
`semantic-lex-token-class').  If other token classes like COMMA, EQEQ,
etc., are derived from it, `punctuation' naturally becomes a
meta-class ;-)  By extension, tokens in a `meta-class' would naturally
become `meta-tokens'.

> metakeyword - a token made of characters from a syntaxclass that is
> not a keyword, but is more specific than a syntax class.

See `meta-token' above ;-)  IMO, meta-keyword can be confusing because
such tokens are not related to keywords.

> matchdatatype - A description on how syntaxclass is matched against
> the raw data to produce a keyword or metakeyword

I would prefer: A token [meta-]class property that describes how a
[meta-]token value is matched against the raw data to produce a
derived token.

> property - A named value associated with a lexical token.
                                                    ^^^^^
                                                    token class

[...]

> It just looked a bit odd at first, but now I agree that the :colon
> based property names are fine.  It may be worth changing the summary
> property to :summary too, unless we want to differentiate parser
> functionality properties from application properties.

Isn't `summary' a lexer functionality (it gives a keyword a
description)?  Anyway, I think using a homogeneous notation is good
practice.  To avoid another "migration-ache" (there are already
`summary' and `javadoc', and perhaps others, which are widely used),
I propose to use `matchdatatype' without the colon prefix and continue
with that convention.

> We should do whatever makes it easiest for a new person to make their
> grammar work.
> I suspect new grammar writers are more interested in
> their grammar than in their lexical analyzer. ;)

You're certainly right ;-)

[...]

>>(defun semantic--lex-type-refinement-form ()
>>  "Return a form to refine the type of the last token found.
>>At this point, the last token found is on top of lexical stream.
>>
>>Refinement is based on more specific token definitions provided in the
>>current lexical token table for the refined type.
>>
>>If the value of the refined token matches any of the more specific
>>values, the corresponding specific token replaces the initial one
>>on top of the lexical stream.
>>
>>When the `matchdatatype' property of the refined type is the symbol
>>`string', the refined token value is compared with `equal' to each
>>specific token value.  Otherwise `string-match' is used."
>>  (let* ((tok (make-symbol "tok"))
>>         (typ (make-symbol "typ"))
>>         (val (make-symbol "val"))
>>         (lst (make-symbol "lst"))
>>         (def (make-symbol "def"))
>>         (elt (make-symbol "elt"))
>>         (pos (make-symbol "pos"))
>>         (end (make-symbol "end"))
>>         (len (make-symbol "len")))
>>    `(let* ((,tok (car semantic-lex-token-stream))
>>            (,typ (semantic-lex-token-class ,tok))
>>            (,val (semantic-lex-token-text ,tok))
>>            (,lst (semantic-lex-type-value (symbol-name ,typ) t))
>>            (,def (car ,lst)) ;; default lexical token or nil
>>            (,lst (cdr ,lst)) ;; alist of (TOKEN . MATCH-STRING)
>>            ,elt)
>>       (when ,lst
>>         ;; Search for a matching lexical token

> [ ... ]
>
> If I understand this code correctly, the goal is to take the token
> stream such as (in simplified form): ("=" "+" ...) and convert it
> into (PLUSEQUAL ...) or some such?

No, the goal is to take the last token read (on top of the token
stream) and to refine its class based on criteria from the grammar.

[...]

>>From this sample input stream:
>>
>> if $val then result = $val

> Perhaps your example could also be:
>
> if $val then result += $val
>
> as a way of adding a compound punctuation to the mix?
The doc string refers only to tokens produced by the
`semantic-lex-keyword-or-symbol-type' sample.  It doesn't handle
punctuation.  It would probably be worth adding a second example that
uses punctuation and the string `matchdatatype'.

>>It will automatically detect and return the following lexical tokens:
>>
>> (IF 1 . 3)     ; the keyword IF
>> (VAR 4 . 8)    ; a dollar variable
>> (THEN 9 . 13)  ; the keyword THEN
>> (ID 14 . 20)   ; a generic identifier

> Are you missing the (EQUAL 21 . 22) here?

See above ;-)

[...]

> It appears that the refinement form runs after every token of a given
> syntax class is found.  I suspect that nearly all analyzers will
> eventually do this except perhaps whitespace and comments.
>
> Do you think it would make sense to have a refinement occur at the
> end of every pass through the lexical analyzer?  Positioning it as
> such could allow for some good heuristics for not running the
> refinement step.

There is no need to do the refinement step for hand-written analyzers
written for speed reasons, or for those that handle things that can't
be specified with a regexp or string match
(`semantic-grammar-lex-epilogue' is a good example).  So I think it is
better that only "automatic" analyzers pay the extra cost of a
refinement step.

Thanks for all these good remarks!

David
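To make the refinement step discussed above more concrete, here is a rough, self-contained sketch of the matching logic. The function name and the (DEFAULT . ALIST) table layout are hypothetical illustrations of the idea, not the actual CEDET API:

```elisp
;; Hypothetical sketch of the token refinement step; `my-lex-refine-token'
;; and the TYPE-ENTRY layout are assumptions, not real CEDET functions.
(defun my-lex-refine-token (class text type-entry matchdatatype)
  "Refine a lexical token of CLASS whose value is TEXT.
TYPE-ENTRY is a cons (DEFAULT . ALIST), where ALIST maps specific
token classes to match strings and DEFAULT is the fallback token
class (or nil).  MATCHDATATYPE is `string' or `regexp'.
Return the refined token class, or CLASS when nothing applies."
  (let ((default (car type-entry))
        (alist (cdr type-entry))
        (case-fold-search nil)
        found)
    (while (and alist (not found))
      (when (if (eq matchdatatype 'string)
                ;; Exact comparison, cheap for punctuation.
                (equal (cdar alist) text)
              ;; Pattern comparison, needed for things like "^[$]".
              (string-match (cdar alist) text))
        (setq found (caar alist)))
      (setq alist (cdr alist)))
    (or found default class)))

;; Regexp refinement, as for: %token <symbol> DOLLARID "^[$]"
;; (my-lex-refine-token 'symbol "$val"
;;                      '(IDENTIFIER (DOLLARID . "^[$]")) 'regexp)
;;   => DOLLARID
;; String refinement, as for: %token <punctuation> EQ "="
;; (my-lex-refine-token 'punctuation "=" '(nil (EQ . "=")) 'string)
;;   => EQ
```

With a `string' matchdatatype, `equal' keeps exact lookups cheap; with `regexp', patterns can classify tokens such as dollar-prefixed variables.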
From: Eric M. L. <er...@si...> - 2003-12-15 22:47:32
Howdy,

>>> David PONCE <dav...@wa...> seems to think that:
>Hi Eric,
>
>[...]

[ ... ]

>Here is the change log:
>
>2003-12-12 David Ponce <da...@dp...> (NOT YET COMMITTED)
>
> * cedet/semantic/semantic-grammar.el
>
> (semantic-grammar-lex-symbol): Accept colon as symbol prefix.
> (semantic-grammar-anchored-indentation): Take care of colon used
> as symbol prefix.

Below you state that you were going to drop the : on the
matchdatatype property.  Is this still necessary?

Otherwise, I think it looks good.

[ ... ]

>> Even now I scratch my head.  I feel the need to attempt a
>> nomenclature clarification.
>
>Good idea!

Thanks for explaining!  I will save your message and try to get that
into the doc for lexical analysis.

Thanks
Eric

-- 
Eric Ludlam: za...@gn..., er...@si...
Home: http://www.ludlam.net  Siege: www.siege-engine.com
Emacs: http://cedet.sourceforge.net  GNU: www.gnu.org
From: David P. <dav...@wa...> - 2003-12-16 11:21:14
Hi Eric,

[...]

> Below you state that you were going to drop the : on the
> matchdatatype property.  Is this still necessary?

I think it is better for the grammar to correctly handle the colon
prefix, in case a developer wants to use it for his own "private"
properties.  We aren't compelled to use that convention ;-)

> Otherwise, I think it looks good.

OK.  I will check the changes in.

[...]

> Thanks for explaining!  I will save your message and try to get that
> into the doc for lexical analysis.

Thanks for urging me to explain ;-)  I appreciate having a break from
implementing things, and taking the time to clarify all these subtle
notions ;-)

I thought more about how to take better advantage of what is put in
the grammar to simplify the writing of lexical analyzers, and I wonder
if it would be worth exploring this new direction: directly generate
analyzers in the <language>-[wb]y.el file.  Thus, the developer would
have the opportunity either to use the generated analyzers, or to
implement his own.

The advantage would be a more efficient use of the existing lexical
API, without the need for a second analysis pass.

For example, we could imagine that these declarations in a foo.wy
grammar:

%token <symbol> DOLLARVAR "^[$]"
%token <symbol> OTHERVAR

%token <punctuation> EQ "="
%token <punctuation> NE "^="
%token <punctuation> GT ">"
%token <punctuation> GE ">="

would generate something like this in the foo-wy.el file:

(define-lex-regex-type-analyzer foo-wy--symbol-analyzer
  ;; regexp to grab symbol syntax
  "\\(\\sw\\|\\s_\\)+"
  ;; regexps to detect specific language symbols
  ((DOLLARVAR . "^[$]"))
  ;; Default token
  OTHERVAR
  "foo symbol regexp type analyzer.")

(define-lex-string-type-analyzer foo-wy--punctuation-analyzer
  ;; regexp to grab punctuation syntax
  "\\(\\s.\\|\\s$\\|\\s'\\)+"
  ;; strings to detect specific language punctuations
  '((EQ . "=")
    (NE . "^=")
    (GT . ">")
    (GE . ">="))
  ;; Default token
  'punctuation
  "foo punctuation string type analyzer.")

Using a Bison-like %type statement, we could give properties to a
<type> (and use them at generation time) like this:

%type <symbol> syntax "\\(\\sw\\|\\s_\\)+"
      matchdatatype regexp

%type <punctuation> syntax "\\(\\s.\\|\\s$\\|\\s'\\)+"
      matchdatatype string

Properties would give the syntax regexp to use to grab a sequence of
<type> characters, and the matchdatatype algorithm to use to match
specific tokens.  Other properties can be imagined for other
situations (block analysis, etc.).

For well-known <type>s, like <symbol>, <punctuation>, etc., we could
provide a default property list.  Overriding properties would be
achieved by merging the default property list and the one specified
by the %type statement.

Keywords would be handled specifically, using a built-in
`semantic-lex-keyword' analyzer that should be put before other symbol
analyzers in the lexer definition.  Consequently the %put statement
would be exclusively reserved for keyword properties.

To summarize:

- Keywords

  %keyword to define them (possibly using %token for compatibility)
  %put to assign properties

- Other tokens

  %token to define them.
  %type to assign properties

So, definitely fewer ambiguities, and more efficiency.
Oops!  And a lot of things to do ;-)

What do you think?

David
From: Eric M. L. <er...@si...> - 2003-12-16 19:42:59
Hi,

>>> David PONCE <dav...@wa...> seems to think that:
[ ... ]
>I thought more about how to take better advantage of what is put in
>the grammar to simplify the writing of lexical analyzers, and I
>wonder if it would be worth exploring this new direction: directly
>generate analyzers in the <language>-[wb]y.el file.  Thus, the
>developer would have the opportunity either to use the generated
>analyzers, or to implement his own.

This seems like a good idea.

>The advantage would be a more efficient use of the existing lexical
>API, without the need for a second analysis pass.

Ah, speed is good too.

>For example, we could imagine that these declarations in a foo.wy
>grammar:
>
>%token <symbol> DOLLARVAR "^[$]"
>%token <symbol> OTHERVAR
>
>%token <punctuation> EQ "="
>%token <punctuation> NE "^="
>%token <punctuation> GT ">"
>%token <punctuation> GE ">="
>
>would generate something like this in the foo-wy.el file:
>
>(define-lex-regex-type-analyzer foo-wy--symbol-analyzer
>  ;; regexp to grab symbol syntax
>  "\\(\\sw\\|\\s_\\)+"
>  ;; regexps to detect specific language symbols
>  ((DOLLARVAR . "^[$]"))
>  ;; Default token
>  OTHERVAR
>  "foo symbol regexp type analyzer.")
>
>(define-lex-string-type-analyzer foo-wy--punctuation-analyzer
>  ;; regexp to grab punctuation syntax
>  "\\(\\s.\\|\\s$\\|\\s'\\)+"
>  ;; strings to detect specific language punctuations
>  '((EQ . "=")
>    (NE . "^=")
>    (GT . ">")
>    (GE . ">="))
>  ;; Default token
>  'punctuation
>  "foo punctuation string type analyzer.")

Adding this would then allow you to remove the existing wisent-only
compounding mechanism for these symbols.  It seems unlikely anyone
would want to use any other mechanism for creating specific analyzers
of this nature.

Even so, it might be worth having a command in the grammar that
states:

%lex <punctuation> my-analyzer

or some such, in case of naming conflicts, though that could be
unlikely.  If someone doesn't want the auto-generated analyzer, they
could skip adding such a command.
>Using a Bison-like %type statement, we could give properties to a
><type> (and use them at generation time) like this:
>
>%type <symbol> syntax "\\(\\sw\\|\\s_\\)+"
>      matchdatatype regexp
>
>%type <punctuation> syntax "\\(\\s.\\|\\s$\\|\\s'\\)+"
>      matchdatatype string
>
>Properties would give the syntax regexp to use to grab a sequence of
><type> characters, and the matchdatatype algorithm to use to match
>specific tokens.  Other properties can be imagined for other
>situations (block analysis, etc.).

I like the idea of allowing a declaration of specific syntax regexps.
In C, I suppose I could have:

%type <ifdef> syntax "^#ifdef"
      matchdatatype string

too?

>For well-known <type>s, like <symbol>, <punctuation>, etc., we could
>provide a default property list.  Overriding properties would be

Having good defaults is important too.  The use of syntax tables will
make it unnecessary to specify a regexp most of the time.

>achieved by merging the default property list and the one specified
>by the %type statement.

I'm a little concerned about using the name "%type" for the command.
Users starting with a Bison background could be confused by this since
it is really just a fancy form of "%put".  Unfortunately I do not know
what a good alternative would be.  For example, I might expect:

%type <symbol> "[0-9]+" wholenump

or something like that.

>Keywords would be handled specifically, using a built-in
>`semantic-lex-keyword' analyzer that should be put before other
>symbol analyzers in the lexer definition.  Consequently the %put
>statement would be exclusively reserved for keyword properties.

Using:

%token THINGY "thingy"
%put THINGY summary "A useful thingy"

makes sense.  I wonder if:

%token <punctuation> EQUALEQUAL "=="
%put EQUALEQUAL summary "test for equivalence"

could also be useful iff we update things so eldoc can comment on ==.
Here are some other possibilities that make more sense with %put than
%type:

%put COMMA argumentseparator t
%put DOT typerelationseparator t

or, perhaps it would be better to have:

%set function-argument-separation-character COMMA

to declare variables in the lisp code.  Hmmm, perhaps it would be
better to leave that in the Lisp code only.

>To summarize:
>
>- Keywords
>
>  %keyword to define them (possibly using %token for compatibility)
>  %put to assign properties
>
>- Other tokens
>
>  %token to define them.
>  %type to assign properties
>
>So, definitely fewer ambiguities, and more efficiency.
>Oops!  And a lot of things to do ;-)
>
>What do you think?

[ ... ]

I think this sounds like a good idea.

Eric
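The default-property-list merging that the %type proposal describes could be sketched as follows. The names and the plist layout here are assumptions for illustration, not an existing API:

```elisp
;; Hypothetical sketch of merging built-in <type> properties with
;; those declared by a %type statement; all names are assumptions.
(defvar my-default-type-properties
  '((punctuation syntax "\\(\\s.\\|\\s$\\|\\s'\\)+" matchdatatype string)
    (symbol      syntax "\\(\\sw\\|\\s_\\)+"        matchdatatype regexp))
  "Built-in property lists for well-known token types.")

(defun my-merge-type-properties (type declared)
  "Merge the DECLARED plist over the defaults for TYPE.
Properties given in the grammar override the built-in ones."
  (let ((merged (copy-sequence
                 (cdr (assq type my-default-type-properties)))))
    (while declared
      (setq merged (plist-put merged (car declared) (cadr declared))
            declared (cddr declared)))
    merged))

;; A grammar override such as: %type <punctuation> matchdatatype regexp
;; (plist-get (my-merge-type-properties
;;             'punctuation '(matchdatatype regexp))
;;            'matchdatatype)
;;   => regexp
```

Copying the default list before `plist-put' keeps the built-in defaults intact when a grammar overrides a property.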
From: Eric M. L. <er...@si...> - 2003-12-11 00:24:01
>>> David PONCE <dav...@wa...> seems to think that: >Hi Eric, > >[...] >> That's a pretty interesting idea. Your extraction and use of the >> existing analyzer is quite clever. I had asked about the API layers >> in a previous email. It seems that the derived lexical analyzer is >> still a part of the core lexical API as opposed in some intermediate >> layer. That's probably fine. There seems to be a lot of lexical >> generated tables and code already. >> >> In your code: >> >> >>> ;; Search for a matching lexical token >>> (while (and ,lst (not ,elt)) >>> (setq ,elt (and (string-match (cdar ,lst) ,val) (caar ,lst)) >>> ,lst (cdr ,lst))) >> >> >> would an obarray or hash table be better? The keyword table is >> quite successful. I know that in your sample you are trying to match >> "^$" as VAR. That feature is important, but I think that explicit >> string matches is more common and could be made faster for the >> punctuation types. Something separate for symbols and lists may be in >> order. > >You're right. That's funny because I already implemented a similar >solution in the old `wisent-flex' lexer. Perhaps could we use the >same approach here. To distinguish between string and regexp matches, >`wisent-flex' used properties of symbols in the token table (which is >an obarray of the token type symbols). > >By default certain token types, like punctuation, were setup to use >string matches (this is the purpose of `wisent-lex-make-token-table' >compared to stock `semantic-lex-make-type-table', but it will be >easy to do that in `semantic-lex-make-type-table' and remove >`wisent-lex-make-token-table'). > >The advantage of that design is its simplicity, and especially that >it allows customization using grammar %PUT statements. That seems like a really good idea. Changing properties of lexical symbols is what the %put command is all about. 
>For example you could have:
>
>%token <punctuation> COMMA ","
>%token <punctuation> EQ "="
>
>By default it is assumed that there is an implicit
>
>%PUT punctuation string t
>
>which, for speed, indicates to recognize punctuation using string
>matches (a la `semantic-lex-punctuation-type').
>
>But you could also have something like this:
>
>%token <punctuation> COMPARATOR "[<>][=]?"
>%put punctuation string nil

Perhaps you mean:

%put COMPARATOR string nil

?

>that indicates to use regexp matches to recognize punctuation.

%put THING string t

seems good, but

%put THING string nil

does not say "regexp" to me.  Perhaps something like this:

%put THING lexicalcomparetype string

or

%put THING matchdatatype regexp

would be better?

>Depending on the `string' property of the token type symbol, it should
>be easy for `define-derived-lex-type-analyzer' to generate the ad-hoc
>match algorithm.

[ ... ]

>> Also, it appears this would not work for compound tokens like "=>"
>> as this analyzer would only work in character groups defined by the
>> originating analyzer.  Is this assumption true?
>
>I don't think so.  The "string matches" algorithm used in
>`semantic-lex-punctuation-type' is well suited to matching compound
>punctuation ;-)

[ ... ]

I thought the entire raw lexical stream was compounded by the
wisent-lex layer.  If you use the default punctuation analyzer, it
will only ever match a single character.  You would need to extend a
different punctuation system that knows to combine => but not other
symbols that make no sense, like >=.

I like the direction your proposed function is going.  Very nice.

Eric
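For reference, the obarray-based lookup suggested in this exchange (in the style of the semantic keyword table) might be sketched like this. The names are hypothetical illustrations, not the actual keyword-table API:

```elisp
;; Hypothetical sketch of an obarray-backed token table; the names
;; `my-token-obarray' etc. are assumptions, not real CEDET functions.
(defvar my-token-obarray (make-vector 13 0)
  "Obarray interning exact token match strings, e.g. \",\" or \"==\".")

(defun my-token-table-add (match token)
  "Associate the exact MATCH string with the lexical TOKEN class."
  (set (intern match my-token-obarray) token))

(defun my-token-table-lookup (text)
  "Return the token class for exact TEXT, or nil if none.
Interned-symbol lookup avoids walking an alist of match strings."
  (let ((sym (intern-soft text my-token-obarray)))
    (and sym (symbol-value sym))))

;; (my-token-table-add "," 'COMMA)
;; (my-token-table-add "==" 'EQEQ)
;; (my-token-table-lookup "==") => EQEQ
;; (my-token-table-lookup "!")  => nil
```

This covers exact string matches only; regexp-matched types such as COMPARATOR would still need a separate list scan, which is the string/regexp split the `matchdatatype' property encodes.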