Re[2]: [CEDET-devel] generic lexical analyzers

Hi,

>>> David PONCE <dav...@wa...> seems to think that:
  [ ... ]
>I thought more on how to take more benefit of what is put in the
>grammar to simplify writing of lexical analyzers, and I wonder if it
>would be worth exploring this new direction: directly generate
>analyzers in the <language>-[wb]y.el file.  Thus, the developer would
>have the opportunity to either use generated analyzers, or implement
>its own ones.

This seems like a good idea.

>The advantage would be a more efficient use of the existing lexical
>API, without the need of a second pass analysis.

Ah, speed is good too.

>For example we could imagine that these declarations in a foo.wy
>grammar:
>
>%token <symbol>      DOLLARVAR "^[$]"
>%token <symbol>      OTHERVAR
>
>%token <punctuation> EQ        "="
>%token <punctuation> NE        "^="
>%token <punctuation> GT        ">"
>%token <punctuation> GE        ">="
>
>would generate something like this in the foo-wy.el file:
>
>(define-lex-regex-type-analyzer foo-wy--symbol-analyzer
>  ;; regexp to grab symbol syntax
>  "\\(\\sw\\|\\s_\\)+"
>  ;; regexps to detect specific language symbols
>  ((DOLLARVAR . "^[$]"))
>  ;; Default token
>  OTHERVAR
>  "foo symbol regexp type analyzer.")
>
>(define-lex-string-type-analyzer foo-wy--punctuation-analyzer
>  ;; regexp to grab punctuation syntax
>  "\\(\\s.\\|\\s$\\|\\s'\\)+"
>  ;; strings to detect specific language punctuations
>  '((EQ . "=")
>    (NE . "^=")
>    (GT . ">")
>    (GE . ">="))
>  ;; Default token
>  'punctuation
>  "foo punctuation string type analyzer.")

Adding this would then allow you to remove the existing wisent-only
compounding mechanism for these symbols.  It seems unlikely that anyone
would want to use any other mechanism for creating specific analyzers
of this nature.
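
To make the matching step concrete, here is a rough sketch in plain
Emacs Lisp of what a generated string type analyzer might do with a run
of punctuation once the syntax regexp has grabbed it.  The function name
is made up for illustration and is not part of semantic; the longest
match rule is an assumption about how ">=" should win over ">".

```elisp
;; Hypothetical helper, not an existing semantic function.
(defun my-sketch-split-punctuation (text table default)
  "Split TEXT, a run of punctuation characters, into tokens.
TABLE is an alist of (TOKEN . STRING).  At each position the longest
STRING found at that position wins.  If no entry matches, a single
character is emitted with the DEFAULT token.
Return a list of (TOKEN . STRING) in source order."
  (let ((pos 0)
        (end (length text))
        (tokens nil))
    (while (< pos end)
      (let ((best nil))
        ;; Look for the longest table entry that matches at POS.
        (dolist (entry table)
          (let* ((str (cdr entry))
                 (len (length str)))
            (when (and (> len 0)
                       (<= (+ pos len) end)
                       (string= str (substring text pos (+ pos len)))
                       (or (null best) (> len (length (cdr best)))))
              (setq best entry))))
        ;; Nothing matched: fall back to one character of DEFAULT.
        (unless best
          (setq best (cons default (substring text pos (1+ pos)))))
        (push (cons (car best) (cdr best)) tokens)
        (setq pos (+ pos (length (cdr best))))))
    (nreverse tokens)))
```

With the table from the foo.wy example, the text ">==" would come out as
((GE . ">=") (EQ . "=")), and a character not in the table, such as "+",
would come out as (punctuation . "+").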

Even so, it might be worth having a command in the grammar that states:

%lex <punctuation> my-analyzer

or some such in case of naming conflicts, though that could be
unlikely.

If someone doesn't want the auto-generated analyzer, they could skip
adding such a command.

>Using a bison like %type statement we could give properties to a
><type> (and use them at generation time) like this:
>
>%type <symbol> syntax "\\(\\sw\\|\\s_\\)+"
>               matchdatatype regexp
>
>%type <punctuation> syntax "\\(\\s.\\|\\s$\\|\\s'\\)+"
>                    matchdatatype string
>
>Properties would give the syntax regexp to use to grab a sequence of
><type> characters, and the matchdatatype algorithm to use to match
>specific tokens.  Other properties can be imagined for other
>situations (block analysis, etc.).

I like the idea of allowing a declaration of specific syntax
regexps.  In C, I suppose I could have:

%type <ifdef> syntax "^#ifdef"
              matchdatatype string

too?

>For well known <type>, like <symbol>, <punctuation>, etc., we could
>provide a default property list.  Overriding properties would be

Having good defaults is important too.  The use of syntax tables will
make it unnecessary to specify a regexp most of the time.

>achieved by merging the default property list and the one specified
>by the %type statement.
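
The merge described here could be as small as the sketch below.  It
assumes the properties are kept as an ordinary Lisp property list; the
function name is invented for illustration, and the real generator may
well store things differently.

```elisp
;; Hypothetical helper, not an existing semantic function.
(defun my-sketch-merge-type-properties (defaults overrides)
  "Return a new plist made of DEFAULTS updated with OVERRIDES.
Both arguments are property lists.  A property present in OVERRIDES
replaces the one in DEFAULTS; a new property is appended.
DEFAULTS itself is not modified."
  ;; Copy first, because `plist-put' may change its argument in place.
  (let ((result (copy-sequence defaults)))
    (while overrides
      (setq result (plist-put result (car overrides) (cadr overrides)))
      (setq overrides (cddr overrides)))
    result))
```

For instance, merging the default (syntax "\\(\\sw\\|\\s_\\)+"
matchdatatype regexp) with a %type that only says (matchdatatype string)
keeps the default syntax regexp and switches the matching algorithm to
string.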

I'm a little concerned about using the name "%type" for the command.
Users starting with a bison background could be confused by this
since it is really just a fancy form of "%put".  Unfortunately I do
not know what a good alternative would be.

For example, I might expect:

%type <symbol> "[0-9]+" wholenump

or something like that.

>Keywords would be handled specifically using a built-in
>`semantic-lex-keyword' analyzer that should be put before other symbol
>analyzers in the lexer definition.  Consequently the
>%put statement would be exclusively reserved for keyword properties.
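
The reason the keyword analyzer has to come first can be shown with a
small sketch: the first analyzer that accepts a word wins, so a keyword
must be tested before the symbol regexps get a chance to claim it.  The
function below is only an illustration under that assumption; its name
and argument layout are invented and do not reflect the actual
`semantic-lex-keyword' implementation.

```elisp
;; Hypothetical helper, not an existing semantic function.
(defun my-sketch-classify-word (word keywords symbol-table default)
  "Return the token for WORD, trying keywords before symbol regexps.
KEYWORDS is an alist of (STRING . TOKEN).
SYMBOL-TABLE is an alist of (TOKEN . REGEXP), tried in order.
DEFAULT is the token returned when nothing else matches."
  (or
   ;; 1. Keywords are exact string matches and take priority.
   (cdr (assoc word keywords))
   ;; 2. Otherwise the first symbol regexp that matches wins.
   (let ((found nil))
     (dolist (entry symbol-table)
       (when (and (null found)
                  (string-match-p (cdr entry) word))
         (setq found (car entry))))
     found)
   ;; 3. Fall back to the default token.
   default))
```

With keywords (("if" . IF)), the symbol table ((DOLLARVAR . "^[$]")) and
OTHERVAR as the default, "if" is classified as IF, "$x" as DOLLARVAR and
"foo" as OTHERVAR.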

Using:

%token THINGY "thingy"
%put THINGY summary "A useful thingy"

makes sense.  I wonder if:

%token <punctuation> EQUALEQUAL "=="
%put EQUALEQUAL summary "test for equivalence"

could also be useful iff we update things so eldoc can comment on ==.

Here are some other possibilities that make more sense with %put
than with %type:

%put COMMA argumentseparator t
%put DOT typerelationseparator t

or, perhaps it would be better to have:

%set function-argument-separation-character COMMA

to declare variables in the Lisp code.

Hmmm, perhaps it would be better to leave that in the Lisp code only.

>To summarize:
>
>- Keywords
>  
>  %keyword to define them (possibly using %token for compatibility)
>  %put to assign properties
>
>- Other tokens
>
>  %token to define them.
>  %type to assign properties
>
>So definitively less ambiguities, and more efficiency.
>Oops!  And a lot of things to do ;-)
>
>What do you think?
  [ ... ]

I think this sounds like a good idea.

Eric

-- 
          Eric Ludlam:                 za...@gn..., er...@si...
   Home: http://www.ludlam.net            Siege: www.siege-engine.com
Emacs: http://cedet.sourceforge.net               GNU: www.gnu.org