Thread: Re: [CEDET-devel] generic lexical analyzers
From: David P. <dav...@wa...> - 2003-12-10 11:10:56
Hi Eric,

[...]

> That's a pretty interesting idea. Your extraction and use of the
> existing analyzer is quite clever. I had asked about the API layers
> in a previous email. It seems that the derived lexical analyzer is
> still a part of the core lexical API as opposed to some intermediate
> layer. That's probably fine. There seem to be a lot of lexically
> generated tables and code already.
>
> In your code:
>
>> ;; Search for a matching lexical token
>> (while (and ,lst (not ,elt))
>>   (setq ,elt (and (string-match (cdar ,lst) ,val) (caar ,lst))
>>         ,lst (cdr ,lst)))
>
> would an obarray or hash table be better? The keyword table is
> quite successful. I know that in your sample you are trying to match
> "^$" as VAR. That feature is important, but I think that explicit
> string matches are more common and could be made faster for the
> punctuation types. Something separate for symbols and lists may be
> in order.

You're right. That's funny, because I already implemented a similar
solution in the old `wisent-flex' lexer. Perhaps we could use the
same approach here.

To distinguish between string and regexp matches, `wisent-flex' used
properties of symbols in the token table (which is an obarray of the
token type symbols). By default certain token types, like
punctuation, were set up to use string matches (this is the purpose
of `wisent-lex-make-token-table' compared to the stock
`semantic-lex-make-type-table'; but it would be easy to do that in
`semantic-lex-make-type-table' and remove
`wisent-lex-make-token-table').

The advantage of that design is its simplicity, and especially that
it allows customization using grammar %PUT statements. For example
you could have:

%token <punctuation> COMMA ","
%token <punctuation> EQ "="

By default it is assumed that there is an implicit

%put punctuation string t

which, for speed, indicates to recognize punctuation using string
matches (a la `semantic-lex-punctuation-type').
But you could also have something like this:

%token <punctuation> COMPARATOR "[<>][=]?"
%put punctuation string nil

which indicates to use regexp matches to recognize punctuation.
Depending on the `string' property of the token type symbol, it
should be easy for `define-derived-lex-type-analyzer' to generate the
ad-hoc match algorithm.

> Question: Couldn't there now be several default analyzers for each
> of the major lexical types, like punctuation? It appears that when
> the macro-generated analyzer runs, it is important for the lexical
> table of types to be active. Thus, a macro generated with
> `define-derived-lex-type-analyzer' could be used in any language.

That's my goal ;-) And that's why I call such analyzers "generic":
they don't depend on the language, but on what the current syntax and
token tables provide.

> Also, it appears this would not work for compound tokens like "=>",
> as this analyzer would only work in character groups defined by the
> originating analyzer. Is this assumption true?

I don't think so. The "string matches" algorithm used in
`semantic-lex-punctuation-type' is particularly well adapted to
matching compound punctuation ;-)

> Anyway, I think it looks good. Please check it in. Thanks.

OK, I just prefer to wait a little for your feedback on my proposal
above about the use of properties to drive the matching method.

David
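[Editor's note] The property-driven dispatch David describes — an implicit `%put punctuation string t' selecting fast exact-string matches, and `string nil' switching a type over to regexp matches — can be sketched outside Emacs Lisp. The following Python sketch is purely illustrative: the table layout and all names are assumptions for exposition, not the actual Semantic API.

```python
import re

# Hypothetical type table: type name -> (use-string-matches?, alist of
# (TOKEN, match-value)), mirroring properties set via %put statements.
TYPE_TABLE = {
    "punctuation": (True, [("COMMA", ","), ("EQ", "=")]),   # %put punctuation string t
    "comparator": (False, [("COMPARATOR", r"[<>][=]?")]),   # %put ... string nil
}

def match_type(type_name, text):
    """Return the specific token for TEXT, or None if nothing matches."""
    string_mode, alist = TYPE_TABLE[type_name]
    for token, value in alist:
        if string_mode:
            # String mode: exact comparison, fast for punctuation.
            if text == value:
                return token
        # Regexp mode: needed for open-ended classes like comparators.
        elif re.search(value, text):
            return token
    return None
```

For instance, `match_type("punctuation", ",")` hits the exact-string branch, while `match_type("comparator", "<=")` falls through to the regexp branch — the same split the `string' property would drive.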
From: David P. <dav...@wa...> - 2003-12-11 11:09:36
Hi Eric,

[...]

> That seems like a really good idea. Changing properties of lexical
> symbols is what the %put command is all about.

[...]

>> %put punctuation string nil
>
> Perhaps you mean:
>
> %put COMPARATOR string nil
>
> ?

I really meant "%put punctuation ..."

In fact, what I call the lexical token table is actually a lexical
type table, that is, an obarray of lexical type symbols which have
token definitions as values. For example:

%token <punctuation> COMMA ","
%token <punctuation> SEMI ";"

declare the `punctuation' symbol in the table of token types, and
give it the value:

(nil (COMMA . ",") (SEMI . ";"))

The first element represents a default token value returned when a
punctuation doesn't match any of the values supplied in the alist
part. It is declared like this:

%token <punctuation> OTHERPUNCT ;; Notice there is no value

and the `punctuation' symbol value then will be:

(OTHERPUNCT (COMMA . ",") (SEMI . ";"))

When a default value is not specified (nil), the analyzer should
return the lexical type symbol (in this example: `punctuation') as
the default token.

From a certain point of view, the table of lexical types can be
viewed as an extension of Emacs syntax tables, where lexical types
match Emacs syntax classes (punctuation, symbol, semantic-list,
comment, string, etc.).

IMO, for general purpose lexical tokens, that organization is more
flexible than the one used for lexical keywords, which are unique by
definition (a direct match between a value and a token). For general
purpose tokens obtained through regexp/string matching, a full
"hash-table" approach does not seem practical.

Also, the notion of semantic lexical types matches well with the
notion of token data types used by bison. That could facilitate
things ;-)

[...]

> %put THING string nil
>
> does not say "regexp" to me. Perhaps something like this:
>
> %put THING lexicalcomparetype string
>
> or
>
> %put THING matchdatatype regexp
>
> would be better?

I agree that `string' is not a very good name.
In my idea it designated the data type of token values (`string'
versus `regexp'). I prefer `matchdatatype', which is closer to my
initial idea. What do you think of:

%put lexical-type :matchdatatype string
%put lexical-type :matchdatatype regexp   (would be the default)

Prefixing the property with a colon would be nice for syntax
highlighting ;-) (It also needs a minor fix in semantic-grammar.el
that I will check in soon.)

[...]

> I thought the entire raw lexical stream was compounded by the
> wisent-lex layer. If you use the default punctuation analyzer, it
> will only ever match a single character. You would need to extend
> a different punctuation system that knows to combine => but not
> other symbols that make no sense, like >=.

Very good point. Probably, to get `define-derived-lex-type-analyzer'
to work with punctuation, we would need an alternate syntax analyzer
that grabs a succession of punctuation characters. Something like:

(define-lex-simple-regex-analyzer semantic-lex-compound-punctuation
  "Detect and create compound punctuation tokens."
  "\\(\\s.\\|\\s$\\|\\s'\\)+" 'punctuation)

Otherwise it remains possible to directly use
`semantic-lex-punctuation-type', which is fine at handling compound
punctuation.

> I like the direction your proposed function is going. Very nice.

Thanks!

David
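[Editor's note] The type-table shape David describes — a default token followed by an alist of specific values, with the lexical type symbol itself as last resort — reduces to a small lookup. A hedged Python sketch follows; the data layout mirrors the Lisp value `(OTHERPUNCT (COMMA . ",") (SEMI . ";"))', and the function name is illustrative.

```python
# Mirrors the Lisp value (OTHERPUNCT (COMMA . ",") (SEMI . ";")):
# first element is the default token (or None), the rest is the alist.
PUNCTUATION = ("OTHERPUNCT", [("COMMA", ","), ("SEMI", ";")])

def resolve(type_name, entry, text):
    """Resolve TEXT to a specific token, the declared default token,
    or, failing both, the lexical type symbol itself."""
    default, alist = entry
    for token, value in alist:
        if text == value:
            return token
    # No specific match: declared default, else the type symbol.
    return default if default is not None else type_name
```

With no OTHERPUNCT declared the entry would be `(None, [...])`, and `resolve` returns "punctuation" itself — the fallback behavior described in the message above.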
From: Eric M. L. <er...@si...> - 2003-12-11 14:20:57
>>> David PONCE <dav...@wa...> seems to think that:
> Hi Eric,
>
> [...]
>> That seems like a really good idea. Changing properties of lexical
>> symbols is what the %put command is all about.
> [...]
>>> %put punctuation string nil
>>
>> Perhaps you mean:
>>
>> %put COMPARATOR string nil
>>
>> ?
>
> I really meant "%put punctuation ..."
>
> In fact what I call the lexical token table is actually a lexical
> type table, that is an obarray of lexical type symbols which have
> token definitions as values.
>
> For example:
>
> %token <punctuation> COMMA ","
> %token <punctuation> SEMI ";"
>
> declare the `punctuation' symbol in the table of token types, and
> give it the value:
>
> (nil (COMMA . ",") (SEMI . ";"))
>
> The first element represents a default token value returned when a
> punctuation doesn't match any of the values supplied in the alist
> part. It is declared like this:
>
> %token <punctuation> OTHERPUNCT ;; Notice there is no value
>
> and the `punctuation' symbol value then will be:
>
> (OTHERPUNCT (COMMA . ",") (SEMI . ";"))
>
> When a default value is not specified (nil), the analyzer should
> return the lexical type symbol (in this example: `punctuation') as
> the default token.

Aha. It seemed a bit odd to me that you would be %PUTing something
onto a symbol that was not declared with %token, but that makes
sense. Explaining in the doc that the 'matchdatatype' property only
affects this special token (which is often implied) as a means for
identifying all other tokens in that class seems a bit convoluted.
Your explanation here makes sense to me, but I was confused at first.

%put is the right way to do it IMHO, but perhaps there is a way that
is more consistent.

> From a certain point of view the table of lexical types can be
> viewed as an extension of Emacs syntax tables, where lexical types
> match Emacs syntax classes (punctuation, symbol, semantic-list,
> comment, string, etc.).
Indeed, I've often thought that a C-level lexical analyzer, built
into Emacs as a command, would be very nice. As it stands, our
lexical analyzer is pretty fast though.

> IMO, for general purpose lexical tokens, that organization is more
> flexible than the one used for lexical keywords, which are unique
> by definition (a direct match between a value and a token). For
> general purpose tokens obtained through regexp/string matching, a
> full "hash-table" approach does not seem practical.
>
> Also, the notion of semantic lexical types matches well with the
> notion of token data types used by bison. That could facilitate
> things ;-)

I agree. Consistency with bison helps with the learning of the new
system.

> [...]
>> %put THING string nil
>>
>> does not say "regexp" to me. Perhaps something like this:
>>
>> %put THING lexicalcomparetype string
>>
>> or
>>
>> %put THING matchdatatype regexp
>>
>> would be better?
>
> I agree that `string' is not a very good name. In my idea it
> designated the data type of token values (`string' versus
> `regexp'). I prefer `matchdatatype', which is closer to my initial
> idea. What do you think of:
>
> %put lexical-type :matchdatatype string
> %put lexical-type :matchdatatype regexp (would be the default)

"matchdatatype" seems like a good word to me, unless compiler manuals
use some other term for when text in a stream is matched lexically,
or some other term for the type or style of the match. I do not
recall any such term from my days as a compiler writer.

> Prefixing the property with a colon would be nice for syntax
> highlighting ;-) (It also needs a minor fix in semantic-grammar.el
> that I will check in soon.)

The colon is a rather important operator / syntax element in the
metagrammar. It seems a bit odd to use it as a symbol constituent.
Of course, we already have a dual-language grammar mixing Emacs Lisp
and LALR, so doing so depends on which language this element belongs
to.
If the property in question is used mostly as a "slot" or field in
Emacs Lisp, the colon is standard. If it is a variable that is
relevant to the grammar itself, then $ seems like a more reasonable
prefix. Another symbol prefix we've used is %, but that's for grammar
declaration functions. Lastly, a naked symbol for other things.

Hmmm, perhaps I convinced myself that : is a good prefix.

> [...]
>> I thought the entire raw lexical stream was compounded by the
>> wisent-lex layer. If you use the default punctuation analyzer, it
>> will only ever match a single character. You would need to extend
>> a different punctuation system that knows to combine => but not
>> other symbols that make no sense, like >=.
>
> Very good point. Probably, to get
> `define-derived-lex-type-analyzer' to work with punctuation, we
> would need an alternate syntax analyzer that grabs a succession of
> punctuation characters. Something like:
>
> (define-lex-simple-regex-analyzer semantic-lex-compound-punctuation
>   "Detect and create compound punctuation tokens."
>   "\\(\\s.\\|\\s$\\|\\s'\\)+" 'punctuation)
>
> Otherwise it remains possible to directly use
> `semantic-lex-punctuation-type', which is fine at handling compound
> punctuation.

It seems reasonable to me to have our default lexical analyzer match
a sequence of punctuation, and call `semantic-lex-push-token'
multiple times as needed. Perhaps that would even be faster than the
current one.

Eric

--
Eric Ludlam: za...@gn..., er...@si...
Home: http://www.ludlam.net            Siege: www.siege-engine.com
Emacs: http://cedet.sourceforge.net    GNU: www.gnu.org
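[Editor's note] Eric's closing suggestion — match a whole run of punctuation, then push several tokens so that `=>' stays compound while an undeclared `>=' splits apart — amounts to a greedy longest-match split. The sketch below is illustrative only: the character set stands in for the Emacs punctuation syntax classes, and `DECLARED` for values coming from %token declarations.

```python
PUNCT_CHARS = set(".,;:!?<>=+-*/&|^%~")   # stand-in for Emacs `\s.'-style classes
DECLARED = {"=>", "=", ">", "+", "+="}    # stand-in for %token-declared values
MAXLEN = max(len(v) for v in DECLARED)

def scan_punctuation(buf, pos):
    """Return the punctuation run starting at POS, or '' if none."""
    end = pos
    while end < len(buf) and buf[end] in PUNCT_CHARS:
        end += 1
    return buf[pos:end]

def split_run(run):
    """Split RUN into declared tokens, longest match first, so '=>'
    stays compound while '>=' (undeclared) splits into '>' '='."""
    tokens, i = [], 0
    while i < len(run):
        for size in range(min(MAXLEN, len(run) - i), 0, -1):
            if run[i:i + size] in DECLARED:
                tokens.append(run[i:i + size])
                i += size
                break
        else:
            tokens.append(run[i])  # unknown character: emit it alone
            i += 1
    return tokens
```

Each element of the result would correspond to one `semantic-lex-push-token' call in Eric's proposal.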
From: David P. <dav...@wa...> - 2003-12-12 08:29:47
Hi Eric,

[...]

> Explaining in the doc that the 'matchdatatype' property only affects
> this special token (which is often implied) as a means for
> identifying all other tokens in that class seems a bit convoluted.
> Your explanation here makes sense to me, but I was confused at
> first.
>
> %put is the right way to do it IMHO, but perhaps there is a way
> that is more consistent.

I recognize there is an ambiguity in the behavior of the %put
statement, which differs between keywords and tokens. IMO, that
reflects another ambiguity related to the use of %token to declare
both keywords and general purpose tokens:

%token IF                  -> token
%token <symbol> IF "if"    -> token
%token IF "if"             -> keyword!

Bison grammars, for example, don't suffer from such an ambiguity,
because there is no difference between keywords and other tokens.
Only the lexer knows the difference, and it has its own input
grammar. I don't think Semantic would benefit from separating the
lexical grammar from the syntactic one. What I propose is to
introduce (and encourage the use of) a new `%keyword' statement to
declare language keywords:

%keyword IF "if"

It would be a simple alias of the form %token IF "if" (for
compatibility), but it would be far less ambiguous. And the
semantics of the %put statement would be clearer:

%keyword IF "if"
%put IF property value

%token <symbol> ID "[a-zA-Z0-9]+"
%put symbol property value

By tweaking the metagrammar a little, it should even be possible to
allow less ambiguous forms, like:

%put <symbol> matchdatatype regexp
%put { <punctuation> <open-paren> } matchdatatype string

> "matchdatatype" seems like a good word to me, unless compiler
> manuals use some other term for when text in a stream is matched
> lexically, or some other term for the type or style of the match.
>
> I do not recall any such term from my days as a compiler writer.
I do not recall either, so I adopt "matchdatatype" ;-)

> The colon is a rather important operator / syntax element in the
> metagrammar. It seems a bit odd to use it as a symbol constituent.
>
> Of course, we already have a dual-language grammar mixing Emacs
> Lisp and LALR, so doing so depends on which language this element
> belongs to.
>
> If the property in question is used mostly as a "slot" or field in
> Emacs Lisp, the colon is standard. If it is a variable that is
> relevant to the grammar itself, then $ seems like a more reasonable
> prefix. Another symbol prefix we've used is %, but that's for
> grammar declaration functions. Lastly, a naked symbol for other
> things.
>
> Hmmm, perhaps I convinced myself that : is a good prefix.

I just used the : prefix because we already use it for built-in
attributes/properties in tags, and because keywords prefixed with :
are nicely highlighted. However, I don't have any problem with using
`matchdatatype' if you prefer ;-)

FYI, the lexer can easily differentiate the colon used as a symbol
prefix from the colon used as punctuation (I already made the
necessary small change in semantic-grammar.el). And after all, Emacs
itself has the notion of colon-prefixed keywords ;-)

[...]

> It seems reasonable to me to have our default lexical analyzer
> match a sequence of punctuation, and call `semantic-lex-push-token'
> multiple times as needed. Perhaps that would even be faster than
> the current one.

Even if we do what you propose, I think a simple compound analyzer
that aggregates punctuation would be simpler to handle for
`define-derived-lex-type-analyzer', which then wouldn't have to
re-aggregate punctuation characters from the token stream before
trying to match them against specific lexical token values.

Thanks!

David
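[Editor's note] The %token/%keyword ambiguity laid out above is mechanical enough to express as a tiny classifier. A sketch under stated assumptions — the tuple encoding of a declaration is invented for illustration, not part of any grammar tool:

```python
def classify(type_, name, value):
    """Classify a declaration as 'keyword' or 'token', following the
    rule David describes: a %token with a value but no <type> part is
    really a keyword.

      %token IF               -> token
      %token <symbol> IF "if" -> token
      %token IF "if"          -> keyword!
    """
    if type_ is None and value is not None:
        return "keyword"
    return "token"
```

With an explicit %keyword statement, the third case would no longer need this special-casing: `%keyword IF "if"` says what it means.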
From: Eric M. L. <er...@si...> - 2003-12-14 15:00:05
>>> David PONCE <dav...@wa...> seems to think that:
> Hi Eric,
>
> [...]
>> Explaining in the doc that the 'matchdatatype' property only
>> affects this special token (which is often implied) as a means for
>> identifying all other tokens in that class seems a bit convoluted.
>> Your explanation here makes sense to me, but I was confused at
>> first.
>>
>> %put is the right way to do it IMHO, but perhaps there is a way
>> that is more consistent.
>
> I recognize there is an ambiguity in the behavior of the %put
> statement, which differs between keywords and tokens.
>
> IMO, that reflects another ambiguity related to the use of %token
> to declare both keywords and general purpose tokens:
>
> %token IF                  -> token
> %token <symbol> IF "if"    -> token
> %token IF "if"             -> keyword!

Heh, yes, that is why I hope to avoid adding more ambiguities. ;)

> Bison grammars, for example, don't suffer from such an ambiguity,
> because there is no difference between keywords and other tokens.
> Only the lexer knows the difference, and it has its own input
> grammar.
>
> I don't think Semantic would benefit from separating the lexical
> grammar from the syntactic one. What I propose is to introduce (and
> encourage the use of) a new `%keyword' statement to declare
> language keywords:
>
> %keyword IF "if"

What a grand idea!

> It would be a simple alias of the form %token IF "if" (for
> compatibility), but it would be far less ambiguous. And the
> semantics of the %put statement would be clearer:
>
> %keyword IF "if"
> %put IF property value
>
> %token <symbol> ID "[a-zA-Z0-9]+"
> %put symbol property value
>
> By tweaking the metagrammar a little, it should even be possible to
> allow less ambiguous forms, like:
>
> %put <symbol> matchdatatype regexp
> %put { <punctuation> <open-paren> } matchdatatype string

Even now I scratch my head. I feel the need to attempt a
nomenclature clarification.
token - Something produced by the lexer.

metatoken - Something produced by compounding the output of the
lexer, but not produced by the grammar.

keyword - A token made of symbol characters that represents an exact
textual match.

syntaxclass - A token produced by the lexer that represents a syntax
class, such as <punctuation>.

metakeyword - A token made of characters from a syntaxclass that is
not a keyword, but is more specific than a syntax class.

matchdatatype - A description of how a syntaxclass is matched against
the raw data to produce a keyword or metakeyword.

property - A named value associated with a lexical token.

Ahh, that's better. Since keyword, syntaxclass, and metakeyword are
all tokens, it is OK to %put properties on them. I guess it is OK to
use the %token command to declare them as well. Some better names
might be in order though.

[ ... ]

>> The colon is a rather important operator / syntax element in the
>> metagrammar. It seems a bit odd to use it as a symbol constituent.
>>
>> Of course, we already have a dual-language grammar mixing Emacs
>> Lisp and LALR, so doing so depends on which language this element
>> belongs to.
>>
>> If the property in question is used mostly as a "slot" or field in
>> Emacs Lisp, the colon is standard. If it is a variable that is
>> relevant to the grammar itself, then $ seems like a more
>> reasonable prefix. Another symbol prefix we've used is %, but
>> that's for grammar declaration functions. Lastly, a naked symbol
>> for other things.
>>
>> Hmmm, perhaps I convinced myself that : is a good prefix.
>
> I just used the : prefix because we already use it for built-in
> attributes/properties in tags, and because keywords prefixed with :
> are nicely highlighted. However, I don't have any problem with
> using `matchdatatype' if you prefer ;-)
>
> FYI, the lexer can easily differentiate the colon used as a symbol
> prefix from the colon used as punctuation (I already made the
> necessary small change in semantic-grammar.el).
> And after all, Emacs itself has the notion of colon-prefixed
> keywords ;-)

It just looked a bit odd at first, but now I agree that the :colon
based property names are fine. It may be worth changing the summary
property to :summary too, unless we want to differentiate parser
functionality properties from application properties.

> [...]
>> It seems reasonable to me to have our default lexical analyzer
>> match a sequence of punctuation, and call
>> `semantic-lex-push-token' multiple times as needed. Perhaps that
>> would even be faster than the current one.
>
> Even if we do what you propose, I think a simple compound analyzer
> that aggregates punctuation would be simpler to handle for
> `define-derived-lex-type-analyzer', which then wouldn't have to
> re-aggregate punctuation characters from the token stream before
> trying to match them against specific lexical token values.

[ ... ]

We should do whatever makes it easiest for a new person to make their
grammar work. I suspect new grammar writers are more interested in
their grammar than in their lexical analyzer. ;)

>>> David PONCE <dav...@wa...> seems to think that:
> Eric,
>
> Here is the new implementation (not yet tested) of
> `define-derived-lex-type-analyzer' that takes into account the
> `matchdatatype' property of the token lexical type.
>
> Of course, I would appreciate your feedback very much ;-)

[ ... ]

> (defun semantic--lex-type-refinement-form ()
>   "Return a form to refine the type of the last token found.
> At this point, the last token found is on top of the lexical stream.
>
> Refinement is based on more specific token definitions provided in
> the current lexical token table for the refined type.
>
> If the value of the refined token matches any of the more specific
> values, the corresponding specific token replaces the initial one
> on top of the lexical stream.
>
> When the `matchdatatype' property of the refined type is the symbol
> `string', the refined token value is compared with `equal' to each
> specific token value.
> Otherwise `string-match' is used."
>   (let* ((tok (make-symbol "tok"))
>          (typ (make-symbol "typ"))
>          (val (make-symbol "val"))
>          (lst (make-symbol "lst"))
>          (def (make-symbol "def"))
>          (elt (make-symbol "elt"))
>          (pos (make-symbol "pos"))
>          (end (make-symbol "end"))
>          (len (make-symbol "len")))
>     `(let* ((,tok (car semantic-lex-token-stream))
>             (,typ (semantic-lex-token-class ,tok))
>             (,val (semantic-lex-token-text ,tok))
>             (,lst (semantic-lex-type-value (symbol-name ,typ) t))
>             (,def (car ,lst)) ;; default lexical token or nil
>             (,lst (cdr ,lst)) ;; alist of (TOKEN . MATCH-STRING)
>             ,elt)
>        (when ,lst
>          ;; Search for a matching lexical token

[ ... ]

If I understand this code correctly, the goal is to take a token
stream such as (in simplified form): ("=" "+" ...) and convert it
into (PLUSEQUAL ...) or some such? Perhaps the code should be
organized as such:

(let ((alltokensofsameclass (fancy code)))
  (when (> (length alltokensofsameclass) 1)
    ;; do stuff
    ))

to cut back on the amount of functional execution done before
deciding that, nope, there is nothing to do here. It could simplify
the inner loops as well.

> (defmacro define-derived-lex-type-analyzer (name analyzer &optional doc)
>   "Define a generic type analyzer with NAME, derived from ANALYZER.
> ANALYZER must be the name of a previously defined lexical analyzer.
> Optional argument DOC is the new analyzer doc string.
>
> The generic type analyzer NAME will filter tokens produced by
> ANALYZER, based on values found in the current table of lexical
> tokens for the type of tokens returned by ANALYZER, to return a
> more specific lexical token.
> For example, to detect the lexical tokens corresponding to these
> grammar declarations of keywords and symbols:
>
>   %token IF \"if\"             ; keyword 'if'
>   %token THEN \"then\"         ; keyword 'then'
>   %token <symbol> ID           ; default lexical symbol
>   %token <symbol> VAR \"^[$]\" ; variable names start with $
>
> Define a generic type analyzer derived from the basic analyzer
> `semantic-lex-symbol-or-keyword':
>
>   (define-derived-lex-type-analyzer semantic-lex-keyword-or-symbol-type
>     semantic-lex-symbol-or-keyword)
>
> From this sample input stream:
>
>   if $val then result = $val

Perhaps your example could also be:

  if $val then result += $val

as a way of adding a compound punctuation to the mix?

> It will automatically detect and return the following lexical
> tokens:
>
>   (IF 1 . 3)     ; the keyword IF
>   (VAR 4 . 8)    ; a dollar variable
>   (THEN 9 . 13)  ; the keyword THEN
>   (ID 14 . 20)   ; a generic identifier

Are you missing the (EQUAL 21 . 22) here?

>   (VAR 23 . 27)  ; a dollar variable"
>   (let ((code (symbol-value analyzer)))
>     `(define-lex-analyzer ,name
>        ,doc
>        ,(car code)
>        ,@(cdr code)
>        ,(semantic--lex-type-refinement-form)
>        )))

It appears that the refinement form runs after every token of a given
syntax class is found. I suspect that nearly all analyzers will
eventually do this, except perhaps whitespace and comments. Do you
think it would make sense to have the refinement occur at the end of
every pass through the lexical analyzer? Positioning it as such could
allow for some good heuristics for not running the refinement step.

I may be off a bit; I'm not sure I have a complete understanding yet.

Thanks!
Eric

--
Eric Ludlam: za...@gn..., er...@si...
Home: http://www.ludlam.net            Siege: www.siege-engine.com
Emacs: http://cedet.sourceforge.net    GNU: www.gnu.org
From: David P. <dav...@wa...> - 2003-12-12 10:24:36
Eric,

Here is the new implementation (not yet tested) of
`define-derived-lex-type-analyzer' that takes into account the
`matchdatatype' property of the token lexical type.

Of course, I would appreciate your feedback very much ;-)

David

(defun semantic--lex-type-refinement-form ()
  "Return a form to refine the type of the last token found.
At this point, the last token found is on top of the lexical stream.

Refinement is based on more specific token definitions provided in the
current lexical token table for the refined type.

If the value of the refined token matches any of the more specific
values, the corresponding specific token replaces the initial one
on top of the lexical stream.

When the `matchdatatype' property of the refined type is the symbol
`string', the refined token value is compared with `equal' to each
specific token value.  Otherwise `string-match' is used."
  (let* ((tok (make-symbol "tok"))
         (typ (make-symbol "typ"))
         (val (make-symbol "val"))
         (lst (make-symbol "lst"))
         (def (make-symbol "def"))
         (elt (make-symbol "elt"))
         (pos (make-symbol "pos"))
         (end (make-symbol "end"))
         (len (make-symbol "len")))
    `(let* ((,tok (car semantic-lex-token-stream))
            (,typ (semantic-lex-token-class ,tok))
            (,val (semantic-lex-token-text ,tok))
            (,lst (semantic-lex-type-value (symbol-name ,typ) t))
            (,def (car ,lst)) ;; default lexical token or nil
            (,lst (cdr ,lst)) ;; alist of (TOKEN . MATCH-STRING)
            ,elt)
       (when ,lst
         ;; Search for a matching lexical token
         (if (eq 'string (semantic-lex-type-get ,typ 'matchdatatype t))
             ;; Use string comparisons
             (let* ((,pos (semantic-lex-token-start ,tok))
                    (,end (semantic-lex-token-end ,tok))
                    (,len (- ,end ,pos)))
               ;; Starting with the longest one, search if a lexical
               ;; value matches a token defined for this language.
               (while (and (> ,len 0)
                           (not (setq ,elt (car (rassoc ,val ,lst)))))
                 (setq ,len (1- ,len)
                       ,val (substring ,val 0 ,len)))
               (when ,elt
                 ;; Adjust the stream and token end position
                 (setq semantic-lex-end-point (+ ,pos ,len))
                 ;;;; Probably it would be better to have an API to
                 ;;;; modify a lexical token by side effect.
                 (setcdr (semantic-lex-token-bounds ,tok)
                         semantic-lex-end-point)))
           ;; Use regexp match
           (while (and ,lst (not ,elt))
             (setq ,elt (and (string-match (cdar ,lst) ,val) (caar ,lst))
                   ,lst (cdr ,lst)))))
       ;; If not found, use a default lexical token if
       ;; provided, or the initial token type otherwise.
       ;;;; Probably it would be better to have an API to
       ;;;; modify a lexical token by side effect.
       (setcar ,tok (or ,elt ,def ,typ)))))

(defmacro define-derived-lex-type-analyzer (name analyzer &optional doc)
  "Define a generic type analyzer with NAME, derived from ANALYZER.
ANALYZER must be the name of a previously defined lexical analyzer.
Optional argument DOC is the new analyzer doc string.

The generic type analyzer NAME will filter tokens produced by
ANALYZER, based on values found in the current table of lexical tokens
for the type of tokens returned by ANALYZER, to return a more specific
lexical token.

For example, to detect the lexical tokens corresponding to these
grammar declarations of keywords and symbols:

  %token IF \"if\"             ; keyword 'if'
  %token THEN \"then\"         ; keyword 'then'
  %token <symbol> ID           ; default lexical symbol
  %token <symbol> VAR \"^[$]\" ; variable names start with $

Define a generic type analyzer derived from the basic analyzer
`semantic-lex-symbol-or-keyword':

  (define-derived-lex-type-analyzer semantic-lex-keyword-or-symbol-type
    semantic-lex-symbol-or-keyword)

From this sample input stream:

  if $val then result = $val

It will automatically detect and return the following lexical tokens:

  (IF 1 . 3)     ; the keyword IF
  (VAR 4 . 8)    ; a dollar variable
  (THEN 9 . 13)  ; the keyword THEN
  (ID 14 . 20)   ; a generic identifier
  (VAR 23 . 27)  ; a dollar variable"
  (let ((code (symbol-value analyzer)))
    `(define-lex-analyzer ,name
       ,doc
       ,(car code)
       ,@(cdr code)
       ,(semantic--lex-type-refinement-form)
       )))
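[Editor's note] The refinement step in the code above boils down to: take the last token, fetch its type's (default, alist) entry, then either shrink the lexeme until an exact string match is found (adjusting the token's end position) or try each regexp; finally fall back to the default token or the type itself. A Python sketch of that logic, with an invented token/table representation for illustration:

```python
import re

def refine(token, type_table):
    """Refine TOKEN (a list [type, text, start, end]) in place,
    mirroring semantic--lex-type-refinement-form: try the type's
    specific values, else the declared default, else the type itself.
    TYPE_TABLE maps a type name to (matchdatatype, default, alist)."""
    type_, text, start, end = token
    matchtype, default, alist = type_table[type_]
    found = None
    if matchtype == "string":
        # Starting with the longest prefix, look for an exact match,
        # shrinking the lexeme and adjusting the end position.
        length = end - start
        while length > 0 and found is None:
            for tok, value in alist:
                if text[:length] == value:
                    found = tok
                    break
            if found is None:
                length -= 1
        if found is not None:
            end = start + length
            text = text[:length]
    else:
        # Regexp mode: first pattern that matches wins.
        for tok, value in alist:
            if re.search(value, text):
                found = tok
                break
    token[:] = [found or default or type_, text, start, end]
    return token
```

Note how the string branch can shorten the token ("+=;" refines to PLUSEQ over just "+="), which is the role of the `setcdr' on the token bounds in the Lisp version.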
From: David P. <dav...@wa...> - 2003-12-15 11:31:37
Hi Eric,

[...]

>> What I propose is to introduce (and encourage the use of) a new
>> `%keyword' statement to declare language keywords:
>>
>> %keyword IF "if"
>
> What a grand idea!
>
>> It would be a simple alias of the form %token IF "if" (for
>> compatibility), but it would be far less ambiguous. And the
>> semantics of the %put statement would be clearer:
>>
>> %keyword IF "if"
>> %put IF property value
>>
>> %token <symbol> ID "[a-zA-Z0-9]+"
>> %put symbol property value
>>
>> By tweaking the metagrammar a little, it should even be possible
>> to allow less ambiguous forms, like:
>>
>> %put <symbol> matchdatatype regexp
>> %put { <punctuation> <open-paren> } matchdatatype string

I updated semantic-grammar.wy (and regenerated semantic-grammar-wy.el)
to introduce the new %keyword statement and allow <type> forms in %put
statements. If you have no objection, I will check in the changes.
Here is the change log:

2003-12-12  David Ponce  <da...@dp...>

	(NOT YET COMMITTED)

	* cedet/semantic/semantic-grammar.el
	(semantic-grammar-lex-symbol): Accept colon as symbol prefix.
	(semantic-grammar-anchored-indentation): Take care of colon used
	as symbol prefix.

	* cedet/semantic/semantic-grammar.wy
	Introduce new %keyword statement to declare keywords.
	Allow use of <token-type> forms in %put statements.
	(use_names): Add to start symbols.
	(KEYWORD): New keyword.
	(DEFAULT-PREC, NO-DEFAULT-PREC, KEYWORD, LANGUAGEMODE)
	(LEFT, NONASSOC, PACKAGE, PREC, PUT, QUOTEMODE, RIGHT)
	(SCOPESTART, START, TOKEN, USE-MACROS): Declare with %keyword
	instead of %token.
	(KEYWORDTABLE, OUTPUTFILE, PARSETABLE, SETUPFUNCTION)
	(TOKENTABLE): Remove.  Obsolete.
	(decl): Add keyword_decl rule.  Remove obsolete rules.
	(put_decl): Use put_name rule instead of SYMBOL.
	(put_names): Likewise.
	(put_name, keyword_decl, use_names): New rules.
	(use_name_list): New rule.
	(use_macros_decl): Use it.
	(keywordtable_decl, outputfile_decl, parsetable_decl)
	(setupfunction_decl, tokentable_decl): Remove.  Obsolete.
	* cedet/semantic/semantic-grammar-wy.el
	Re-generated.

> Even now I scratch my head. I feel the need to attempt a
> nomenclature clarification.

Good idea!

> token - Something produced by the lexer.

OK.

> metatoken - Something produced by compounding the output of the
> lexer, but not produced by the grammar.

`define-derived-lex-type-analyzer' doesn't really compound the
output of the lexer. It introduces a more subtle matching algorithm
(based on information provided in the grammar), to derive ONE
(probably syntax-class-oriented) token into another token. The true
added value of `define-derived-lex-type-analyzer' is to take
advantage of lexical declarations provided by the grammar.

For example, `semantic-lex-symbol-or-keyword' produces `symbol'
tokens from a stream of characters which are symbol constituents
(Emacs syntax classes \sw and \s_). It doesn't use grammar
declarations. Using `define-derived-lex-type-analyzer' brings
grammar information to `semantic-lex-symbol-or-keyword', so it can
analyze symbols in a more subtle manner. For example, it can
distinguish between general purpose identifiers and special ones
with a dollar prefix, based on the grammar:

%token <symbol> DOLLARID "^[$]"
%token <symbol> IDENTIFIER

In all cases, it remains possible to achieve the same result with
hand-made analyzers (this is what is done for now). The issue is that
the developer is then responsible (and so can fail) for keeping those
hand-made analyzers consistent with the grammar.

> keyword - A token made of symbol characters that represents an
> exact textual match.

OK.

> syntaxclass - A token produced by the lexer that represents a
> syntax class, such as <punctuation>.

IMO, this notion is more generally a token type or token category,
close to Bison's notion of a token data type. In certain cases token
types correspond to Emacs syntax classes (like punctuation, or
open/close-paren). But this is not required.
For example, the `semantic-list' type is a convenient token type that
gives a high level view of data between matching open/close-paren
characters.  There is no real correspondence between the
`semantic-list' type and an Emacs syntax class.

Maybe we could introduce the term `meta-class' to designate token
classes from which other token classes are derived.  For example,
`punctuation' is a token class (accessible via
`semantic-lex-token-class').  If other token classes like COMMA, EQEQ,
etc., are derived from it, `punctuation' naturally becomes a
meta-class ;-)  By extension, tokens in a `meta-class' would naturally
become `meta-tokens'.

> metakeyword - a token made of characters from a syntaxclass that is
> not a keyword, but is more specific than a syntax class.

See `meta-token' above ;-)  IMO, meta-keyword can be confusing because
such tokens are not related to keywords.

> matchdatatype - A description on how syntaxclass is matched against
> the raw data to produce a keyword or metakeyword

I would prefer: A token [meta-]class property that describes how a
[meta-]token value is matched against the raw data to produce a
derived token.

> property - A named value associated with a lexical token.
                                                    ^^^^^
                                                    token class

[...]

> It just looked a bit odd at first, but now I agree that the :colon
> based property names are fine.  It may be worth changing the summary
> property to :summary too, unless we want to differentiate parser
> functionality properties from application properties.

Isn't `summary' a lexer functionality (it gives a keyword a
description)?  Anyway, I think using a homogeneous notation is good
practice.  To avoid another "migration-ache" (there are already
`summary' and `javadoc', and perhaps others, which are widely used),
I propose to use `matchdatatype' without the colon prefix and continue
with that convention.

> We should do whatever makes it easiest for a new person to make their
> grammar work.
> I suspect new grammar writers are more interested in
> their grammar than in their lexical analyzer. ;)

You're certainly right ;-)

[...]

>>(defun semantic--lex-type-refinement-form ()
>>  "Return a form to refine the type of the last token found.
>>At this point, the last token found is on top of lexical stream.
>>
>>Refinement is based on more specific token definitions provided in the
>>current lexical token table for the refined type.
>>
>>If the value of the refined token matches any of the more specific
>>values, the corresponding specific token replaces the initial one
>>on top of the lexical stream.
>>
>>When the `matchdatatype' property of the refined type is the symbol
>>`string', the refined token value is compared with `equal' to each
>>specific token value.  Otherwise `string-match' is used."
>>  (let* ((tok (make-symbol "tok"))
>>         (typ (make-symbol "typ"))
>>         (val (make-symbol "val"))
>>         (lst (make-symbol "lst"))
>>         (def (make-symbol "def"))
>>         (elt (make-symbol "elt"))
>>         (pos (make-symbol "pos"))
>>         (end (make-symbol "end"))
>>         (len (make-symbol "len")))
>>    `(let* ((,tok (car semantic-lex-token-stream))
>>            (,typ (semantic-lex-token-class ,tok))
>>            (,val (semantic-lex-token-text ,tok))
>>            (,lst (semantic-lex-type-value (symbol-name ,typ) t))
>>            (,def (car ,lst)) ;; default lexical token or nil
>>            (,lst (cdr ,lst)) ;; alist of (TOKEN . MATCH-STRING)
>>            ,elt)
>>       (when ,lst
>>         ;; Search for a matching lexical token

> [ ... ]
>
> If I understand this code correctly, the goal is to take the token
> stream such as (in simplified form): ("=" "+" ...) and convert it
> into (PLUSEQUAL ...) or some such?

No, the goal is to take the last token read (on top of the token
stream) and to refine its class based on criteria from the grammar.

[...]

>>From this sample input stream:
>>
>> if $val then result = $val

> Perhaps your example could also be:
>
> if $val then result += $val
>
> as a way of adding a compound punctuation to the mix?
The doc string refers only to tokens produced by the
`semantic-lex-keyword-or-symbol-type' sample.  It doesn't handle
punctuation.  It would probably be worth adding a second example that
uses punctuation and the string `matchdatatype'.

>>It will automatically detect and return the following lexical tokens:
>>
>> (IF 1 . 3)     ; the keyword IF
>> (VAR 4 . 8)    ; a dollar variable
>> (THEN 9 . 13)  ; the keyword THEN
>> (ID 14 . 20)   ; a generic identifier

> Are you missing the (EQUAL 21 . 22) here?

See above ;-)

[...]

> It appears that the refinement form runs after every token of a given
> syntax class is found.  I suspect that nearly all analyzers will
> eventually do this except perhaps whitespace and comments.
>
> Do you think it would make sense to have a refinement occur at the
> end of every pass through the lexical analyzer?  Positioning it as
> such could allow for some good heuristics for not running the
> refinement step.

There is no need to do the refinement step for hand-written analyzers
written for speed reasons, or for those that handle things that can't
be specified with a regexp or string match
(`semantic-grammar-lex-epilogue' is a good example).  So I think it is
better that only "automatic" analyzers pay the extra cost of a
refinement step.

Thanks for all these good remarks!

David
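To make the refinement step discussed above more concrete, here is a rough, self-contained sketch of the matching logic. The function name and the (DEFAULT . ALIST) table layout are hypothetical illustrations of the idea, not the actual CEDET API:

```elisp
;; Hypothetical sketch of the token refinement step; `my-lex-refine-token'
;; and the TYPE-ENTRY layout are assumptions, not real CEDET functions.
(defun my-lex-refine-token (class text type-entry matchdatatype)
  "Refine a lexical token of CLASS whose value is TEXT.
TYPE-ENTRY is a cons (DEFAULT . ALIST), where ALIST maps specific
token classes to match strings and DEFAULT is the fallback token
class (or nil).  MATCHDATATYPE is `string' or `regexp'.
Return the refined token class, or CLASS when nothing applies."
  (let ((default (car type-entry))
        (alist (cdr type-entry))
        (case-fold-search nil)
        found)
    (while (and alist (not found))
      (when (if (eq matchdatatype 'string)
                ;; Exact comparison, cheap for punctuation.
                (equal (cdar alist) text)
              ;; Pattern comparison, needed for things like "^[$]".
              (string-match (cdar alist) text))
        (setq found (caar alist)))
      (setq alist (cdr alist)))
    (or found default class)))

;; Regexp refinement, as for: %token <symbol> DOLLARID "^[$]"
;; (my-lex-refine-token 'symbol "$val"
;;                      '(IDENTIFIER (DOLLARID . "^[$]")) 'regexp)
;;   => DOLLARID
;; String refinement, as for: %token <punctuation> EQ "="
;; (my-lex-refine-token 'punctuation "=" '(nil (EQ . "=")) 'string)
;;   => EQ
```

With a `string' matchdatatype, `equal' keeps exact lookups cheap; with `regexp', patterns can classify tokens such as dollar-prefixed variables.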
From: Eric M. L. <er...@si...> - 2003-12-15 22:47:32
Howdy,

>>> David PONCE <dav...@wa...> seems to think that:
>Hi Eric,
>
>[...]

[ ... ]

>Here is the change log:
>
>2003-12-12 David Ponce <da...@dp...> (NOT YET COMMITTED)
>
> * cedet/semantic/semantic-grammar.el
>
> (semantic-grammar-lex-symbol): Accept colon as symbol prefix.
> (semantic-grammar-anchored-indentation): Take care of colon used
> as symbol prefix.

Below you state that you were going to drop the : on the
matchdatatype property.  Is this still necessary?

Otherwise, I think it looks good.

[ ... ]

>> Even now I scratch my head.  I feel the need to attempt a
>> nomenclature clarification.
>
>Good idea!

Thanks for explaining!  I will save your message and try to get that
into the doc for lexical analysis.

Thanks
Eric

-- 
Eric Ludlam: za...@gn..., er...@si...
Home: http://www.ludlam.net  Siege: www.siege-engine.com
Emacs: http://cedet.sourceforge.net  GNU: www.gnu.org
From: David P. <dav...@wa...> - 2003-12-16 11:21:14
Hi Eric,

[...]

> Below you state that you were going to drop the : on the
> matchdatatype property.  Is this still necessary?

I think it is better for the grammar to correctly handle the colon
prefix, in case a developer wants to use it for his own "private"
properties.  We aren't compelled to use that convention ;-)

> Otherwise, I think it looks good.

OK.  I will check the changes in.

[...]

> Thanks for explaining!  I will save your message and try to get that
> into the doc for lexical analysis.

Thanks for urging me to explain ;-)  I appreciate having a break from
implementing things, and taking the time to clarify all these subtle
notions ;-)

I thought more about how to take better advantage of what is put in
the grammar to simplify the writing of lexical analyzers, and I wonder
if it would be worth exploring this new direction: directly generate
analyzers in the <language>-[wb]y.el file.  Thus, the developer would
have the opportunity either to use the generated analyzers, or to
implement his own.

The advantage would be a more efficient use of the existing lexical
API, without the need for a second analysis pass.

For example, we could imagine that these declarations in a foo.wy
grammar:

%token <symbol> DOLLARVAR "^[$]"
%token <symbol> OTHERVAR

%token <punctuation> EQ "="
%token <punctuation> NE "^="
%token <punctuation> GT ">"
%token <punctuation> GE ">="

would generate something like this in the foo-wy.el file:

(define-lex-regex-type-analyzer foo-wy--symbol-analyzer
  ;; regexp to grab symbol syntax
  "\\(\\sw\\|\\s_\\)+"
  ;; regexps to detect specific language symbols
  ((DOLLARVAR . "^[$]"))
  ;; Default token
  OTHERVAR
  "foo symbol regexp type analyzer.")

(define-lex-string-type-analyzer foo-wy--punctuation-analyzer
  ;; regexp to grab punctuation syntax
  "\\(\\s.\\|\\s$\\|\\s'\\)+"
  ;; strings to detect specific language punctuations
  '((EQ . "=")
    (NE . "^=")
    (GT . ">")
    (GE . ">="))
  ;; Default token
  'punctuation
  "foo punctuation string type analyzer.")

Using a Bison-like %type statement, we could give properties to a
<type> (and use them at generation time) like this:

%type <symbol> syntax "\\(\\sw\\|\\s_\\)+"
      matchdatatype regexp

%type <punctuation> syntax "\\(\\s.\\|\\s$\\|\\s'\\)+"
      matchdatatype string

Properties would give the syntax regexp to use to grab a sequence of
<type> characters, and the matchdatatype algorithm to use to match
specific tokens.  Other properties can be imagined for other
situations (block analysis, etc.).

For well-known <type>s, like <symbol>, <punctuation>, etc., we could
provide a default property list.  Overriding properties would be
achieved by merging the default property list and the one specified
by the %type statement.

Keywords would be handled specifically, using a built-in
`semantic-lex-keyword' analyzer that should be put before other symbol
analyzers in the lexer definition.  Consequently the %put statement
would be exclusively reserved for keyword properties.

To summarize:

- Keywords

  %keyword to define them (possibly using %token for compatibility)
  %put to assign properties

- Other tokens

  %token to define them.
  %type to assign properties

So, definitely fewer ambiguities, and more efficiency.
Oops!  And a lot of things to do ;-)

What do you think?

David
From: Eric M. L. <er...@si...> - 2003-12-16 19:42:59
Hi,

>>> David PONCE <dav...@wa...> seems to think that:
[ ... ]
>I thought more about how to take better advantage of what is put in
>the grammar to simplify the writing of lexical analyzers, and I
>wonder if it would be worth exploring this new direction: directly
>generate analyzers in the <language>-[wb]y.el file.  Thus, the
>developer would have the opportunity either to use the generated
>analyzers, or to implement his own.

This seems like a good idea.

>The advantage would be a more efficient use of the existing lexical
>API, without the need for a second analysis pass.

Ah, speed is good too.

>For example, we could imagine that these declarations in a foo.wy
>grammar:
>
>%token <symbol> DOLLARVAR "^[$]"
>%token <symbol> OTHERVAR
>
>%token <punctuation> EQ "="
>%token <punctuation> NE "^="
>%token <punctuation> GT ">"
>%token <punctuation> GE ">="
>
>would generate something like this in the foo-wy.el file:
>
>(define-lex-regex-type-analyzer foo-wy--symbol-analyzer
>  ;; regexp to grab symbol syntax
>  "\\(\\sw\\|\\s_\\)+"
>  ;; regexps to detect specific language symbols
>  ((DOLLARVAR . "^[$]"))
>  ;; Default token
>  OTHERVAR
>  "foo symbol regexp type analyzer.")
>
>(define-lex-string-type-analyzer foo-wy--punctuation-analyzer
>  ;; regexp to grab punctuation syntax
>  "\\(\\s.\\|\\s$\\|\\s'\\)+"
>  ;; strings to detect specific language punctuations
>  '((EQ . "=")
>    (NE . "^=")
>    (GT . ">")
>    (GE . ">="))
>  ;; Default token
>  'punctuation
>  "foo punctuation string type analyzer.")

Adding this would then allow you to remove the existing wisent-only
compounding mechanism for these symbols.  It seems unlikely anyone
would want to use any other mechanism for creating specific analyzers
of this nature.

Even so, it might be worth having a command in the grammar that
states:

%lex <punctuation> my-analyzer

or some such, in case of naming conflicts, though that could be
unlikely.  If someone doesn't want the auto-generated analyzer, they
could skip adding such a command.
>Using a Bison-like %type statement, we could give properties to a
><type> (and use them at generation time) like this:
>
>%type <symbol> syntax "\\(\\sw\\|\\s_\\)+"
>      matchdatatype regexp
>
>%type <punctuation> syntax "\\(\\s.\\|\\s$\\|\\s'\\)+"
>      matchdatatype string
>
>Properties would give the syntax regexp to use to grab a sequence of
><type> characters, and the matchdatatype algorithm to use to match
>specific tokens.  Other properties can be imagined for other
>situations (block analysis, etc.).

I like the idea of allowing a declaration of specific syntax regexps.
In C, I suppose I could have:

%type <ifdef> syntax "^#ifdef"
      matchdatatype string

too?

>For well-known <type>s, like <symbol>, <punctuation>, etc., we could
>provide a default property list.  Overriding properties would be

Having good defaults is important too.  The use of syntax tables will
make it unnecessary to specify a regexp most of the time.

>achieved by merging the default property list and the one specified
>by the %type statement.

I'm a little concerned about using the name "%type" for the command.
Users starting with a Bison background could be confused by this since
it is really just a fancy form of "%put".  Unfortunately I do not know
what a good alternative would be.  For example, I might expect:

%type <symbol> "[0-9]+" wholenump

or something like that.

>Keywords would be handled specifically, using a built-in
>`semantic-lex-keyword' analyzer that should be put before other
>symbol analyzers in the lexer definition.  Consequently the %put
>statement would be exclusively reserved for keyword properties.

Using:

%token THINGY "thingy"
%put THINGY summary "A useful thingy"

makes sense.  I wonder if:

%token <punctuation> EQUALEQUAL "=="
%put EQUALEQUAL summary "test for equivalence"

could also be useful iff we update things so eldoc can comment on ==.
Here are some other possibilities that make more sense with %put than
%type:

%put COMMA argumentseparator t
%put DOT typerelationseparator t

or, perhaps it would be better to have:

%set function-argument-separation-character COMMA

to declare variables in the lisp code.  Hmmm, perhaps it would be
better to leave that in the Lisp code only.

>To summarize:
>
>- Keywords
>
>  %keyword to define them (possibly using %token for compatibility)
>  %put to assign properties
>
>- Other tokens
>
>  %token to define them.
>  %type to assign properties
>
>So, definitely fewer ambiguities, and more efficiency.
>Oops!  And a lot of things to do ;-)
>
>What do you think?

[ ... ]

I think this sounds like a good idea.

Eric
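The default-property-list merging that the %type proposal describes could be sketched as follows. The names and the plist layout here are assumptions for illustration, not an existing API:

```elisp
;; Hypothetical sketch of merging built-in <type> properties with
;; those declared by a %type statement; all names are assumptions.
(defvar my-default-type-properties
  '((punctuation syntax "\\(\\s.\\|\\s$\\|\\s'\\)+" matchdatatype string)
    (symbol      syntax "\\(\\sw\\|\\s_\\)+"        matchdatatype regexp))
  "Built-in property lists for well-known token types.")

(defun my-merge-type-properties (type declared)
  "Merge the DECLARED plist over the defaults for TYPE.
Properties given in the grammar override the built-in ones."
  (let ((merged (copy-sequence
                 (cdr (assq type my-default-type-properties)))))
    (while declared
      (setq merged (plist-put merged (car declared) (cadr declared))
            declared (cddr declared)))
    merged))

;; A grammar override such as: %type <punctuation> matchdatatype regexp
;; (plist-get (my-merge-type-properties
;;             'punctuation '(matchdatatype regexp))
;;            'matchdatatype)
;;   => regexp
```

Copying the default list before `plist-put' keeps the built-in defaults intact when a grammar overrides a property.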
From: Eric M. L. <er...@si...> - 2003-12-11 00:24:01
>>> David PONCE <dav...@wa...> seems to think that: >Hi Eric, > >[...] >> That's a pretty interesting idea. Your extraction and use of the >> existing analyzer is quite clever. I had asked about the API layers >> in a previous email. It seems that the derived lexical analyzer is >> still a part of the core lexical API as opposed in some intermediate >> layer. That's probably fine. There seems to be a lot of lexical >> generated tables and code already. >> >> In your code: >> >> >>> ;; Search for a matching lexical token >>> (while (and ,lst (not ,elt)) >>> (setq ,elt (and (string-match (cdar ,lst) ,val) (caar ,lst)) >>> ,lst (cdr ,lst))) >> >> >> would an obarray or hash table be better? The keyword table is >> quite successful. I know that in your sample you are trying to match >> "^$" as VAR. That feature is important, but I think that explicit >> string matches is more common and could be made faster for the >> punctuation types. Something separate for symbols and lists may be in >> order. > >You're right. That's funny because I already implemented a similar >solution in the old `wisent-flex' lexer. Perhaps could we use the >same approach here. To distinguish between string and regexp matches, >`wisent-flex' used properties of symbols in the token table (which is >an obarray of the token type symbols). > >By default certain token types, like punctuation, were setup to use >string matches (this is the purpose of `wisent-lex-make-token-table' >compared to stock `semantic-lex-make-type-table', but it will be >easy to do that in `semantic-lex-make-type-table' and remove >`wisent-lex-make-token-table'). > >The advantage of that design is its simplicity, and especially that >it allows customization using grammar %PUT statements. That seems like a really good idea. Changing properties of lexical symbols is what the %put command is all about. 
>For example you could have:
>
>%token <punctuation> COMMA ","
>%token <punctuation> EQ "="
>
>By default it is assumed that there is an implicit
>
>%PUT punctuation string t
>
>which, for speed, indicates to recognize punctuation using string
>matches (a la `semantic-lex-punctuation-type').
>
>But you could also have something like this:
>
>%token <punctuation> COMPARATOR "[<>][=]?"
>%put punctuation string nil

Perhaps you mean:

%put COMPARATOR string nil

?

>that indicates to use regexp matches to recognize punctuation.

%put THING string t

seems good, but

%put THING string nil

does not say "regexp" to me.  Perhaps something like this:

%put THING lexicalcomparetype string

or

%put THING matchdatatype regexp

would be better?

>Depending on the `string' property of the token type symbol, it should
>be easy for `define-derived-lex-type-analyzer' to generate the ad-hoc
>match algorithm.

[ ... ]

>> Also, it appears this would not work for compound tokens like "=>"
>> as this analyzer would only work in character groups defined by the
>> originating analyzer.  Is this assumption true?
>
>I don't think so.  The "string matches" algorithm used in
>`semantic-lex-punctuation-type' is well suited to matching compound
>punctuation ;-)

[ ... ]

I thought the entire raw lexical stream was compounded by the
wisent-lex layer.  If you use the default punctuation analyzer, it
will only ever match a single character.  You would need to extend a
different punctuation system that knows to combine => but not other
symbols that make no sense, like >=.

I like the direction your proposed function is going.  Very nice.

Eric
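For reference, the obarray-based lookup suggested in this exchange (in the style of the semantic keyword table) might be sketched like this. The names are hypothetical illustrations, not the actual keyword-table API:

```elisp
;; Hypothetical sketch of an obarray-backed token table; the names
;; `my-token-obarray' etc. are assumptions, not real CEDET functions.
(defvar my-token-obarray (make-vector 13 0)
  "Obarray interning exact token match strings, e.g. \",\" or \"==\".")

(defun my-token-table-add (match token)
  "Associate the exact MATCH string with the lexical TOKEN class."
  (set (intern match my-token-obarray) token))

(defun my-token-table-lookup (text)
  "Return the token class for exact TEXT, or nil if none.
Interned-symbol lookup avoids walking an alist of match strings."
  (let ((sym (intern-soft text my-token-obarray)))
    (and sym (symbol-value sym))))

;; (my-token-table-add "," 'COMMA)
;; (my-token-table-add "==" 'EQEQ)
;; (my-token-table-lookup "==") => EQEQ
;; (my-token-table-lookup "!")  => nil
```

This covers exact string matches only; regexp-matched types such as COMPARATOR would still need a separate list scan, which is the string/regexp split the `matchdatatype' property encodes.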