Thread: [CEDET-devel] senator-next-tag: Buffer was not parsed by Semantic.
From: Oleg S. <ole...@gm...> - 2014-07-02 19:30:18
Hi list,

I'm getting this message:

  senator-next-tag: Buffer was not parsed by Semantic.

I thought I had compiled a simple grammar for a mode that I'm trying to test. Below is the mode initialization stuff:

  (define-derived-mode fmt-mode fundamental-mode
    "Common Lisp Format mode"
    "Major mode for highlighting of the Common Lisp format mini-language.
  This mode uses its own keymap:
  \\{fmt-mode-map}"
    (kill-all-local-variables)
    (setq major-mode 'fmt-mode)
    (use-local-map fmt-mode-map)
    (setf mode-name "Common Lisp Format")
    (run-hooks 'fmt-mode-hook)
    (semantic-mode 1))

Nothing fancy; I'm sure it reaches the (semantic-mode 1) call.

I have a fmt.wy file from which I can generate a fmt-wy.el which has the following:

  (defun fmt-wy--install-parser ()
    "Setup the Semantic Parser."
    (semantic-install-function-overrides
     '((parse-stream . wisent-parse-stream)))
    (setq semantic-parser-name "LALR"
          semantic--parse-table fmt-wy--parse-table
          semantic-debug-parser-source "fmt.wy"
          semantic-flex-keywords-obarray fmt-wy--keyword-table
          semantic-lex-types-obarray fmt-wy--token-table)
    ;; Collect unmatched syntax lexical tokens
    (semantic-make-local-hook 'wisent-discarding-token-functions)
    (add-hook 'wisent-discarding-token-functions
              'wisent-collect-unmatched-syntax nil t))

  (define-lex wisent-fmt-lexer
    "Lexical analyzer that handles Common Lisp format."
    semantic-lex-ignore-newline
    semantic-lex-ignore-comments
    semantic-lex-default-action)

  (provide 'fmt-wy)

I can require fmt-wy all right (it gives some warnings, but they don't seem to be important), but no parsing seems to be happening in the test file I'm trying to edit. What do I have to do besides what I've done?

Also, how would I debug reduce conflicts? Is there any way to make Semantic more verbose when reporting them? The report of having a reduce conflict is really like pointing a finger at the sky... unless it gives a hint about what terminals or rules are in conflict.

Lastly, sorry I put many issues together! Is there a way to create character classes, such as, for example, "any character but tilde"? Well, actually, negation would help my case too, but just for general knowledge I'd like, if possible, to know the answer to the character-classes question too!

Thanks,

Oleg
From: Left R. <ole...@gm...> - 2014-07-02 19:34:18
Sorry, I forgot to mention, my fmt.wy file has this:

  %languagemode fmt-mode

(I believe this should make Semantic use the parser in fmt-mode, shouldn't it?)

On Wed, Jul 2, 2014 at 10:28 PM, Oleg Sivokon <ole...@gm...> wrote:
> [...]
From: Eric M. L. <er...@si...> - 2014-07-03 11:25:33
On 07/02/2014 03:28 PM, Oleg Sivokon wrote:
> Hi list,
> I'm getting this message:
>
>   senator-next-tag: Buffer was not parsed by Semantic.
>
> [...]
>
> Nothing fancy, I'm sure it reaches the (semantic-mode 1) call.

Hi Oleg,

`semantic-mode' only needs to be called once when you start Emacs. To get your mode set up for parsing via Semantic you need to add your setup function to `semantic-new-buffer-setup-functions'. I suppose you could also just call your setup function directly from your mode if you wanted to, but then your mode would depend on Semantic directly.

For a fresh new mode, you would need 3 files:

  blah-mode.el - The standard Emacs mode for your mode.
  blah.wy & blah-wy.el - The parser and the generated file.
  semantic-blah.el or wisent-blah.el - The hand-written support code for the parser.

The support file will have your -setup function. The setup function will call your --install-parser function and set up any special variables needed when Semantic is active (such as which lexer to use, and any override variables such as how to convert tag classes into nice strings).

You could look at SRecode's template mode as an example. It has everything together; in that case there is:

  srt.wy
  srt-wy.el
  template.el - hand-written support file

> I can require fmt-wy all right (it gives some warnings, but they don't
> seem to be important), but no parsing seems to be happening in the test
> file I'm trying to edit. What do I have to do besides what I've done?
>
> Also, how would I debug reduce conflicts? Is there any way to make
> Semantic more verbose when reporting them? The report of having a reduce
> conflict is really like pointing a finger at the sky... unless you give a
> hint about what terminals or rules are in conflict.

Hopefully the message will help identify the problem. There is a short section in the 'wisent' doc on how to fix them. You could also check Bison's doc, as the technique is the same. Sadly it is more by code inspection than with a debugger.

> Lastly, sorry I put many issues together! Is there a way to create
> character classes, such as, for example "any character but tilde"?
> [...]

You will need to create a custom lex rule. That uses Emacs regex rules. Thus you could create "[^~]" for anything but tilde, or "[~]" for only tildes. Check the elisp manual for all the fun regexp rules.

Good Luck
Eric
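A minimal sketch of such a custom lex rule, following the "[^~]" suggestion above. The analyzer and token-class names (fmt-lex-filler-sketch, FILLER) are illustrative, not from the thread:

```elisp
;; `define-lex-simple-regex-analyzer' creates an analyzer that pushes a
;; token of the given class spanning each regexp match.
(define-lex-simple-regex-analyzer fmt-lex-filler-sketch
  "Match a run of characters that are not tildes."
  "[^~]+" 'FILLER)

;; The analyzer is then listed inside a `define-lex' form.  Order
;; matters: a catch-all like this should come last so more specific
;; analyzers get a chance to match first.
(define-lex wisent-fmt-lexer-sketch
  "Sketch of a lexer ending with a catch-all filler analyzer."
  semantic-lex-ignore-newline
  fmt-lex-filler-sketch
  semantic-lex-default-action)
```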
From: Eric M. L. <er...@si...> - 2014-07-12 01:40:20
Hi Oleg,

I'm not sure how to debug the fcns you posted below. I think they are ok. Since you appear to be defining your own mode, let me instead annotate how the parsing for "dot" works, which is found in these files:

  lisp/cedet/cogre/dot-mode.el
  lisp/cedet/cogre/wisent-dot.wy
  lisp/cedet/cogre/wisent-dot.el

and the generated file

  lisp/cedet/cogre/wisent-dot-wy.el

I picked this mode because it is pretty simple, just enough to get the layout code of COGRE working. It is also not installed by default in semantic-new-buffer-setup-functions.

Let's start in dot-mode.el:

Note the syntax table. This part is critical for the lexer to work. If you duplicated some other mode, you probably have one of these.

In cogre-dot-mode, which is named such to avoid conflict with other dot modes, note that it sets up comment-start and comment-start-skip - these are important for the lexer also.

Also note the hook running at the end.

Note the auto-mode-alist modification.

Lastly, note the mode-local-parent stuff. That is set up to make sure that cogre-dot-mode agrees with graphviz-dot-mode. You don't need anything like this if your mode is standalone.

Next is wisent-dot.wy.

At the beginning is the %languagemode setting that matches, in this case, the core graphviz mode, which I had to make optional. I think you did this correctly already.

At the end, after the %%, is a lexer definition. This uses a bunch of default stuff, plus lexers defined in the language for keywords, etc.

You can then compile this grammar into wisent-dot-wy.el. If you are in a compile/debug cycle, you then need to enter wisent-dot-wy.el and force-eval several tables with C-M-x, because the defvars carefully save old values, so just evaluating the buffer causes a no-op. :(

Last is the key piece: wisent-dot.el

Note that this pulls in wisent-dot-wy, plus wisent itself and any sources to functions you need to override.

The override for semantic-tag-components is important to implement if you have ANY tags that are compound, such as a class with fields, etc.

Note wisent-dot-setup-parser. It installs the parser using a function from wisent-dot-wy.el. That is how the parser gets pulled in.

It also sets up the lexer, extra syntax mods needed, and a few other random things such as command separators and how to convert your tag classes into text strings. On the whole, the first statement and the first 2 variables are the most important. The rest is optional.

Lastly are hooks to run the parser setup. These hooks can be replaced by adding the setup function to semantic-new-buffer-setup-functions. Feel free to start with the hook, and use the setup function when you want to make Semantic support optional with your mode.

If you already did all this, it could be that your parser is broken, or parser recompiles are not getting loaded in correctly. Fire up a new Emacs, load your code, and test it to avoid the recompile issue. If that helps, you need to hand-load variable changes from generated files.

Another good trick is to use semantic-show-parser-state-mode. This shows symbols in the mode line to tell you how the parser is doing. It will either refuse to start if the parser is not installed, or show % if the parser is broken or the buffer you are parsing is just not complete.

Another fun one is semantic-highlight-edits-mode, which shows how the buffer is edited and reparsed - helpful if the incremental parser is broken with your language parser.

Lastly, use semantic-show-unmatched-syntax-mode to see if the parser is just tagging your whole buffer as unparsable. If this happens, you need to work on your parser some more.

I hope this helps.
Eric
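The registration advice above can be put in code form. A sketch, reusing the fmt-mode example from earlier in the thread and assuming a hand-written setup function named wisent-fmt-setup-parser (both names are hypothetical):

```elisp
;; Register the setup function so Semantic installs the parser whenever
;; a fmt-mode buffer is created while `semantic-mode' is active.
(require 'semantic)
(add-to-list 'semantic-new-buffer-setup-functions
             '(fmt-mode . wisent-fmt-setup-parser))

;; The diagnostic minor modes mentioned above can be toggled per buffer:
;;   M-x semantic-show-parser-state-mode
;;   M-x semantic-highlight-edits-mode
;;   M-x semantic-show-unmatched-syntax-mode
```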
From: Left R. <ole...@gm...> - 2014-08-02 11:59:32
Hi Eric,

Sorry it took me so long to reply. I was finally able to at least get the dot-mode to work. The way I managed was by requiring:

  (require 'cogre/dot-mode)
  (require 'cogre/wisent-dot)
  (require 'cogre/wisent-dot-wy)

I also needed to update from the Semantic bundled with Emacs 24.3.50 to the one I pulled from VCS today; otherwise, as I discovered post factum, it was trying to use a different parser (LR(1) instead of LL). I'm not sure how this change came about, since the dot mode files didn't change across the versions. Yet when it was reading the grammar using the LR parser, it would run into shift/reduce conflicts.

I'm still struggling with my mode though, and, if you will be so kind, could you please explain a few things about the dot grammar?

  %type <punctuation> syntax "\\s.+"

I searched high and low, but I can't find an exhaustive reference to Emacs-style regexps, therefore I can't tell for sure what this regexp means, but I came to believe that it means a single "whitespace" character followed by whatever. I can't understand the meaning of this line, despite reading the documentation:

---- begin quote ----

 -- %-Decl: %type <type-name> [property1 value1 ...]

Explicitly declare a lexical type, and optionally give it properties.

type-name
    Is a symbol that identifies the type.
property
    Is a property name, a valid Emacs Lisp symbol.
value
    Is a property value, a valid Emacs Lisp constant expression.

Even if %token, %keyword, and precedence declarations can implicitly declare types, an explicit declaration is required for every type:

  - To assign it properties.
  - To auto-generate a lexical rule that detects tokens of this type.
    For more information, see Auto-generation of lexical rules.

---- end quote ----

What does this declaration do? This looks suspiciously similar to the entries in a syntax table, but then it doesn't make much sense, since Emacs has a different way to mark punctuation...

Second:

  %token <block> BRACKET_BLOCK "(LBRACKET RBRACKET)"

---- begin quote ----

The %token statement declares a terminal symbol (a token) which is not a keyword.

 -- %-Decl: %token [<type-name>] token-name match-value
 -- %-Decl: %token [<type-name>] token-name1 ...

Respectively declare one token with an optional type and a match value, or several tokens with the same optional type and no match value.

type-name
    Is an optional symbol, enclosed between < and >, that specifies (and implicitly declares) a type for this token (see type Decl). If omitted, the token has no type.
token-name
    Is the terminal symbol used in grammar rules to represent this token.
match-value
    Is an optional string. Depending on type-name properties, it will be interpreted as an ordinary string, a regular expression, or have a more elaborate meaning. If omitted, the match value will be nil, which means that this token will be considered as the default token of its type (see type Decl for more information).

---- end quote ----

The documentation speaks about some "more elaborate meaning". Can you tell me, please, what this meaning is? The two things inside the parentheses are other tokens which match literal brackets, but does this one match "[]" or "\\[[^\\]]+\\]"?

Third:

  ;;; Bland default types
  %type <symbol>
  %token <symbol> symbol

  %type <string>
  %token <string> string

  %type <number>
  %token <number> number

I understand what this is supposed to do, but I can't understand how it achieves that. Can you please interpret that in words? To me this looks like magic: how does the token `number' know how to match numbers?

PS. The links to the Bison documentation in the online version are broken (they point to www.randomsample.de instead of www.gnu.org).

Thanks,

Oleg

On Sat, Jul 12, 2014 at 4:40 AM, Eric M. Ludlam <er...@si...> wrote:
> [...]
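One point in the question above that the replies never address head-on: in Emacs regexps, `\s' followed by a class character matches by syntax class, as defined by the current syntax table. The class character `-' means whitespace, while `.' means punctuation, so "\\s.+" means "one or more characters whose syntax class is punctuation", not "a whitespace character followed by whatever". A quick illustration (the buffer contents are arbitrary):

```elisp
;; In a fundamental-mode temp buffer the standard syntax table is in
;; effect, where characters like "," and ";" have punctuation syntax
;; and letters are word constituents.
(with-temp-buffer
  (insert "abc,;def")
  (goto-char (point-min))
  (when (re-search-forward "\\s.+" nil t)
    (match-string 0)))  ; matches the punctuation run ",;"
```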
From: Left R. <ole...@gm...> - 2014-08-02 21:28:35
One more question. I'm trying to follow the inline code documentation, and here's something I came up with, but I have lots of questions about it:

  (define-lex-regex-analyzer fmt-lex-filler
    "Matches the filler in the format string."
    "[^~]+"
    (semantic-lex-push-token
     (semantic-lex-token
      'filler (match-beginning 0) (match-end 0))))

  (define-lex wisent-fmt-lexer
    "Lexical analyzer that handles Common Lisp format."
    fmt-lex-filler)

1. Using a regular expression in this analyzer is a really, really bad idea (the proper regexp is more than 300 characters long; this one is here just for illustration), but this complexity could easily be avoided if, instead of a regular expression, I could use a function that takes, say, a position in the buffer or something like that: is that even possible?

2. 'filler isn't a default kind of token. Is my guess correct that I can somehow refer to this kind in the grammar, similar to how %type <symbol> is defined, maybe? What would I need to do to make this possible?

Thanks!

Oleg

On Sat, Aug 2, 2014 at 2:59 PM, Left Right <ole...@gm...> wrote:
> [...]
From: Eric M. L. <er...@si...> - 2014-08-10 14:41:28
On 08/02/2014 05:28 PM, Left Right wrote:
> One more question. I'm trying to follow the inline code documentation,
> and here's something I came up with, but I have lots of questions
> about it:
>
> [...]
>
> 1. Using a regular expression in this analyzer is a really, really bad
> idea [...]: is that even possible?
>
> 2. 'filler isn't a default kind of token. Is my guess correct that I
> can somehow refer to this kind in the grammar, similar to how %type
> <symbol> is defined, maybe? What would I need to do to make this
> possible?

There is a default whitespace token you can create from your lexers. For example, the dot lexer starts with these:

  semantic-lex-ignore-whitespace
  semantic-lex-ignore-newline
  semantic-lex-ignore-comments

which is implemented like this:

  (define-lex-regex-analyzer semantic-lex-ignore-whitespace
    "Detect and skip over whitespace tokens."
    ;; catch whitespace when needed
    "\\s-+"
    ;; Skip over the detected whitespace, do not create a token for it.
    (setq semantic-lex-end-point (match-end 0)))

which means "go to the end of the match, and don't return a token". As you have in your lexer, you have to push the 'filler token to get it on the stack.

The reason you have to set the end point is that when you push a token, the lexer looks at the end of your token and moves there automatically, but if you don't push a token, you have to move it by hand.

Lexical analyzers are interesting in that, while a function is made for them, those functions aren't used. Instead they also have a value, and those values are concatenated together to create the master lexer function, like a big cond statement. The main lexer has logic it applies after each match is found, and that is where a bunch of the magic happens.

If you aren't trying to ignore your 'filler tokens, you will instead need a %token declaration for it, such as:

  %token filler

If you instead had

  %type <filler> syntax "[^~]+"

you wouldn't need to write your lexical analyzer at all, and one would be provided for you. (I think; I'm a little fuzzy on that one.)

Your filler lexer is OK if it is something you really need, but because it can match so much, you MUST put it at the END of your defined lexer. That way you will be able to match all your other expressions, and if nothing works, you call it filler.

Use:

  M-x semantic-lex-test RET

to see how it works, or

  M-x semantic-lex-debug RET

to watch your lexer run.

Good Luck
Eric
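For reference, the two suggestions above would look something like this in a fmt.wy grammar. This is only a sketch: the FILLER token name is hypothetical, and per the caveat above, whether the analyzer is auto-generated from the %type declaration should be verified against the grammar framework docs:

```
;; Declaring the type with a `syntax' property should let the grammar
;; framework auto-generate a matching lexical rule, so no hand-written
;; analyzer is needed:
%type  <filler> syntax "[^~]+"
%token <filler> FILLER
```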
From: Left R. <ole...@gm...> - 2014-08-16 22:12:29
Hi, and thanks for the thorough replies. I think I will need to go through them again, but so far I have one question, which may possibly spare me trying to internalize all of this material.

My last question wrt define-lex-regex-analyzer wasn't about how I could make a regular-expression-based lexer. The truth is: I don't want any regular expressions there; they are just not cut out for the task. So, let me rephrase it: can I ditch the whole mechanism of the lexer and replace it with brand new code which will handle the tokenization? What would I need to do to achieve this?

I don't have much experience with writing lexers; in fact, I only ever used cl-yacc, which simply leaves it to the programmer to implement a lexer and only requires a very simple interface: a function that accepts an input stream and returns a token.

The reason I'm asking: the tokenizer mechanism looks very complex to me, way more complex than what I need. Besides, it works in a very inconvenient way: if I could keep state between the calls to the lexer, it would make my life so much easier. I also don't want to depend on the syntax table and whatever bizarre rules Emacs uses to understand the syntax: this is not your fault, but I have an impression that many of these rules are there purely by accident; they are hard to discover and even harder to understand, because they don't match anything you might come to expect from a lexer / parser. It would be just a whole lot easier to start fresh than to try to glue together a bunch of jigsaw puzzle pieces clearly taken from different puzzles.

I can understand the motivation for someone who gets the syntax table and mode coloring for free from an existing mode and wants to reuse it in order to build the lexer. I don't have these preconditions, and even the little that I do have, I've written myself, and I'd rather give it up to make the overall process more consistent. I.e. I don't want to have to design font-lock rules, the syntax table, and the lexer separately: to me this would be like doing the same work twice, but both times using inappropriate tools.

Best,

Oleg

On Sun, Aug 10, 2014 at 5:41 PM, Eric M. Ludlam <er...@si...> wrote:
> [...]
As you > have in your lexer, you have to push the 'filler token to get it on the > stack. > > The reason you have to set the end point is because when you push a token, > it looks at the end of your token, and moves there automatically, but if you > don't push a token, you have to move it by hand. > > Lexical analyzers are interesting, in that while a function is made for > them, those functions aren't used. Instead they also have a value, and > those values are concatenated together to create the master lexer function. > Like a big cond statement. The main lexer has logic it applies after each > match is found, and that is where a bunch of the magic happens. > > If you aren't trying to ignore your 'filler tokens, you will instead need a > %token declaration for it, such as: > > %token filler > > If you instead had > > %type<filler> syntax "[^~]+" > > you wouldn't need to write your lexical analyzer at all and one would be > provided for you. (I think, I'm a little fuzzy on that one.) > > Your filler lexer is OK if it is something you really need, but because it > can match so much, you MUST put it at the END of your defined lexer. That > way you will be able to match all your other expressions, and if nothing > works, you call it filler. > > use: > > M-x semantic-lex-test RET > > to see how it works, or > > M-x semantic-lex-debug RET > > to watch your lexer run. > > Good Luck > Eric |
From: Left R. <ole...@gm...> - 2014-08-16 22:52:16
|
Just to give you a sense of what I /don't/ want to have in my code (below is my own code, so I'm allowed to say that it's unmaintainable cuneiform):

  (defvar fmt-font-lock-keywords
    ;; no-args
    `(("~\\(@:?\\|:@?\\)?[]>()}aswvcp;_]"
       (0 font-lock-keyword-face))
      ;; numeric-arg
      ("~\\([0-9]*\\|#,?\\)\\(@:?\\|:@?\\)?[i*%&|~{[]"
       (0 font-lock-keyword-face))
      ;; decimal
      ("~\\([0-9]*\\|#\\(,[0-9]*\\|#\\)\\{0,3\\}\\)?\\(@:?\\|:@?\\)?[rdbox]"
       (0 font-lock-keyword-face))
      ;; floating-point f
      (,(concat
         "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{3\\}\\(,'\\w\\)\\{1,2\\}\\)\\|"
         "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)\\)"
         "?\\(@:?\\|:@?\\)?f")
       (0 font-lock-keyword-face))
      ;; floating-point e, g
      (,(concat
         "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{3\\}\\(,'\\w\\)\\{1,3\\}\\)\\|"
         "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)\\)"
         "?\\(@:?\\|:@?\\)?[eg]")
       (0 font-lock-keyword-face))
      ;; currency
      (,(concat
         "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{2\\}\\(,'\\w\\)\\)\\|"
         "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,2\\}\\)\\)"
         "?\\(@:?\\|:@?\\)?[$]")
       (0 font-lock-keyword-face))
      ;; tabulation
      ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)?\\)?\\(@:?\\|:@?\\)?t"
       (0 font-lock-keyword-face))
      ;; escape
      ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,2\\}\\)?\\(@:?\\|:@?\\)?^"
       (0 font-lock-keyword-face))
      ;; logical block
      ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)?\\(@:?\\|:@?\\)?<"
       (0 font-lock-keyword-face))
      ;; custom function
      (,(concat
         "~\\(\\([0-9]+\\|'\\w\\|#\\)\\(,\\([0-9]+\\|'\\w\\|#\\)+\\)*\\)?"
         "\\(@:?\\|:@?\\)?\\/[^\\s\\n,#@]+\\/")
       (0 font-lock-keyword-face))))

This is my previous version of font-lock coloring. I don't expect you to read through it, but just to make the point even more obvious: this is actually a single regular expression, which I chopped into pieces for "ease" of use. A lexer based on regexps would need this mess concatenated into a single expression. Maybe it can be simplified, but not by much.
The corresponding parsing function, which doesn't use regular expressions, would be somewhere between 1/3 and 1/2 of the above code, and it would be perfectly understandable. This is similar to email parsing: you can do it with a regular grammar in principle, but there is no good way to do it in practice.

I later found that I can provide a function to font-lock to replace this mess, and I would be happy if there were a way to do the same in place of the Semantic lexer.

Best,

Oleg
|
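For reference, the function-based font-lock matcher Oleg mentions works like this: the MATCHER slot in `font-lock-keywords` may be a function instead of a regexp. Font-lock calls it with one argument, the search limit; it must behave like `re-search-forward` — move point past the match, set the match data, and return non-nil on success. A sketch (the function name and its deliberately simplistic scan are hypothetical, not Oleg's actual code):

```elisp
;; Sketch: a function matcher for font-lock. Real code would scan the
;; directive's prefix arguments and directive character here, instead
;; of using a giant regexp.
(defun fmt--match-directive (limit)
  "Find the next format directive before LIMIT.
Move point past it, set the match data, and return non-nil,
following the `re-search-forward' contract font-lock expects."
  (when (re-search-forward "~" limit t)
    (let ((start (match-beginning 0)))
      ;; Consume one directive character, staying within LIMIT.
      (when (< (point) limit)
        (forward-char 1))
      (set-match-data (list start (point)))
      t)))

(defvar fmt-font-lock-keywords
  '((fmt--match-directive (0 font-lock-keyword-face))))
```

Because the matcher is an ordinary function, it can keep whatever state it likes between characters, which is exactly what the big regexp could not do.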
From: Eric M. L. <er...@si...> - 2014-08-10 14:26:08
|
On 08/02/2014 07:59 AM, Left Right wrote:
> Hi Eric,
>
> Sorry it took me so long to reply. I was finally able to at least get
> the dot-mode to work. The way I managed was by requiring:
>
>   (require 'cogre/dot-mode)
>   (require 'cogre/wisent-dot)
>   (require 'cogre/wisent-dot-wy)
>
> I also needed to update from the Semantic bundled with Emacs 24.3.50 to
> the one I pulled from VCS today; otherwise, as I discovered post
> factum, it was trying to use a different parser (LR(1) instead of LL).
> I'm not sure how this change came about, since the dot mode files
> didn't change across the versions. Yet when it was reading the grammar
> using the LR parser, it would run into shift/reduce conflicts.

Hi Oleg,

My setup for CEDET in my .emacs is basically the same as in the INSTALL file with the version of CEDET from BZR, and that will load up .dot files just fine. It is surprising to me that you need all the extra loads. Perhaps the build didn't create the autoload files for you?

> I'm still struggling with my mode though, and, if you will be so kind,
> could you please explain a few things about the dot grammar?
>
>   %type<punctuation> syntax "\\s.+"

In this case \s means "match a syntax type", and the "." means the syntax code for punctuation. The \\ is quoting in one slash. Here's a doc snippet:

  `\sCODE' matches any character whose syntax is CODE. Here CODE is a
  character that represents a syntax code: thus, `w' for word
  constituent, `-' for whitespace, `(' for open parenthesis, etc. To
  represent whitespace syntax, use either `-' or a space character.
  *Note Syntax Class Table::, for a list of syntax codes and the
  characters that stand for them.

So the whole statement is: "Create lexical tokens of type punctuation that match the regular expression for punctuation from the Emacs syntax table." In other words, it is a statement translating from Emacs speak to lexer speak.

> I searched high and low, but I can't find an exhaustive reference to
> Emacs-style regexps, therefore I can't tell for sure what this regexp
> means: but I came to believe that it means a single "whitespace"
> character followed by whatever. I can't understand the meaning of this
> line, despite reading the documentation:

There is a doc node in the "Elisp" manual called "Syntax of Regular Expressions" that I use.

> ---- begin quote ----
>
> — %-Decl: %type<type-name> [property1 value1 ...]
>
> Explicitly declare a lexical type, and optionally give it properties.
>
> type-name  Is a symbol that identifies the type.

This would be a type for the lexer.

> property  Is a property name, a valid Emacs Lisp symbol.
> value  Is a property value, a valid Emacs Lisp constant expression.

This lets you specify that the syntax (the property) matches some regexp. If you leave it blank there are some handy defaults.

> Even if %token, %keyword, and precedence declarations can implicitly
> declare types, an explicit declaration is required for every type:
>
>   - To assign it properties.
>   - To auto-generate a lexical rule that detects tokens of this type.
>     For more information, see Auto-generation of lexical rules.
>
> ---- end quote ----
>
> What does this declaration do? This looks suspiciously similar to the
> entries in the syntax table, but then it doesn't make much sense, since
> Emacs has a different way to mark punctuation...
>
> Second:
>
>   %token<block> BRACKET_BLOCK "(LBRACKET RBRACKET)"
>
> ---- begin quote ----

Once you have a lexical %type you can create %tokens that are more specific. For example you might say

  %type <punctuation> syntax "\\s."

to match a single punctuation character, and then say

  %token <punctuation> PLUS "+"

to create a token called PLUS that you can use in your grammar. This two-step process lets the lexer quickly find your punctuation, and then convert generic punctuation into handy named tokens for use in your grammar.
<block> tokens are special in that the Emacs syntax table supports block concepts, and we use blocks to speed up grammar parsing. While unusual in grammars, it lets us parse buffers more quickly by skipping over large chunks of text. Thus the combination of:

  %type  <block>
  %token <block> BRACKET_BLOCK "(LBRACKET RBRACKET)"
  %token <open-paren>  LBRACKET "["
  %token <close-paren> RBRACKET "]"

says: "I have a %type in my lexer called block. I can create a <block> token that is composed of LBRACKET and RBRACKET. I have an <open-paren> lexical type called LBRACKET which matches [." Then the lexer has a special 'depth' parameter, and if it is set to 0, the lexer will return BRACKET_BLOCK.

In the match of BRACKET_BLOCK you can expand, and then get the LBRACKET token, like this:

  graphgeneric
    : GRAPH BRACKET_BLOCK SEMI
      (TAG "GRAPH" 'generic-graph
           :attributes (EXPANDFULL $2 attribute-block))
    ;

where EXPANDFULL on $2 says "run this grammar again on the buffer contents inside $2 (the BRACKET_BLOCK), starting with the grammar symbol attribute-block". The lexer will be run on that block with a depth of 1, forcing it to look inside the parens (or brackets).

  attribute-block
    : LBRACKET
      ()
    | RBRACKET
      ()
    | COMMA
      ()
    ;; This is a catch-all in case we miss some keyword.
    | symbol EQUAL name
      (TAG $1 'attribute :value $3)
    ;

So now in the bracket block, we match the brackets, commas, etc., and just start creating tags for each attribute name found. This set of nested tags needs to be matched outside of the dot grammar with a function for expanding tags. In wisent-dot.el you will find semantic-tag-components, which matches 'generic-graph from the first rule and returns the :attributes, which is the list of tags created with attribute-block.

> The %token statement declares a terminal symbol (a token) which is not
> a keyword.
>
> — %-Decl: %token [<type-name>] token-name match-value
> — %-Decl: %token [<type-name>] token-name1 ...
>
> Respectively declare one token with an optional type and a match
> value, or several tokens with the optional same type and no match
> value.
>
> type-name  Is an optional symbol, enclosed between < and >, that
>   specifies (and implicitly declares) a type for this token (see type
>   Decl). If omitted, the token has no type.
> token-name  Is the terminal symbol used in grammar rules to represent
>   this token.
> match-value  Is an optional string. Depending on type-name properties,
>   it will be interpreted as an ordinary string, a regular expression,
>   or have a more elaborate meaning. If omitted, the match value will
>   be nil, which means that this token will be considered the default
>   token of its type (see type Decl for more information).
>
> ---- end quote ----
>
> The documentation speaks about some "more elaborate meaning". Can you
> tell, please, what this meaning is? The two things inside the
> parentheses are other tokens which match literal brackets, but does
> this one match "[]" or "\\[[^\\]]+\\]"?

You are now pushing the boundary of what I am familiar with, as I didn't develop most of this system. Perhaps my earlier examples helped?

> Third:
>
>   ;;; Bland default types
>   %type  <symbol>
>   %token <symbol> symbol
>
>   %type  <string>
>   %token <string> string
>
>   %type  <number>
>   %token <number> number
>
> I understand what this is supposed to do, but I can't understand how
> it achieves that. Can you, please, interpret that in words? To me this
> looks like magic: how does the token `number' know how to match
> numbers?

In the wisent-dot example, the grammar code:

  ;;; Bland default types
  %type  <symbol>
  %token <symbol> symbol

is matched with:

  (define-lex wisent-dot-lexer
    "Lexical analyzer that handles DOT buffers.
  It ignores whitespace, newlines and comments."
    ...
    wisent-dot-wy--<symbol>-regexp-analyzer

and there is code generated like this in wisent-dot-wy.el:

  (define-lex-regex-type-analyzer wisent-dot-wy--<symbol>-regexp-analyzer
    "regexp analyzer for <symbol> tokens."
    "\\(\\sw\\|\\s_\\)+"
    nil
    'symbol)

So basically, there are default regular expressions for many types, like symbol, that will auto-generate lexer pieces. You still need to assemble your lexer by ordering the pieces from most specific to most generic.

This is mostly derived from the fact that Emacs has a built-in lexer-like thing created using syntax tables. A good major mode defines a good syntax table, and then the lexer can be very simple, basically matching up syntax types via \\s to the lexical types needed by the parser. You can then overlay more specific token types on top of those. By using the syntax table, the Semantic lexer takes advantage of the Emacs scanners built in C, and can go very fast.

I hope that helps.
Eric
|
From: Eric M. L. <er...@si...> - 2014-08-17 13:07:55
|
Hi Oleg,

Replacing the lexer is pretty easy. In most languages (I'll use dot again) there is a line that looks like this:

  (setq
   ;; Lexical Analysis
   semantic-lex-analyzer 'wisent-dot-lexer
   ...

which was created like this:

  (define-lex wisent-dot-lexer
    "Lexical analyzer that handles DOT buffers.
  It ignores whitespace, newlines and comments."
    semantic-lex-ignore-whitespace
    ...

If you follow the doc trail, you end up with this:

  -----------
  semantic-lex is an autoloaded compiled Lisp function in `lex.el'.

  (semantic-lex START END &optional DEPTH LENGTH)

  Lexically analyze text in the current buffer between START and END.
  Optional argument DEPTH indicates at what level to scan over entire
  lists. The last argument, LENGTH, specifies that `semantic-lex'
  should only return LENGTH tokens. The return value is a token
  stream. Each element is a list of the form
    (symbol start-expression . end-expression)
  where SYMBOL denotes the token type. See `semantic-lex-tokens'
  variable for details on token types. END does not mark the end of
  the text scanned, only the end of the beginning of text scanned.
  Thus, if a string extends past END, the end of the returned token
  will be larger than END. To truly restrict scanning, use
  `narrow-to-region'.
  ----------

So this function will parse an entire buffer and return all the lexical tokens for it. You can put anything you want in there, and return tokens with any old SYMBOL you want, too.

Semantic's first lexer (see semantic-flex) is an example of a different standalone lexer. It was only after struggling with that for a while that the mechanism for making custom lexers came up. It is partly modeled after lex/flex, where regexps are associated with actions.

If regexps don't make sense for your language, then rolling your own is no problem. You still need to add %token expressions in your grammar to let the grammar know what is going on, though. You just don't need to specify all the regexps along the way, or use the automatically generated lexers.
Your big scary regexp below is not really necessary for writing lexers using the lexing technique I described last time, though. Each expression only needs to be as long as the small piece you are looking at (i.e., a number or symbol). If you find that your language has a lexical token whose type is based on previous lexical tokens, then you are right that the built-in lexer is probably insufficient.

The C parser has examples that parse out things like:

  #include <foo.h>

and

  #if SOMESYMBOL
  #endif

that way, but it gets pretty hairy.

Eric
|
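Following the doc trail Eric describes, a hand-rolled replacement lexer is just a function with the `semantic-lex' calling convention that returns a list of (SYMBOL START . END) tokens. Below is a minimal sketch of what that could look like for the fmt language; the function name, the token names ('directive, 'filler), and the trivial scanning logic are all made up for illustration, and the token names would have to match %token declarations in the grammar.

```elisp
;; Sketch of a standalone lexer that bypasses define-lex entirely.
;; It can keep any state it likes between characters, since it is
;; plain Lisp code scanning the buffer.
(defun fmt-hand-lexer (start end &optional depth length)
  "Tokenize Common Lisp format text between START and END by hand.
Return a token stream of (SYMBOL START . END) elements, the shape
documented for `semantic-lex'."
  (ignore depth length)                  ; unused in this sketch
  (let ((tokens nil))
    (save-excursion
      (goto-char start)
      (while (< (point) end)
        (let ((tok-start (point)))
          (if (eq (char-after) ?~)
              (progn
                ;; A directive: scan it character by character. Real
                ;; code would parse prefix args and the directive
                ;; character here, statefully, with no regexps.
                (forward-char 1)
                (when (< (point) end)
                  (forward-char 1))      ; consume the directive char
                (push (cons 'directive (cons tok-start (point)))
                      tokens))
            ;; Everything up to the next tilde is filler.
            (skip-chars-forward "^~" end)
            (push (cons 'filler (cons tok-start (point)))
                  tokens)))))
    (nreverse tokens)))

;; Installed in place of the generated lexer in the mode setup:
;; (setq semantic-lex-analyzer #'fmt-hand-lexer)
```

This matches the cl-yacc-style interface Oleg asked for, except that it returns the whole token stream at once rather than one token per call, which is what Semantic expects.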
From: Left R. <ole...@gm...> - 2014-08-17 14:52:33
|
Thanks, I'll try it this weekend. Re' > Your big scary regexp below is not really necessary for writing lexers using the lexing technique I described last time though. Each expression only needs to be as long as the small piece you are looking at (ie - a number or symbol). If you find that your language has a lexical token whose type is based on previous lexical tokens, then you are right that the built in lexer is probably insufficient. Lo and behold, that huge regexp matches things like "~^", "~10,20,'xf", "~#[" and so on. Most of which are two or three characters long! It's long not because the text it matches is long, it's long because it's difficult to match the text precisely (You probably saw this already, but if not, you might enjoy a good laugh: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html ). Best, Oleg On Sun, Aug 17, 2014 at 4:07 PM, Eric M. Ludlam <er...@si...> wrote: > Hi Oleg, > > Replacing the lexer is pretty easy. In most languages (I'll use dot again) > there is a line that looks like this: > > (setq > ;; Lexical Analysis > semantic-lex-analyzer 'wisent-dot-lexer > ... > > which was created like this: > > > (define-lex wisent-dot-lexer > "Lexical analyzer that handles DOT buffers. > It ignores whitespace, newlines and comments." > semantic-lex-ignore-whitespace > ... > > If you follow the doc trail, you end up with this: > > ----------- > semantic-lex is an autoloaded compiled Lisp function in `lex.el'. > > (semantic-lex START END &optional DEPTH LENGTH) > > Lexically analyze text in the current buffer between START and END. > Optional argument DEPTH indicates at what level to scan over entire > lists. The last argument, LENGTH specifies that `semantic-lex' > should only return LENGTH tokens. The return value is a token stream. > Each element is a list, such of the form > (symbol start-expression . end-expression) > where SYMBOL denotes the token type. > See `semantic-lex-tokens' variable for details on token types. 
> END does not mark the end of the text scanned, only the end of the
> beginning of text scanned. Thus, if a string extends past END, the
> end of the returned token will be larger than END. To truly restrict
> scanning, use `narrow-to-region'.
> ----------
>
> So this function will parse an entire buffer and return all the
> lexical tokens for it.
>
> You can put anything you want in there, and return tokens with any
> old SYMBOL you want too.
>
> Semantic's first lexer (see semantic-flex) is an example of a
> different standalone lexer. It was only after struggling with that
> for a while that the mechanism for making custom lexers came up. It
> is partly modeled after lex/flex, where regexps are associated with
> actions.
>
> If regexps don't make sense for your language, then rolling your own
> is no problem. You still need %token declarations in your grammar to
> let the grammar know what is going on, though. You just don't need to
> specify all the regexps along the way, or use the automatically
> generated lexers.
>
> Your big scary regexp below is not really necessary for writing
> lexers using the lexing technique I described last time, though. Each
> expression only needs to be as long as the small piece you are
> looking at (i.e., a number or symbol). If you find that your language
> has a lexical token whose type is based on previous lexical tokens,
> then you are right that the built-in lexer is probably insufficient.
>
> The C parser has examples that parse out things like:
>
>   #include <foo.h>
>
> and
>
>   #if SOMESYMBOL
>   #endif
>
> that way, but it gets pretty hairy.
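Eric's suggestion to roll your own standalone lexer can be sketched as follows. This is a hypothetical, untested sketch: the function name `fmt-lex-buffer` and the token symbols `DIRECTIVE` and `filler` are invented for illustration, assuming only the `semantic-lex` calling convention quoted above.

```elisp
;; Hypothetical sketch of a hand-rolled standalone lexer, following the
;; semantic-lex calling convention: scan between START and END and
;; return a stream of (SYMBOL START . END) tokens.
(require 'semantic/lex)

(defun fmt-lex-buffer (start end &optional depth length)
  "Lexically analyze the region START..END and return a token stream.
DEPTH is ignored.  LENGTH, if non-nil, caps the number of tokens."
  (save-excursion
    (goto-char start)
    (let ((tokens nil))
      (while (and (< (point) end)
                  (or (null length) (< (length tokens) length)))
        (cond
         ;; A tilde plus one character is (crudely) a directive.
         ((looking-at "~.")
          (push (cons 'DIRECTIVE (cons (match-beginning 0) (match-end 0)))
                tokens)
          (goto-char (match-end 0)))
         ;; Everything up to the next tilde is filler text.
         ((looking-at "[^~]+")
          (push (cons 'filler (cons (match-beginning 0) (match-end 0)))
                tokens)
          (goto-char (match-end 0)))
         ;; Never loop forever on a lone trailing tilde.
         (t (forward-char 1))))
      (nreverse tokens))))

;; Install it for the mode's buffers:
;; (setq-local semantic-lex-analyzer #'fmt-lex-buffer)
```

Each token symbol used here would still need a matching %token declaration in the grammar, as Eric notes above.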
>
> Eric
>
> On 08/16/2014 06:52 PM, Left Right wrote:
>> Just to give you a sense of what I /don't/ want to have in my code
>> (below is my own code, so I'm allowed to say that it's an
>> unmaintainable cuneiform):
>>
>> (defvar fmt-font-lock-keywords
>>   ;; no-args
>>   `(("~\\(@:?\\|:@?\\)?[]>()}aswvcp;_]"
>>      (0 font-lock-keyword-face))
>>     ;; numeric-arg
>>     ("~\\([0-9]*\\|#,?\\)\\(@:?\\|:@?\\)?[i*%&|~{[]"
>>      (0 font-lock-keyword-face))
>>     ;; decimal
>>     ("~\\([0-9]*\\|#\\(,[0-9]*\\|#\\)\\{0,3\\}\\)?\\(@:?\\|:@?\\)?[rdbox]"
>>      (0 font-lock-keyword-face))
>>     ;; floating-point f
>>     (,(concat
>>        "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{3\\}\\(,'\\w\\)\\{1,2\\}\\)\\|"
>>        "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)\\)"
>>        "?\\(@:?\\|:@?\\)?f")
>>      (0 font-lock-keyword-face))
>>     ;; floating-point e, g
>>     (,(concat
>>        "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{3\\}\\(,'\\w\\)\\{1,3\\}\\)\\|"
>>        "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)\\)"
>>        "?\\(@:?\\|:@?\\)?[eg]")
>>      (0 font-lock-keyword-face))
>>     ;; currency
>>     (,(concat
>>        "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{2\\}\\(,'\\w\\)\\)\\|"
>>        "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,2\\}\\)\\)"
>>        "?\\(@:?\\|:@?\\)?[$]")
>>      (0 font-lock-keyword-face))
>>     ;; tabulation
>>     ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)?\\)?\\(@:?\\|:@?\\)?t"
>>      (0 font-lock-keyword-face))
>>     ;; escape
>>     ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,2\\}\\)?\\(@:?\\|:@?\\)?^"
>>      (0 font-lock-keyword-face))
>>     ;; logical block
>>     ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)?\\(@:?\\|:@?\\)?<"
>>      (0 font-lock-keyword-face))
>>     ;; custom function
>>     (,(concat
>>        "~\\(\\([0-9]+\\|'\\w\\|#\\)\\(,\\([0-9]+\\|'\\w\\|#\\)+\\)*\\)?"
>>        "\\(@:?\\|:@?\\)?\\/[^\\s\\n,#@]+\\/")
>>      (0 font-lock-keyword-face))))
>>
>> This is my previous version of font-lock coloring.
>> I don't expect you to read through it, but just to make the point
>> even more obvious: this is actually a single regular expression,
>> which I chopped into pieces for "ease" of use. A lexer based on
>> regexps would need this mess concatenated into a single expression.
>> Maybe it can be simplified, but not by much. The corresponding
>> parsing function, which doesn't use regular expressions, would be
>> somewhere between 1/3 and 1/2 of the above code, and it would be
>> perfectly understandable. This case is similar to email parsing: you
>> can do it with a regular grammar in principle, but there is no good
>> way to do it in practice.
>>
>> I later found that I can provide a function to font-lock to replace
>> this mess, and I would be happy if there were a way to do the same
>> in place of the Semantic lexer.
>>
>> Best,
>>
>> Oleg
>>
>> On Sun, Aug 17, 2014 at 1:12 AM, Left Right <ole...@gm...> wrote:
>>> Hi, and thanks for the thorough replies. I think I will need to go
>>> through them again, but so far I have one question, which may spare
>>> me trying to internalize all of this material.
>>>
>>> My last question w.r.t. define-lex-regex wasn't about how I could
>>> make a regular-expression-based lexer. The truth is: I don't want
>>> any regular expressions there; they are just not cut out for the
>>> task. So, let me rephrase it: can I ditch the whole lexer mechanism
>>> and replace it with brand-new code that handles the tokenization?
>>> What would I need to do to achieve this? I don't have much
>>> experience with writing lexers; in fact, I have only ever used
>>> cl-yacc, which simply leaves it to the programmer to implement a
>>> lexer and only requires a very simple interface: a function that
>>> accepts an input stream and returns a token.
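The interface asked about above — a plain function instead of a regexp — maps onto Semantic's `define-lex-analyzer`, whose match condition is an arbitrary Lisp form evaluated at point. A hedged sketch (the analyzer name `fmt-lex-filler-fn` and its scanning logic are invented for illustration, not code from this thread):

```elisp
(require 'semantic/lex)

;; Function-based analyzer: the third argument is a CONDITION form,
;; not a regexp, so the match can be computed with ordinary Lisp code.
(define-lex-analyzer fmt-lex-filler-fn
  "Match a run of filler text (everything up to the next tilde)."
  ;; CONDITION: point is not at end of buffer and not at a tilde.
  (and (not (eobp)) (not (eq (char-after) ?~)))
  ;; FORMS: scan forward by hand, push the token, and tell the master
  ;; lexer where the token ends.
  (let ((start (point)))
    (skip-chars-forward "^~")
    (semantic-lex-push-token
     (semantic-lex-token 'filler start (point)))
    (setq semantic-lex-end-point (point))))
```

Such an analyzer slots into a `define-lex` form exactly like the regexp-based ones.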
>>>
>>> The reason I'm asking: the tokenizer mechanism looks very complex
>>> to me, way more complex than what I need. Besides, it works in a
>>> very inconvenient way: if I could keep state between calls to the
>>> lexer, it would make my life so much easier. I also don't want to
>>> depend on the syntax table and whatever bizarre rules Emacs uses to
>>> understand syntax. This is not your fault, but I have the
>>> impression that many of these rules are there purely by accident;
>>> they are hard to discover and even harder to understand, because
>>> they don't match anything you might come to expect from a
>>> lexer/parser. It would be a whole lot easier to start fresh than to
>>> try to glue together a bunch of jigsaw puzzle pieces clearly taken
>>> from different puzzles.
>>>
>>> I can understand the motivation of someone who gets the syntax
>>> table and mode coloring for free from an existing mode and wants to
>>> reuse them to build the lexer. I don't have these preconditions,
>>> and even the little that I do have, I've written myself, and I'd
>>> rather give it up to make the overall process more consistent. That
>>> is, I don't want to have to design the font-lock rules, the syntax
>>> table, and the lexer separately: to me this would be like doing the
>>> same work twice, but both times using inappropriate tools.
>>>
>>> Best,
>>>
>>> Oleg
>>>
>>> On Sun, Aug 10, 2014 at 5:41 PM, Eric M. Ludlam <er...@si...> wrote:
>>>> On 08/02/2014 05:28 PM, Left Right wrote:
>>>>> One more question. I'm trying to follow the inline code
>>>>> documentation, and here's something I came up with, but I have
>>>>> lots of questions about it:
>>>>>
>>>>> (define-lex-regex-analyzer fmt-lex-filler
>>>>>   "Matches the filler in the format string."
>>>>>   "[^~]+"
>>>>>   (semantic-lex-push-token
>>>>>    (semantic-lex-token
>>>>>     'filler (match-beginning 0) (match-end 0))))
>>>>>
>>>>> (define-lex wisent-fmt-lexer
>>>>>   "Lexical analyzer that handles Common Lisp format."
>>>>>   fmt-lex-filler)
>>>>>
>>>>> 1. Using a regular expression in this analyzer is a really,
>>>>> really bad idea (the proper regexp is more than 300 characters
>>>>> long; this one is here just for illustration), but this
>>>>> complexity could easily be avoided if, instead of a regular
>>>>> expression, I could use a function that takes, say, a position in
>>>>> the buffer or something like that: is that even possible?
>>>>>
>>>>> 2. 'filler isn't a default kind of token. Is my guess correct
>>>>> that I can somehow refer to this kind in the grammar, similar to
>>>>> how %type <symbol> is defined, maybe? What would I need to do to
>>>>> make this possible?
>>>>
>>>> There is a default whitespace token you can create from your
>>>> lexers. For example, the dot lexer starts with these:
>>>>
>>>>   semantic-lex-ignore-whitespace
>>>>   semantic-lex-ignore-newline
>>>>   semantic-lex-ignore-comments
>>>>
>>>> the first of which is implemented like this:
>>>>
>>>>   (define-lex-regex-analyzer semantic-lex-ignore-whitespace
>>>>     "Detect and skip over whitespace tokens."
>>>>     ;; catch whitespace when needed
>>>>     "\\s-+"
>>>>     ;; Skip over the detected whitespace; do not create a token for it.
>>>>     (setq semantic-lex-end-point (match-end 0)))
>>>>
>>>> which means "go to the end of the match, and don't return a
>>>> token." As in your lexer, you have to push the 'filler token to
>>>> get it onto the stack.
>>>>
>>>> The reason you have to set the end point is that when you push a
>>>> token, the lexer looks at the end of your token and moves there
>>>> automatically, but if you don't push a token, you have to move it
>>>> by hand.
>>>>
>>>> Lexical analyzers are interesting in that while a function is made
>>>> for them, those functions aren't used.
>>>> Instead, they also have a value, and those values are
>>>> concatenated together to create the master lexer function, like a
>>>> big cond statement. The main lexer has logic it applies after each
>>>> match is found, and that is where a bunch of the magic happens.
>>>>
>>>> If you aren't trying to ignore your 'filler tokens, you will
>>>> instead need a %token declaration for it, such as:
>>>>
>>>>   %token filler
>>>>
>>>> If you instead had
>>>>
>>>>   %type <filler> syntax "[^~]+"
>>>>
>>>> you wouldn't need to write your lexical analyzer at all, and one
>>>> would be provided for you. (I think; I'm a little fuzzy on that
>>>> one.)
>>>>
>>>> Your filler lexer is OK if it is something you really need, but
>>>> because it can match so much, you MUST put it at the END of your
>>>> defined lexer. That way you will be able to match all your other
>>>> expressions, and if nothing works, you call it filler.
>>>>
>>>> Use:
>>>>
>>>>   M-x semantic-lex-test RET
>>>>
>>>> to see how it works, or
>>>>
>>>>   M-x semantic-lex-debug RET
>>>>
>>>> to watch your lexer run.
>>>>
>>>> Good luck,
>>>> Eric
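Putting the pieces of advice in this message together, the lexer from earlier in the thread might be revised as follows. This is a hypothetical sketch: `fmt-lex-directive` and its crude "~." regexp are invented here, and each pushed token symbol would still need a %token declaration like the `%token filler` shown above.

```elisp
(require 'semantic/lex)

;; Directive analyzer: a tilde plus one character.  Crude, but specific
;; enough to be tried before the catch-all filler analyzer.
(define-lex-regex-analyzer fmt-lex-directive
  "Match a Common Lisp format directive (a tilde plus one character)."
  "~."
  (semantic-lex-push-token
   (semantic-lex-token 'DIRECTIVE (match-beginning 0) (match-end 0))))

;; Catch-all filler analyzer, as posted earlier in the thread.
(define-lex-regex-analyzer fmt-lex-filler
  "Match the filler text between directives."
  "[^~]+"
  (semantic-lex-push-token
   (semantic-lex-token 'filler (match-beginning 0) (match-end 0))))

;; Analyzers are tried in order, so per the advice above the catch-all
;; filler comes last.
(define-lex wisent-fmt-lexer
  "Lexical analyzer that handles Common Lisp format strings."
  fmt-lex-directive
  fmt-lex-filler
  semantic-lex-default-action)
```

Running M-x semantic-lex-test in an fmt-mode buffer would then show the token stream this lexer produces.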