Thread: [CEDET-devel] senator-next-tag: Buffer was not parsed by Semantic.
From: Oleg S. <ole...@gm...> - 2014-07-02 19:30:18
Hi list,

I'm getting this message:

  senator-next-tag: Buffer was not parsed by Semantic.

I thought I had compiled a simple grammar for a mode that I'm trying to test. Below is the mode initialization stuff:

  (define-derived-mode fmt-mode fundamental-mode
    "Common Lisp Format mode"
    "Major mode for highlighting of the Common Lisp format mini-language.
  This mode uses its own keymap:
  \\{fmt-mode-map}"
    (kill-all-local-variables)
    (setq major-mode 'fmt-mode)
    (use-local-map fmt-mode-map)
    (setf mode-name "Common Lisp Format")
    (run-hooks 'fmt-mode-hook)
    (semantic-mode 1))

Nothing fancy; I'm sure it reaches the (semantic-mode 1) call.

I have a fmt.wy file from which I can generate a fmt-wy.el which has the following:

  (defun fmt-wy--install-parser ()
    "Setup the Semantic Parser."
    (semantic-install-function-overrides
     '((parse-stream . wisent-parse-stream)))
    (setq semantic-parser-name "LALR"
          semantic--parse-table fmt-wy--parse-table
          semantic-debug-parser-source "fmt.wy"
          semantic-flex-keywords-obarray fmt-wy--keyword-table
          semantic-lex-types-obarray fmt-wy--token-table)
    ;; Collect unmatched syntax lexical tokens
    (semantic-make-local-hook 'wisent-discarding-token-functions)
    (add-hook 'wisent-discarding-token-functions
              'wisent-collect-unmatched-syntax nil t))

  (define-lex wisent-fmt-lexer
    "Lexical analyzer that handles Common Lisp format."
    semantic-lex-ignore-newline
    semantic-lex-ignore-comments
    semantic-lex-default-action)

  (provide 'fmt-wy)

I can require fmt-wy all right (it gives some warnings, but they don't seem to be important), but no parsing seems to be happening in the test file I'm trying to edit. What do I have to do besides what I've done?

Also, how would I debug reduce conflicts? Is there any way to make Semantic more verbose when reporting them? The report of having a reduce conflict is really like pointing a finger at the sky... unless it gives a hint about what terminals or rules are in conflict.

Lastly, sorry I put many issues together! Is there a way to create character classes, such as, for example, "any character but tilde"? Well, actually, negation would help my case too, but just for general knowledge I'd like, if possible, to know the answer to the character-classes question too!

Thanks,

Oleg
From: Left R. <ole...@gm...> - 2014-07-02 19:34:18
Sorry, I forgot to mention, my fmt.wy file has this:

  %languagemode fmt-mode

(I believe this should make Semantic use the parser in fmt-mode, shouldn't it?)

On Wed, Jul 2, 2014 at 10:28 PM, Oleg Sivokon <ole...@gm...> wrote:
> [...]
From: Eric M. L. <er...@si...> - 2014-07-03 11:25:33
On 07/02/2014 03:28 PM, Oleg Sivokon wrote:
> Hi list,
> I'm getting this message:
>
>   senator-next-tag: Buffer was not parsed by Semantic.
>
> [...]
>
> Nothing fancy, I'm sure it reaches the (semantic-mode 1) call.

Hi Oleg,

`semantic-mode' only needs to be called once when you start Emacs. To get your mode set up for parsing via Semantic you need to add your setup function to `semantic-new-buffer-setup-functions'. I suppose you could also just call your setup function directly from your mode if you wanted to, but then your mode would depend on Semantic directly.

For a fresh new mode, you would need 3 files:

  blah-mode.el - The standard Emacs mode for your mode.
  blah.wy & blah-wy.el - The parser and the generated file.
  semantic-blah.el or wisent-blah.el - The hand-written support code for the parser.

The support file will have your -setup function. The setup function will call your --install-parser function and set up any special variables needed when Semantic is active (such as which lexer to use, and any override variables such as how to convert tag classes into nice strings).

You could look at SRecode's template mode as an example. It has everything together; in that case there is:

  srt.wy
  srt-wy.el
  template.el - hand-written support file

> I can require fmt-wy all right (it gives some warnings, but they don't
> seem to be important), but no parsing seems to be happening in the test
> file I'm trying to edit. What do I have to do besides what I've done?
>
> Also, how would I debug reduce conflicts? Is there any way to make
> Semantic more verbose when reporting them? The report of having a reduce
> conflict is really like pointing a finger at the sky... unless you give a
> hint about what terminals or rules are in conflict.

Hopefully the message will help identify the problem. There is a short section in the 'wisent' doc on how to fix them. You could also check Bison's doc, as the technique is the same. Sadly it is more by code inspection than with a debugger.

> Lastly, sorry I put many issues together! Is there a way to create
> character classes, such as, for example "any character but tilde"?
> [...]

You will need to create a custom lex rule. That uses Emacs regex rules. Thus you could create "[^~]" for anything but tilde, or "[~]" for only tildes. Check the elisp manual for all the fun regexp rules.

Good Luck
Eric
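A minimal sketch of such a custom lex rule, following the "[^~]" suggestion above. The analyzer and token-class names (fmt-lex-filler-sketch, FILLER) are illustrative, not from the thread:

```elisp
;; `define-lex-simple-regex-analyzer' creates an analyzer that pushes a
;; token of the given class spanning each regexp match.
(define-lex-simple-regex-analyzer fmt-lex-filler-sketch
  "Match a run of characters that are not tildes."
  "[^~]+" 'FILLER)

;; The analyzer is then listed inside a `define-lex' form.  Order
;; matters: a catch-all like this should come last so more specific
;; analyzers get a chance to match first.
(define-lex wisent-fmt-lexer-sketch
  "Sketch of a lexer ending with a catch-all filler analyzer."
  semantic-lex-ignore-newline
  fmt-lex-filler-sketch
  semantic-lex-default-action)
```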
From: Eric M. L. <er...@si...> - 2014-07-12 01:40:20
Hi Oleg,

I'm not sure how to debug the fcns you posted below. I think they are ok. Since you appear to be defining your own mode, let me instead annotate how the parsing for "dot" works, which is found in these files:

  lisp/cedet/cogre/dot-mode.el
  lisp/cedet/cogre/wisent-dot.wy
  lisp/cedet/cogre/wisent-dot.el

and the generated file

  lisp/cedet/cogre/wisent-dot-wy.el

I picked this mode because it is pretty simple, just enough to get the layout code of COGRE working. It is also not installed by default in semantic-new-buffer-setup-functions.

Let's start in dot-mode.el:

Note the syntax table. This part is critical for the lexer to work. If you duplicated some other mode, you probably have one of these.

In cogre-dot-mode, which is named such to avoid conflict with other dot modes, note that it sets up comment-start and comment-start-skip - these are important for the lexer also.

Also note the hook running at the end.

Note the auto-mode-alist modification.

Lastly, note the mode-local-parent stuff. That is set up to make sure that cogre-dot-mode agrees with graphviz-dot-mode. You don't need anything like this if your mode is standalone.

Next is wisent-dot.wy.

At the beginning is the %languagemode setting that matches, in this case, the core graphviz mode, which I had to make optional. I think you did this correctly already.

At the end, after the %%, is a lexer definition. This uses a bunch of default stuff, plus lexers defined in the language for keywords, etc.

You can then compile this grammar into wisent-dot-wy.el. If you are in a compile/debug cycle, you then need to enter wisent-dot-wy.el and force-eval several tables with C-M-x, because the defvars carefully save old values, so just evaluating the buffer causes a no-op. :(

Last is the key piece: wisent-dot.el

Note that this pulls in wisent-dot-wy, plus wisent itself and any sources to functions you need to override.

The override for semantic-tag-components is important to implement if you have ANY tags that are compound, such as a class with fields, etc.

Note wisent-dot-setup-parser. It installs the parser using a function from wisent-dot-wy.el. That is how the parser gets pulled in.

It also sets up the lexer, extra syntax mods needed, and a few other random things such as command separators and how to convert your tag classes into text strings. On the whole, the first statement and the first 2 variables are the most important. The rest is optional.

Lastly are hooks to run the parser setup. These hooks can be replaced by adding the setup function to semantic-new-buffer-setup-functions. Feel free to start with the hook, and use the setup function when you want to make Semantic support optional with your mode.

If you already did all this, it could be that your parser is broken, or parser recompiles are not getting loaded in correctly. Fire up a new Emacs, load your code, and test it to avoid the recompile issue. If that helps, you need to hand-load variable changes from generated files.

Another good trick is to use semantic-show-parser-state-mode. This shows symbols in the mode line to tell you how the parser is doing. It will either refuse to start if the parser is not installed, or show % if the parser is broken or the buffer you are parsing is just not complete.

Another fun one is semantic-highlight-edits-mode, which shows how the buffer is edited and reparsed - helpful if the incremental parser is broken with your language parser.

Lastly, use semantic-show-unmatched-syntax-mode to see if the parser is just tagging your whole buffer as unparsable. If this happens, you need to work on your parser some more.

I hope this helps.
Eric
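The registration advice above can be put in code form. A sketch, reusing the fmt-mode example from earlier in the thread and assuming a hand-written setup function named wisent-fmt-setup-parser (both names are hypothetical):

```elisp
;; Register the setup function so Semantic installs the parser whenever
;; a fmt-mode buffer is created while `semantic-mode' is active.
(require 'semantic)
(add-to-list 'semantic-new-buffer-setup-functions
             '(fmt-mode . wisent-fmt-setup-parser))

;; The diagnostic minor modes mentioned above can be toggled per buffer:
;;   M-x semantic-show-parser-state-mode
;;   M-x semantic-highlight-edits-mode
;;   M-x semantic-show-unmatched-syntax-mode
```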
From: Left R. <ole...@gm...> - 2014-08-02 11:59:32
Hi Eric,

Sorry it took me so long to reply. I was finally able to at least get the dot-mode to work. The way I managed was by requiring:

  (require 'cogre/dot-mode)
  (require 'cogre/wisent-dot)
  (require 'cogre/wisent-dot-wy)

I also needed to update from the Semantic bundled with Emacs 24.3.50 to the one I pulled from VCS today; otherwise, as I discovered post factum, it was trying to use a different parser (LR(1) instead of LL). I'm not sure how this change came about, since the dot mode files didn't change across the versions. Yet when it was reading the grammar using the LR parser, it would run into shift/reduce conflicts.

I'm still struggling with my mode though, and, if you will be so kind, could you please explain a few things about the dot grammar?

  %type <punctuation> syntax "\\s.+"

I searched high and low, but I can't find an exhaustive reference to Emacs-style regexps, therefore I can't tell for sure what this regexp means, but I came to believe that it means a single "whitespace" character followed by whatever. I can't understand the meaning of this line, despite reading the documentation:

---- begin quote ----

 -- %-Decl: %type <type-name> [property1 value1 ...]

Explicitly declare a lexical type, and optionally give it properties.

type-name
    Is a symbol that identifies the type.
property
    Is a property name, a valid Emacs Lisp symbol.
value
    Is a property value, a valid Emacs Lisp constant expression.

Even if %token, %keyword, and precedence declarations can implicitly declare types, an explicit declaration is required for every type:

  - To assign it properties.
  - To auto-generate a lexical rule that detects tokens of this type.
    For more information, see Auto-generation of lexical rules.

---- end quote ----

What does this declaration do? This looks suspiciously similar to the entries in a syntax table, but then it doesn't make much sense, since Emacs has a different way to mark punctuation...

Second:

  %token <block> BRACKET_BLOCK "(LBRACKET RBRACKET)"

---- begin quote ----

The %token statement declares a terminal symbol (a token) which is not a keyword.

 -- %-Decl: %token [<type-name>] token-name match-value
 -- %-Decl: %token [<type-name>] token-name1 ...

Respectively declare one token with an optional type and a match value, or several tokens with the same optional type and no match value.

type-name
    Is an optional symbol, enclosed between < and >, that specifies (and implicitly declares) a type for this token (see type Decl). If omitted, the token has no type.
token-name
    Is the terminal symbol used in grammar rules to represent this token.
match-value
    Is an optional string. Depending on type-name properties, it will be interpreted as an ordinary string, a regular expression, or have a more elaborate meaning. If omitted, the match value will be nil, which means that this token will be considered as the default token of its type (see type Decl for more information).

---- end quote ----

The documentation speaks about some "more elaborate meaning". Can you tell me, please, what this meaning is? The two things inside the parentheses are other tokens which match literal brackets, but does this one match "[]" or "\\[[^\\]]+\\]"?

Third:

  ;;; Bland default types
  %type <symbol>
  %token <symbol> symbol

  %type <string>
  %token <string> string

  %type <number>
  %token <number> number

I understand what this is supposed to do, but I can't understand how it achieves that. Can you please interpret that in words? To me this looks like magic: how does the token `number' know how to match numbers?

PS. The links to the Bison documentation in the online version are broken (they point to www.randomsample.de instead of www.gnu.org).

Thanks,

Oleg

On Sat, Jul 12, 2014 at 4:40 AM, Eric M. Ludlam <er...@si...> wrote:
> [...]
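One point in the question above that the replies never address head-on: in Emacs regexps, `\s' followed by a class character matches by syntax class, as defined by the current syntax table. The class character `-' means whitespace, while `.' means punctuation, so "\\s.+" means "one or more characters whose syntax class is punctuation", not "a whitespace character followed by whatever". A quick illustration (the buffer contents are arbitrary):

```elisp
;; In a fundamental-mode temp buffer the standard syntax table is in
;; effect, where characters like "," and ";" have punctuation syntax
;; and letters are word constituents.
(with-temp-buffer
  (insert "abc,;def")
  (goto-char (point-min))
  (when (re-search-forward "\\s.+" nil t)
    (match-string 0)))  ; matches the punctuation run ",;"
```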
From: Left R. <ole...@gm...> - 2014-08-02 21:28:35
One more question. I'm trying to follow the inline code documentation, and here's something I came up with, but I have lots of questions about it:

  (define-lex-regex-analyzer fmt-lex-filler
    "Matches the filler in the format string."
    "[^~]+"
    (semantic-lex-push-token
     (semantic-lex-token
      'filler (match-beginning 0) (match-end 0))))

  (define-lex wisent-fmt-lexer
    "Lexical analyzer that handles Common Lisp format."
    fmt-lex-filler)

1. Using a regular expression in this analyzer is a really, really bad idea (the proper regexp is more than 300 characters long; this one is here just for illustration), but this complexity could easily be avoided if, instead of a regular expression, I could use a function that takes, say, a position in the buffer or something like that: is that even possible?

2. 'filler isn't a default kind of token. Is my guess correct that I can somehow refer to this kind in the grammar, similar to how %type <symbol> is defined, maybe? What would I need to do to make this possible?

Thanks!

Oleg

On Sat, Aug 2, 2014 at 2:59 PM, Left Right <ole...@gm...> wrote:
> [...]
From: Eric M. L. <er...@si...> - 2014-08-10 14:41:28
On 08/02/2014 05:28 PM, Left Right wrote:
> One more question. I'm trying to follow the inline code documentation,
> and here's something I came up with, but I have lots of questions
> about it:
>
> [...]
>
> 1. Using a regular expression in this analyzer is a really, really bad
> idea [...]: is that even possible?
>
> 2. 'filler isn't a default kind of token. Is my guess correct that I
> can somehow refer to this kind in the grammar, similar to how %type
> <symbol> is defined, maybe? What would I need to do to make this
> possible?

There is a default whitespace token you can create from your lexers. For example, the dot lexer starts with these:

  semantic-lex-ignore-whitespace
  semantic-lex-ignore-newline
  semantic-lex-ignore-comments

which is implemented like this:

  (define-lex-regex-analyzer semantic-lex-ignore-whitespace
    "Detect and skip over whitespace tokens."
    ;; catch whitespace when needed
    "\\s-+"
    ;; Skip over the detected whitespace, do not create a token for it.
    (setq semantic-lex-end-point (match-end 0)))

which means "go to the end of the match, and don't return a token". As you have in your lexer, you have to push the 'filler token to get it on the stack.

The reason you have to set the end point is that when you push a token, the lexer looks at the end of your token and moves there automatically, but if you don't push a token, you have to move it by hand.

Lexical analyzers are interesting in that, while a function is made for them, those functions aren't used. Instead they also have a value, and those values are concatenated together to create the master lexer function, like a big cond statement. The main lexer has logic it applies after each match is found, and that is where a bunch of the magic happens.

If you aren't trying to ignore your 'filler tokens, you will instead need a %token declaration for it, such as:

  %token filler

If you instead had

  %type <filler> syntax "[^~]+"

you wouldn't need to write your lexical analyzer at all, and one would be provided for you. (I think; I'm a little fuzzy on that one.)

Your filler lexer is OK if it is something you really need, but because it can match so much, you MUST put it at the END of your defined lexer. That way you will be able to match all your other expressions, and if nothing works, you call it filler.

Use:

  M-x semantic-lex-test RET

to see how it works, or

  M-x semantic-lex-debug RET

to watch your lexer run.

Good Luck
Eric
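For reference, the two suggestions above would look something like this in a fmt.wy grammar. This is only a sketch: the FILLER token name is hypothetical, and per the caveat above, whether the analyzer is auto-generated from the %type declaration should be verified against the grammar framework docs:

```
;; Declaring the type with a `syntax' property should let the grammar
;; framework auto-generate a matching lexical rule, so no hand-written
;; analyzer is needed:
%type  <filler> syntax "[^~]+"
%token <filler> FILLER
```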
From: Left R. <ole...@gm...> - 2014-08-16 22:12:29
Hi, and thanks for the thorough replies. I think I will need to go through them again, but so far I have one question, which may possibly spare me trying to internalize all of this material.

My last question wrt define-lex-regex-analyzer wasn't about how I could make a regular-expression-based lexer. The truth is: I don't want any regular expressions there; they are just not cut out for the task. So, let me rephrase it: can I ditch the whole mechanism of the lexer and replace it with brand new code which will handle the tokenization? What would I need to do to achieve this?

I don't have much experience with writing lexers; in fact, I only ever used cl-yacc, which simply leaves it to the programmer to implement a lexer and only requires a very simple interface: a function that accepts an input stream and returns a token.

The reason I'm asking: the tokenizer mechanism looks very complex to me, way more complex than what I need. Besides, it works in a very inconvenient way: if I could keep state between the calls to the lexer, it would make my life so much easier. I also don't want to depend on the syntax table and whatever bizarre rules Emacs uses to understand the syntax: this is not your fault, but I have an impression that many of these rules are there purely by accident; they are hard to discover and even harder to understand, because they don't match anything you might come to expect from a lexer / parser. It would be just a whole lot easier to start fresh than to try to glue together a bunch of jigsaw puzzle pieces clearly taken from different puzzles.

I can understand the motivation for someone who gets the syntax table and mode coloring for free from an existing mode and wants to reuse it in order to build the lexer. I don't have these preconditions, and even the little that I do have, I've written myself, and I'd rather give it up to make the overall process more consistent. I.e. I don't want to have to design font-lock rules, the syntax table, and the lexer separately: to me this would be like doing the same work twice, but both times using inappropriate tools.

Best,

Oleg

On Sun, Aug 10, 2014 at 5:41 PM, Eric M. Ludlam <er...@si...> wrote:
> [...]
As you > have in your lexer, you have to push the 'filler token to get it on the > stack. > > The reason you have to set the end point is because when you push a token, > it looks at the end of your token, and moves there automatically, but if you > don't push a token, you have to move it by hand. > > Lexical analyzers are interesting, in that while a function is made for > them, those functions aren't used. Instead they also have a value, and > those values are concatenated together to create the master lexer function. > Like a big cond statement. The main lexer has logic it applies after each > match is found, and that is where a bunch of the magic happens. > > If you aren't trying to ignore your 'filler tokens, you will instead need a > %token declaration for it, such as: > > %token filler > > If you instead had > > %type<filler> syntax "[^~]+" > > you wouldn't need to write your lexical analyzer at all and one would be > provided for you. (I think, I'm a little fuzzy on that one.) > > Your filler lexer is OK if it is something you really need, but because it > can match so much, you MUST put it at the END of your defined lexer. That > way you will be able to match all your other expressions, and if nothing > works, you call it filler. > > use: > > M-x semantic-lex-test RET > > to see how it works, or > > M-x semantic-lex-debug RET > > to watch your lexer run. > > Good Luck > Eric |
From: Left R. <ole...@gm...> - 2014-08-16 22:52:16
|
Just to give you a sense of what I /don't/ want to have in my code (below is my own code, so I'm allowed to say that it's unmaintainable cuneiform):

  (defvar fmt-font-lock-keywords
    ;; no-args
    `(("~\\(@:?\\|:@?\\)?[]>()}aswvcp;_]"
       (0 font-lock-keyword-face))
      ;; numeric-arg
      ("~\\([0-9]*\\|#,?\\)\\(@:?\\|:@?\\)?[i*%&|~{[]"
       (0 font-lock-keyword-face))
      ;; decimal
      ("~\\([0-9]*\\|#\\(,[0-9]*\\|#\\)\\{0,3\\}\\)?\\(@:?\\|:@?\\)?[rdbox]"
       (0 font-lock-keyword-face))
      ;; floating-point f
      (,(concat
         "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{3\\}\\(,'\\w\\)\\{1,2\\}\\)\\|"
         "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)\\)"
         "?\\(@:?\\|:@?\\)?f")
       (0 font-lock-keyword-face))
      ;; floating-point e, g
      (,(concat
         "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{3\\}\\(,'\\w\\)\\{1,3\\}\\)\\|"
         "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)\\)"
         "?\\(@:?\\|:@?\\)?[eg]")
       (0 font-lock-keyword-face))
      ;; currency
      (,(concat
         "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{2\\}\\(,'\\w\\)\\)\\|"
         "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,2\\}\\)\\)"
         "?\\(@:?\\|:@?\\)?[$]")
       (0 font-lock-keyword-face))
      ;; tabulation
      ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)?\\)?\\(@:?\\|:@?\\)?t"
       (0 font-lock-keyword-face))
      ;; escape
      ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,2\\}\\)?\\(@:?\\|:@?\\)?^"
       (0 font-lock-keyword-face))
      ;; logical block
      ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)?\\(@:?\\|:@?\\)?<"
       (0 font-lock-keyword-face))
      ;; custom function
      (,(concat
         "~\\(\\([0-9]+\\|'\\w\\|#\\)\\(,\\([0-9]+\\|'\\w\\|#\\)+\\)*\\)?"
         "\\(@:?\\|:@?\\)?\\/[^\\s\\n,#@]+\\/")
       (0 font-lock-keyword-face))))

This is my previous version of font-lock coloring. I don't expect you to read through it, but just to make the point even more obvious: this is actually a single regular expression, which I chopped into pieces for "ease" of use. A lexer based on regexps would need this mess concatenated into a single expression. Maybe it can be simplified, but not by much.
The corresponding parsing function, which doesn't use regular expressions, would be somewhere between 1/3 and 1/2 of the above code, and it would be perfectly understandable. This is similar to email parsing: you can do it with a regular grammar in principle, but there is no good way to do it in practice.

I later found that I can provide a function to font-lock to replace this mess, and I would be happy if there were a way to do the same in place of the Semantic lexer.

Best,

Oleg
|
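For reference, the function-based font-lock matcher Oleg mentions works like this: the MATCHER slot in `font-lock-keywords` may be a function instead of a regexp. Font-lock calls it with one argument, the search limit; it must behave like `re-search-forward` — move point past the match, set the match data, and return non-nil on success. A sketch (the function name and its deliberately simplistic scan are hypothetical, not Oleg's actual code):

```elisp
;; Sketch: a function matcher for font-lock. Real code would scan the
;; directive's prefix arguments and directive character here, instead
;; of using a giant regexp.
(defun fmt--match-directive (limit)
  "Find the next format directive before LIMIT.
Move point past it, set the match data, and return non-nil,
following the `re-search-forward' contract font-lock expects."
  (when (re-search-forward "~" limit t)
    (let ((start (match-beginning 0)))
      ;; Consume one directive character, staying within LIMIT.
      (when (< (point) limit)
        (forward-char 1))
      (set-match-data (list start (point)))
      t)))

(defvar fmt-font-lock-keywords
  '((fmt--match-directive (0 font-lock-keyword-face))))
```

Because the matcher is an ordinary function, it can keep whatever state it likes between characters, which is exactly what the big regexp could not do.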
From: Eric M. L. <er...@si...> - 2014-08-10 14:26:08
|
On 08/02/2014 07:59 AM, Left Right wrote:
> Hi Eric,
>
> Sorry it took me so long to reply. I was finally able to at least get
> the dot-mode to work. The way I managed was by requiring:
>
>   (require 'cogre/dot-mode)
>   (require 'cogre/wisent-dot)
>   (require 'cogre/wisent-dot-wy)
>
> I also needed to update from the Semantic bundled with Emacs 24.3.50 to
> the one I pulled from VCS today; otherwise, as I discovered post
> factum, it was trying to use a different parser (LR(1) instead of LL).
> I'm not sure how this change came about, since the dot mode files
> didn't change across the versions. Yet when it was reading the grammar
> using the LR parser, it would run into shift/reduce conflicts.

Hi Oleg,

My setup for CEDET in my .emacs is basically the same as in the INSTALL file with the version of CEDET from BZR, and that will load up .dot files just fine. It is surprising to me that you need all the extra loads. Perhaps the build didn't create the autoload files for you?

> I'm still struggling with my mode though, and, if you will be so kind,
> could you please explain a few things about the dot grammar?
>
>   %type<punctuation> syntax "\\s.+"

In this case \s means "match a syntax type", and the "." means the syntax code for punctuation. The \\ is quoting in one slash. Here's a doc snippet:

  `\sCODE' matches any character whose syntax is CODE. Here CODE is a
  character that represents a syntax code: thus, `w' for word
  constituent, `-' for whitespace, `(' for open parenthesis, etc. To
  represent whitespace syntax, use either `-' or a space character.
  *Note Syntax Class Table::, for a list of syntax codes and the
  characters that stand for them.

So the whole statement is: "Create lexical tokens of type punctuation that match the regular expression for punctuation from the Emacs syntax table." In other words, it is a statement translating from Emacs speak to lexer speak.

> I searched high and low, but I can't find an exhaustive reference to
> Emacs-style regexps, therefore I can't tell for sure what this regexp
> means: but I came to believe that it means a single "whitespace"
> character followed by whatever. I can't understand the meaning of this
> line, despite reading the documentation:

There is a doc node in the "Elisp" manual called "Syntax of Regular Expressions" that I use.

> ---- begin quote ----
>
> — %-Decl: %type<type-name> [property1 value1 ...]
>
> Explicitly declare a lexical type, and optionally give it properties.
>
> type-name  Is a symbol that identifies the type.

This would be a type for the lexer.

> property  Is a property name, a valid Emacs Lisp symbol.
> value  Is a property value, a valid Emacs Lisp constant expression.

This lets you specify that the syntax (the property) matches some regexp. If you leave it blank there are some handy defaults.

> Even if %token, %keyword, and precedence declarations can implicitly
> declare types, an explicit declaration is required for every type:
>
>   - To assign it properties.
>   - To auto-generate a lexical rule that detects tokens of this type.
>     For more information, see Auto-generation of lexical rules.
>
> ---- end quote ----
>
> What does this declaration do? This looks suspiciously similar to the
> entries in the syntax table, but then it doesn't make much sense, since
> Emacs has a different way to mark punctuation...
>
> Second:
>
>   %token<block> BRACKET_BLOCK "(LBRACKET RBRACKET)"
>
> ---- begin quote ----

Once you have a lexical %type you can create %tokens that are more specific. For example you might say

  %type <punctuation> syntax "\\s."

to match a single punctuation character, and then say

  %token <punctuation> PLUS "+"

to create a token called PLUS that you can use in your grammar. This two-step process lets the lexer quickly find your punctuation, and then convert generic punctuation into handy named tokens for use in your grammar.
<block> tokens are special in that the Emacs syntax table supports block concepts, and we use blocks to speed up grammar parsing. While unusual in grammars, it lets us parse buffers more quickly by skipping over large chunks of text. Thus the combination of:

  %type  <block>
  %token <block> BRACKET_BLOCK "(LBRACKET RBRACKET)"
  %token <open-paren>  LBRACKET "["
  %token <close-paren> RBRACKET "]"

says: "I have a %type in my lexer called block. I can create a <block> token that is composed of LBRACKET and RBRACKET. I have an <open-paren> lexical type called LBRACKET which matches [." Then the lexer has a special 'depth' parameter, and if it is set to 0, the lexer will return BRACKET_BLOCK.

In the match of BRACKET_BLOCK you can expand, and then get the LBRACKET token, like this:

  graphgeneric
    : GRAPH BRACKET_BLOCK SEMI
      (TAG "GRAPH" 'generic-graph
           :attributes (EXPANDFULL $2 attribute-block))
    ;

where EXPANDFULL on $2 says "run this grammar again on the buffer contents inside $2 (the BRACKET_BLOCK), starting with the grammar symbol attribute-block". The lexer will be run on that block with a depth of 1, forcing it to look inside the parens (or brackets).

  attribute-block
    : LBRACKET
      ()
    | RBRACKET
      ()
    | COMMA
      ()
    ;; This is a catch-all in case we miss some keyword.
    | symbol EQUAL name
      (TAG $1 'attribute :value $3)
    ;

So now in the bracket block, we match the brackets, commas, etc., and just start creating tags for each attribute name found. This set of nested tags needs to be matched outside of the dot grammar with a function for expanding tags. In wisent-dot.el you will find semantic-tag-components, which matches 'generic-graph from the first rule and returns the :attributes, which is the list of tags created with attribute-block.

> The %token statement declares a terminal symbol (a token) which is not
> a keyword.
>
> — %-Decl: %token [<type-name>] token-name match-value
> — %-Decl: %token [<type-name>] token-name1 ...
>
> Respectively declare one token with an optional type and a match
> value, or several tokens with the optional same type and no match
> value.
>
> type-name  Is an optional symbol, enclosed between < and >, that
>   specifies (and implicitly declares) a type for this token (see type
>   Decl). If omitted, the token has no type.
> token-name  Is the terminal symbol used in grammar rules to represent
>   this token.
> match-value  Is an optional string. Depending on type-name properties,
>   it will be interpreted as an ordinary string, a regular expression,
>   or have a more elaborate meaning. If omitted, the match value will
>   be nil, which means that this token will be considered the default
>   token of its type (see type Decl for more information).
>
> ---- end quote ----
>
> The documentation speaks about some "more elaborate meaning". Can you
> tell, please, what this meaning is? The two things inside the
> parentheses are other tokens which match literal brackets, but does
> this one match "[]" or "\\[[^\\]]+\\]"?

You are now pushing the boundary of what I am familiar with, as I didn't develop most of this system. Perhaps my earlier examples helped?

> Third:
>
>   ;;; Bland default types
>   %type  <symbol>
>   %token <symbol> symbol
>
>   %type  <string>
>   %token <string> string
>
>   %type  <number>
>   %token <number> number
>
> I understand what this is supposed to do, but I can't understand how
> it achieves that. Can you, please, interpret that in words? To me this
> looks like magic: how does the token `number' know how to match
> numbers?

In the wisent-dot example, the grammar code:

  ;;; Bland default types
  %type  <symbol>
  %token <symbol> symbol

is matched with:

  (define-lex wisent-dot-lexer
    "Lexical analyzer that handles DOT buffers.
  It ignores whitespace, newlines and comments."
    ...
    wisent-dot-wy--<symbol>-regexp-analyzer

and there is code generated like this in wisent-dot-wy.el:

  (define-lex-regex-type-analyzer wisent-dot-wy--<symbol>-regexp-analyzer
    "regexp analyzer for <symbol> tokens."
    "\\(\\sw\\|\\s_\\)+"
    nil
    'symbol)

So basically, there are default regular expressions for many types, like symbol, that will auto-generate lexer pieces. You still need to assemble your lexer by ordering the pieces from most specific to most generic.

This is mostly derived from the fact that Emacs has a built-in lexer-like thing created using syntax tables. A good major mode defines a good syntax table, and then the lexer can be very simple, basically matching up syntax types via \\s to the lexical types needed by the parser. You can then overlay more specific token types on top of those. By using the syntax table, the Semantic lexer takes advantage of the Emacs scanners built in C, and can go very fast.

I hope that helps.
Eric
|
From: Eric M. L. <er...@si...> - 2014-08-17 13:07:55
|
Hi Oleg,

Replacing the lexer is pretty easy. In most languages (I'll use dot again) there is a line that looks like this:

  (setq
   ;; Lexical Analysis
   semantic-lex-analyzer 'wisent-dot-lexer
   ...

which was created like this:

  (define-lex wisent-dot-lexer
    "Lexical analyzer that handles DOT buffers.
  It ignores whitespace, newlines and comments."
    semantic-lex-ignore-whitespace
    ...

If you follow the doc trail, you end up with this:

  -----------
  semantic-lex is an autoloaded compiled Lisp function in `lex.el'.

  (semantic-lex START END &optional DEPTH LENGTH)

  Lexically analyze text in the current buffer between START and END.
  Optional argument DEPTH indicates at what level to scan over entire
  lists. The last argument, LENGTH, specifies that `semantic-lex'
  should only return LENGTH tokens. The return value is a token
  stream. Each element is a list of the form
    (symbol start-expression . end-expression)
  where SYMBOL denotes the token type. See `semantic-lex-tokens'
  variable for details on token types. END does not mark the end of
  the text scanned, only the end of the beginning of text scanned.
  Thus, if a string extends past END, the end of the returned token
  will be larger than END. To truly restrict scanning, use
  `narrow-to-region'.
  ----------

So this function will parse an entire buffer and return all the lexical tokens for it. You can put anything you want in there, and return tokens with any old SYMBOL you want, too.

Semantic's first lexer (see semantic-flex) is an example of a different standalone lexer. It was only after struggling with that for a while that the mechanism for making custom lexers came up. It is partly modeled after lex/flex, where regexps are associated with actions.

If regexps don't make sense for your language, then rolling your own is no problem. You still need to add %token expressions in your grammar to let the grammar know what is going on, though. You just don't need to specify all the regexps along the way, or use the automatically generated lexers.
Your big scary regexp below is not really necessary for writing lexers using the lexing technique I described last time, though. Each expression only needs to be as long as the small piece you are looking at (i.e., a number or symbol). If you find that your language has a lexical token whose type is based on previous lexical tokens, then you are right that the built-in lexer is probably insufficient.

The C parser has examples that parse out things like:

  #include <foo.h>

and

  #if SOMESYMBOL
  #endif

that way, but it gets pretty hairy.

Eric
|
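Following the doc trail Eric describes, a hand-rolled replacement lexer is just a function with the `semantic-lex' calling convention that returns a list of (SYMBOL START . END) tokens. Below is a minimal sketch of what that could look like for the fmt language; the function name, the token names ('directive, 'filler), and the trivial scanning logic are all made up for illustration, and the token names would have to match %token declarations in the grammar.

```elisp
;; Sketch of a standalone lexer that bypasses define-lex entirely.
;; It can keep any state it likes between characters, since it is
;; plain Lisp code scanning the buffer.
(defun fmt-hand-lexer (start end &optional depth length)
  "Tokenize Common Lisp format text between START and END by hand.
Return a token stream of (SYMBOL START . END) elements, the shape
documented for `semantic-lex'."
  (ignore depth length)                  ; unused in this sketch
  (let ((tokens nil))
    (save-excursion
      (goto-char start)
      (while (< (point) end)
        (let ((tok-start (point)))
          (if (eq (char-after) ?~)
              (progn
                ;; A directive: scan it character by character. Real
                ;; code would parse prefix args and the directive
                ;; character here, statefully, with no regexps.
                (forward-char 1)
                (when (< (point) end)
                  (forward-char 1))      ; consume the directive char
                (push (cons 'directive (cons tok-start (point)))
                      tokens))
            ;; Everything up to the next tilde is filler.
            (skip-chars-forward "^~" end)
            (push (cons 'filler (cons tok-start (point)))
                  tokens)))))
    (nreverse tokens)))

;; Installed in place of the generated lexer in the mode setup:
;; (setq semantic-lex-analyzer #'fmt-hand-lexer)
```

This matches the cl-yacc-style interface Oleg asked for, except that it returns the whole token stream at once rather than one token per call, which is what Semantic expects.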
From: Left R. <ole...@gm...> - 2014-08-17 14:52:33
|
Thanks, I'll try it this weekend. Re' > Your big scary regexp below is not really necessary for writing lexers using the lexing technique I described last time though. Each expression only needs to be as long as the small piece you are looking at (ie - a number or symbol). If you find that your language has a lexical token whose type is based on previous lexical tokens, then you are right that the built in lexer is probably insufficient. Lo and behold, that huge regexp matches things like "~^", "~10,20,'xf", "~#[" and so on. Most of which are two or three characters long! It's long not because the text it matches is long, it's long because it's difficult to match the text precisely (You probably saw this already, but if not, you might enjoy a good laugh: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html ). Best, Oleg On Sun, Aug 17, 2014 at 4:07 PM, Eric M. Ludlam <er...@si...> wrote: > Hi Oleg, > > Replacing the lexer is pretty easy. In most languages (I'll use dot again) > there is a line that looks like this: > > (setq > ;; Lexical Analysis > semantic-lex-analyzer 'wisent-dot-lexer > ... > > which was created like this: > > > (define-lex wisent-dot-lexer > "Lexical analyzer that handles DOT buffers. > It ignores whitespace, newlines and comments." > semantic-lex-ignore-whitespace > ... > > If you follow the doc trail, you end up with this: > > ----------- > semantic-lex is an autoloaded compiled Lisp function in `lex.el'. > > (semantic-lex START END &optional DEPTH LENGTH) > > Lexically analyze text in the current buffer between START and END. > Optional argument DEPTH indicates at what level to scan over entire > lists. The last argument, LENGTH specifies that `semantic-lex' > should only return LENGTH tokens. The return value is a token stream. > Each element is a list, such of the form > (symbol start-expression . end-expression) > where SYMBOL denotes the token type. > See `semantic-lex-tokens' variable for details on token types. 
> END does not mark the end of the text scanned, only the end of the
> beginning of text scanned. Thus, if a string extends past END, the
> end of the returned token will be larger than END. To truly restrict
> scanning, use `narrow-to-region'.
> ----------
>
> So this function will parse an entire buffer and return all the
> lexical tokens for it.
>
> You can put anything you want in there, and return tokens with any
> old SYMBOL you want too.
>
> Semantic's first lexer (see semantic-flex) is an example of a
> different standalone lexer. It was only after struggling with that
> for a while that the mechanism for making custom lexers came up. It
> is partly modeled after lex/flex, where regexps are associated with
> actions.
>
> If regexps don't make sense for your language, then rolling your own
> is no problem. You still need %token declarations in your grammar to
> let the grammar know what is going on, though. You just don't need to
> specify all the regexps along the way, or use the automatically
> generated lexers.
>
> Your big scary regexp below is not really necessary for writing
> lexers using the lexing technique I described last time, though. Each
> expression only needs to be as long as the small piece you are
> looking at (i.e., a number or symbol). If you find that your language
> has a lexical token whose type is based on previous lexical tokens,
> then you are right that the built-in lexer is probably insufficient.
>
> The C parser has examples that parse out things like:
>
>   #include <foo.h>
>
> and
>
>   #if SOMESYMBOL
>   #endif
>
> that way, but it gets pretty hairy.
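Eric's suggestion to roll your own standalone lexer can be sketched as follows. This is a hypothetical, untested sketch: the function name `fmt-lex-buffer` and the token symbols `DIRECTIVE` and `filler` are invented for illustration, assuming only the `semantic-lex` calling convention quoted above.

```elisp
;; Hypothetical sketch of a hand-rolled standalone lexer, following the
;; semantic-lex calling convention: scan between START and END and
;; return a stream of (SYMBOL START . END) tokens.
(require 'semantic/lex)

(defun fmt-lex-buffer (start end &optional depth length)
  "Lexically analyze the region START..END and return a token stream.
DEPTH is ignored.  LENGTH, if non-nil, caps the number of tokens."
  (save-excursion
    (goto-char start)
    (let ((tokens nil))
      (while (and (< (point) end)
                  (or (null length) (< (length tokens) length)))
        (cond
         ;; A tilde plus one character is (crudely) a directive.
         ((looking-at "~.")
          (push (cons 'DIRECTIVE (cons (match-beginning 0) (match-end 0)))
                tokens)
          (goto-char (match-end 0)))
         ;; Everything up to the next tilde is filler text.
         ((looking-at "[^~]+")
          (push (cons 'filler (cons (match-beginning 0) (match-end 0)))
                tokens)
          (goto-char (match-end 0)))
         ;; Never loop forever on a lone trailing tilde.
         (t (forward-char 1))))
      (nreverse tokens))))

;; Install it for the mode's buffers:
;; (setq-local semantic-lex-analyzer #'fmt-lex-buffer)
```

Each token symbol used here would still need a matching %token declaration in the grammar, as Eric notes above.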
>
> Eric
>
> On 08/16/2014 06:52 PM, Left Right wrote:
>> Just to give you a sense of what I /don't/ want to have in my code
>> (below is my own code, so I'm allowed to say that it's an
>> unmaintainable cuneiform):
>>
>> (defvar fmt-font-lock-keywords
>>   ;; no-args
>>   `(("~\\(@:?\\|:@?\\)?[]>()}aswvcp;_]"
>>      (0 font-lock-keyword-face))
>>     ;; numeric-arg
>>     ("~\\([0-9]*\\|#,?\\)\\(@:?\\|:@?\\)?[i*%&|~{[]"
>>      (0 font-lock-keyword-face))
>>     ;; decimal
>>     ("~\\([0-9]*\\|#\\(,[0-9]*\\|#\\)\\{0,3\\}\\)?\\(@:?\\|:@?\\)?[rdbox]"
>>      (0 font-lock-keyword-face))
>>     ;; floating-point f
>>     (,(concat
>>        "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{3\\}\\(,'\\w\\)\\{1,2\\}\\)\\|"
>>        "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)\\)"
>>        "?\\(@:?\\|:@?\\)?f")
>>      (0 font-lock-keyword-face))
>>     ;; floating-point e, g
>>     (,(concat
>>        "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{3\\}\\(,'\\w\\)\\{1,3\\}\\)\\|"
>>        "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)\\)"
>>        "?\\(@:?\\|:@?\\)?[eg]")
>>      (0 font-lock-keyword-face))
>>     ;; currency
>>     (,(concat
>>        "~\\(\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{2\\}\\(,'\\w\\)\\)\\|"
>>        "\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,2\\}\\)\\)"
>>        "?\\(@:?\\|:@?\\)?[$]")
>>      (0 font-lock-keyword-face))
>>     ;; tabulation
>>     ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)?\\)?\\(@:?\\|:@?\\)?t"
>>      (0 font-lock-keyword-face))
>>     ;; escape
>>     ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,2\\}\\)?\\(@:?\\|:@?\\)?^"
>>      (0 font-lock-keyword-face))
>>     ;; logical block
>>     ("~\\(\\([0-9]*\\|#\\)\\(,\\([0-9]*\\|#\\)\\)\\{0,3\\}\\)?\\(@:?\\|:@?\\)?<"
>>      (0 font-lock-keyword-face))
>>     ;; custom function
>>     (,(concat
>>        "~\\(\\([0-9]+\\|'\\w\\|#\\)\\(,\\([0-9]+\\|'\\w\\|#\\)+\\)*\\)?"
>>        "\\(@:?\\|:@?\\)?\\/[^\\s\\n,#@]+\\/")
>>      (0 font-lock-keyword-face))))
>>
>> This is my previous version of font-lock coloring.
>> I don't expect you to read through it, but just to make the point
>> even more obvious: this is actually a single regular expression,
>> which I chopped into pieces for "ease" of use. A lexer based on
>> regexps would need this mess concatenated into a single expression.
>> Maybe it can be simplified, but not by much. The corresponding
>> parsing function, which doesn't use regular expressions, would be
>> somewhere between 1/3 and 1/2 of the above code, and it would be
>> perfectly understandable. This case is similar to email parsing: you
>> can do it with a regular grammar in principle, but there is no good
>> way to do it in practice.
>>
>> I later found that I can provide a function to font-lock to replace
>> this mess, and I would be happy if there were a way to do the same
>> in place of the Semantic lexer.
>>
>> Best,
>>
>> Oleg
>>
>> On Sun, Aug 17, 2014 at 1:12 AM, Left Right <ole...@gm...> wrote:
>>> Hi, and thanks for the thorough replies. I think I will need to go
>>> through them again, but so far I have one question, which may spare
>>> me trying to internalize all of this material.
>>>
>>> My last question w.r.t. define-lex-regex wasn't about how I could
>>> make a regular-expression-based lexer. The truth is: I don't want
>>> any regular expressions there; they are just not cut out for the
>>> task. So, let me rephrase it: can I ditch the whole lexer mechanism
>>> and replace it with brand-new code that handles the tokenization?
>>> What would I need to do to achieve this? I don't have much
>>> experience with writing lexers; in fact, I have only ever used
>>> cl-yacc, which simply leaves it to the programmer to implement a
>>> lexer and only requires a very simple interface: a function that
>>> accepts an input stream and returns a token.
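The interface asked about above — a plain function instead of a regexp — maps onto Semantic's `define-lex-analyzer`, whose match condition is an arbitrary Lisp form evaluated at point. A hedged sketch (the analyzer name `fmt-lex-filler-fn` and its scanning logic are invented for illustration, not code from this thread):

```elisp
(require 'semantic/lex)

;; Function-based analyzer: the third argument is a CONDITION form,
;; not a regexp, so the match can be computed with ordinary Lisp code.
(define-lex-analyzer fmt-lex-filler-fn
  "Match a run of filler text (everything up to the next tilde)."
  ;; CONDITION: point is not at end of buffer and not at a tilde.
  (and (not (eobp)) (not (eq (char-after) ?~)))
  ;; FORMS: scan forward by hand, push the token, and tell the master
  ;; lexer where the token ends.
  (let ((start (point)))
    (skip-chars-forward "^~")
    (semantic-lex-push-token
     (semantic-lex-token 'filler start (point)))
    (setq semantic-lex-end-point (point))))
```

Such an analyzer slots into a `define-lex` form exactly like the regexp-based ones.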
>>>
>>> The reason I'm asking: the tokenizer mechanism looks very complex
>>> to me, way more complex than what I need. Besides, it works in a
>>> very inconvenient way: if I could keep state between calls to the
>>> lexer, it would make my life so much easier. I also don't want to
>>> depend on the syntax table and whatever bizarre rules Emacs uses to
>>> understand syntax. This is not your fault, but I have the
>>> impression that many of these rules are there purely by accident;
>>> they are hard to discover and even harder to understand, because
>>> they don't match anything you might come to expect from a
>>> lexer/parser. It would be a whole lot easier to start fresh than to
>>> try to glue together a bunch of jigsaw puzzle pieces clearly taken
>>> from different puzzles.
>>>
>>> I can understand the motivation of someone who gets the syntax
>>> table and mode coloring for free from an existing mode and wants to
>>> reuse them to build the lexer. I don't have these preconditions,
>>> and even the little that I do have, I've written myself, and I'd
>>> rather give it up to make the overall process more consistent. That
>>> is, I don't want to have to design the font-lock rules, the syntax
>>> table, and the lexer separately: to me this would be like doing the
>>> same work twice, but both times using inappropriate tools.
>>>
>>> Best,
>>>
>>> Oleg
>>>
>>> On Sun, Aug 10, 2014 at 5:41 PM, Eric M. Ludlam <er...@si...> wrote:
>>>> On 08/02/2014 05:28 PM, Left Right wrote:
>>>>> One more question. I'm trying to follow the inline code
>>>>> documentation, and here's something I came up with, but I have
>>>>> lots of questions about it:
>>>>>
>>>>> (define-lex-regex-analyzer fmt-lex-filler
>>>>>   "Matches the filler in the format string."
>>>>>   "[^~]+"
>>>>>   (semantic-lex-push-token
>>>>>    (semantic-lex-token
>>>>>     'filler (match-beginning 0) (match-end 0))))
>>>>>
>>>>> (define-lex wisent-fmt-lexer
>>>>>   "Lexical analyzer that handles Common Lisp format."
>>>>>   fmt-lex-filler)
>>>>>
>>>>> 1. Using a regular expression in this analyzer is a really,
>>>>> really bad idea (the proper regexp is more than 300 characters
>>>>> long; this one is here just for illustration), but this
>>>>> complexity could easily be avoided if, instead of a regular
>>>>> expression, I could use a function that takes, say, a position in
>>>>> the buffer or something like that: is that even possible?
>>>>>
>>>>> 2. 'filler isn't a default kind of token. Is my guess correct
>>>>> that I can somehow refer to this kind in the grammar, similar to
>>>>> how %type <symbol> is defined, maybe? What would I need to do to
>>>>> make this possible?
>>>>
>>>> There is a default whitespace token you can create from your
>>>> lexers. For example, the dot lexer starts with these:
>>>>
>>>>   semantic-lex-ignore-whitespace
>>>>   semantic-lex-ignore-newline
>>>>   semantic-lex-ignore-comments
>>>>
>>>> the first of which is implemented like this:
>>>>
>>>>   (define-lex-regex-analyzer semantic-lex-ignore-whitespace
>>>>     "Detect and skip over whitespace tokens."
>>>>     ;; catch whitespace when needed
>>>>     "\\s-+"
>>>>     ;; Skip over the detected whitespace; do not create a token for it.
>>>>     (setq semantic-lex-end-point (match-end 0)))
>>>>
>>>> which means "go to the end of the match, and don't return a
>>>> token." As in your lexer, you have to push the 'filler token to
>>>> get it onto the stack.
>>>>
>>>> The reason you have to set the end point is that when you push a
>>>> token, the lexer looks at the end of your token and moves there
>>>> automatically, but if you don't push a token, you have to move it
>>>> by hand.
>>>>
>>>> Lexical analyzers are interesting in that while a function is made
>>>> for them, those functions aren't used.
>>>> Instead, they also have a value, and those values are
>>>> concatenated together to create the master lexer function, like a
>>>> big cond statement. The main lexer has logic it applies after each
>>>> match is found, and that is where a bunch of the magic happens.
>>>>
>>>> If you aren't trying to ignore your 'filler tokens, you will
>>>> instead need a %token declaration for it, such as:
>>>>
>>>>   %token filler
>>>>
>>>> If you instead had
>>>>
>>>>   %type <filler> syntax "[^~]+"
>>>>
>>>> you wouldn't need to write your lexical analyzer at all, and one
>>>> would be provided for you. (I think; I'm a little fuzzy on that
>>>> one.)
>>>>
>>>> Your filler lexer is OK if it is something you really need, but
>>>> because it can match so much, you MUST put it at the END of your
>>>> defined lexer. That way you will be able to match all your other
>>>> expressions, and if nothing works, you call it filler.
>>>>
>>>> Use:
>>>>
>>>>   M-x semantic-lex-test RET
>>>>
>>>> to see how it works, or
>>>>
>>>>   M-x semantic-lex-debug RET
>>>>
>>>> to watch your lexer run.
>>>>
>>>> Good luck,
>>>> Eric
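Putting the pieces of advice in this message together, the lexer from earlier in the thread might be revised as follows. This is a hypothetical sketch: `fmt-lex-directive` and its crude "~." regexp are invented here, and each pushed token symbol would still need a %token declaration like the `%token filler` shown above.

```elisp
(require 'semantic/lex)

;; Directive analyzer: a tilde plus one character.  Crude, but specific
;; enough to be tried before the catch-all filler analyzer.
(define-lex-regex-analyzer fmt-lex-directive
  "Match a Common Lisp format directive (a tilde plus one character)."
  "~."
  (semantic-lex-push-token
   (semantic-lex-token 'DIRECTIVE (match-beginning 0) (match-end 0))))

;; Catch-all filler analyzer, as posted earlier in the thread.
(define-lex-regex-analyzer fmt-lex-filler
  "Match the filler text between directives."
  "[^~]+"
  (semantic-lex-push-token
   (semantic-lex-token 'filler (match-beginning 0) (match-end 0))))

;; Analyzers are tried in order, so per the advice above the catch-all
;; filler comes last.
(define-lex wisent-fmt-lexer
  "Lexical analyzer that handles Common Lisp format strings."
  fmt-lex-directive
  fmt-lex-filler
  semantic-lex-default-action)
```

Running M-x semantic-lex-test in an fmt-mode buffer would then show the token stream this lexer produces.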