Re[1]: [CEDET-devel] Incremental parser behavior

Hi David,

  I certainly see how the behavior you are describing could be a
problem.  Unfortunately, I fear the fixes you propose in 1 and 2 are
problematic.

  The reparse-symbol, as used in semantic 1.4, is for tags found inside
other tokens.  I expanded on the original use by adding smarts for
splicing new tags in and out of the master cache, within the
child-list of some parent token.  Removing incremental parsing for
child tokens would make the incremental parser nearly useless for
Java, where 90% of the file is taken up by one class.

  More below.

>>> David Ponce <da...@dp...> seems to think that:
>Hi Eric,
>
>While I was hacking WY grammars, I got some problems with the
>incremental parser when adding new tokens between existing ones.
>
>That is when `semantic-edits-change-between-tokens' returns
>something.  Here is a summary:
>
>1. The `reparse-symbol' property can't be retrieved.  After
>    `semantic-edits-change-between-tokens' has returned a value, the
>    variable `tokens' is set to nil.  So the following statement that
>    retrieves the reparse-symbol always fails:
>
>             (setq reparse-symbol (semantic-token-get
>                                   (car tokens) 'reparse-symbol))
>
>    As, in that case, "The CAR of cache-list is the token just before
>    our change, but wasn't modified.", a solution could be to first try
>    to get reparse-symbol from tokens, then from cache-list, like this:
>
>             (setq reparse-symbol
>                   (semantic-token-get (car (or tokens cache-list))
>                                       'reparse-symbol))
>
>    I tried the above change and it seems to work better.

I think having TOKENS be null at this point is a curious problem
caused when editing white-space.  At the same time, if that
white-space is inside some parent, we need to know at what symbol to
start parsing again.  Fortunately, I think your insight in using the
cache list is a good idea.  Those tokens belong to the same lineage
as the white space edited.  If cache-list is nil, we should probably
then go and just mark the parent as the dirty item, or force a full
reparse.  (Not too expensive if there are no tokens in the file. ;)
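
Roughly, the fallback order I am describing could look like the sketch
below.  This is only an illustration: the function name and the
`full-reparse' marker are made up, and tokens are modeled as plain
(NAME . PLIST) conses rather than real semantic tokens.

```elisp
;; Hypothetical helper, not part of semantic.  Tokens are modeled as
;; (NAME . PLIST) conses so the sketch runs without semantic loaded.
(defun my-pick-reparse-symbol (tokens cache-list)
  "Return the reparse symbol to use for a change, or `full-reparse'.
TOKENS are the tokens touched by the change (nil for a change between
tokens).  CACHE-LIST starts at the token just before the change."
  (let ((anchor (car (or tokens cache-list))))
    ;; Prefer a symbol stored on the anchor token.  With no anchor, or
    ;; no symbol on it, give up and ask for a full reparse (marking the
    ;; parent dirty would be the other option).
    (or (and anchor (plist-get (cdr anchor) 'reparse-symbol))
        'full-reparse)))

;; A change between tokens: TOKENS is nil, so CACHE-LIST supplies the symbol.
(my-pick-reparse-symbol nil '(("STRING" reparse-symbol rule)))
```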

>2. The inserted text is parsed using the grammar rule pointed to by
>    the reparse-symbol found in the token just before the new text.
>    Sometimes, that rule does not preserve the right semantics of the
>    inserted text!  Here is an example with a WY grammar:
>
>    1. Initial state.  The following text results in one 'nonterminal
>       token, any_value, which contains five children (any_symbol,
>       STRING, NUMBER, PREFIX-EXP, PAREN_BLOCK) as 'rule
>       tokens.
>
>       any_value:
>           any_symbol
>         | STRING
>         | NUMBER
>         | PREFIX-EXP
>         | PAREN_BLOCK
>         ;
>
>    2. Now I insert a new TEST rule between STRING and NUMBER, like
>       this (the change is enclosed in [...]):
>
>       any_value:
>           any_symbol
>         | STRING[
>         | TEST]
>         | NUMBER
>         | PREFIX-EXP
>         | PAREN_BLOCK
>         ;
>
>       In that case `semantic-edits-change-between-tokens' returns the
>       'rule tokens from STRING to PAREN_BLOCK.  The reparse-symbol
>       `rule' is correctly retrieved from the STRING token (car
>       cache-list).  The parser successfully re-parses the inserted text
>       "\n | TEST" using the `rule' semantic.  But returns a false
>       result, that is an 'empty rule token for the first `|', followed
>       by a 'rule token for TEST :-(
>
>    In fact without other context the parser can't determine if a rule
>    like the above is actually an empty rule plus a normal rule, or
>    just the latter.  A short example:
>
>       any_value:
>         | TEST
>
>       any_value:
>           STRING
>         | TEST
>
>    In the first case "| TEST" means "empty or TEST"; in the second it
>    means "or TEST".  In other words, the meaning of the new text
>    depends on its position inside the nonterminal definition (inside
>    the parent 'nonterminal token)!
>
>    That demonstrates a conflict between the semantics of
>    reparse-symbol and the way it is currently used by the incremental
>    parser to parse a change between tokens.
>
>    IMO, the reparse-symbol rule is safe to use only when re-parsing a
>    whole token, or a new token out of context, that is inserted
>    between existing tokens at top level.
>
>    When new text is inserted between existing tokens which are part of
>    a parent token, the only safe way to re-parse things is to re-parse
>    the whole parent token.  It will ensure that the semantic of
>    inserted text will be correct.
  [ ... ]

An important difference between your analysis and mine is that I
consider it OK to reparse tokens that have a parent, but only if
those tokens were generated using `semantic-repeat-parse-whole-stream',
as opposed to a recursive rule in a wisent grammar.

I think you even cover in your wisent manual the benefits of using
wisent style repetitive rules for .wy rules as opposed to the
semantic version.  A side effect seems to be that it breaks the
incremental parser.  If you can identify this specific scenario in
your patch, I think it would be ok.
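
To make the scenario concrete, here is a small sketch of the decision.
It assumes each token records how it was produced, via a made-up
`from-repeat-parse' property.  That property and the function name are
hypothetical, and tokens are again plain (NAME . PLIST) conses.

```elisp
;; Hypothetical helper, not part of semantic.  The `from-repeat-parse'
;; property is an assumption made only for this illustration.
(defun my-reparse-target (token parent)
  "Return the token to reparse for a change next to TOKEN inside PARENT.
PARENT is nil for a top level token."
  (cond
   ;; Top level: reparsing around the token itself is safe.
   ((null parent) token)
   ;; Child produced by iterating over the stream: each child was parsed
   ;; on its own, so the new text can be parsed the same way.
   ((plist-get (cdr token) 'from-repeat-parse) token)
   ;; Child produced by a recursive grammar rule: its meaning depends on
   ;; its position in the parent, so reparse the whole parent.
   (t parent)))
```

With this shape, the Java case (members iterated inside one class)
keeps its incremental reparse, while the WY `rule' tokens fall back to
reparsing the enclosing nonterminal.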

Have fun
Eric

-- 
          Eric Ludlam:                 za...@gn..., er...@si...
   Home: http://www.ludlam.net            Siege: www.siege-engine.com
Emacs: http://cedet.sourceforge.net               GNU: www.gnu.org