Just noticed that this wasn't sent to the list...  Gmail's reply button is really bugging me. :-\

---------- Forwarded message ----------
From: Mihai Călin Bazon <mihai.bazon@gmail.com>
Date: 2011/3/24
Subject: Re: [cedet-semantic] JavaScript support
To: "Eric M. Ludlam" <eric@siege-engine.com>


Hi Eric,

Thanks for your reply, and sorry for my late reaction -- quite busy during
the week.  I'd like to resume this; could you help me figure out what's
broken in the following code?  I tried to reduce it to a minimal test case
about getting PAREN_BLOCK to work.  I'm starting with a fresh Emacs, load
the file below, C-c C-c, M-x eval-buffer (to evaluate the generated grammar)
then M-x semantic-load-enable-semantic-debugging-helpers and I load a JS
file that contains (foo, bar, baz).  My expectation would be that it parses
and semantic-fetch-tags would return a function tag, but it returns nil
instead and everything except whitespace is underlined in red, which
suggests a parsing error.

Cheers,
-Mihai

;;;; WY

%package wisent-ecmascript

%languagemode javascript-mode js-mode

%start program
%start statement
%start namelist

;;; --- punctuation

%type <punctuation>

%token <punctuation> SEMICOLON ";"
%token <punctuation> COMMA ","

;;; --- blocks

%type <block> ;;syntax "\\s(\\|\\s)" matchdatatype block

%token <block> PAREN_BLOCK "(LPAREN RPAREN)"
%token <block> BRACE_BLOCK "(LBRACE RBRACE)"
%token <block> BRACK_BLOCK "(LBRACK RBRACK)"

%token <open-paren>  LPAREN "("
%token <close-paren> RPAREN ")"
%token <open-paren>  LBRACE "{"
%token <close-paren> RBRACE "}"
%token <open-paren>  LBRACK "["
%token <close-paren> RBRACK "]"

;;; --- symbols

%type <symbol>
%token <symbol> NAME

%%

program
  : statement
  ;

statement
  : PAREN_BLOCK
    (FUNCTION-TAG "test" nil (EXPANDFULL $1 namelist))
  ;

namelist
  : LPAREN
    ()
  | RPAREN
    ()
  | NAME
    (VARIABLE-TAG $1 nil nil)
  | COMMA
    ()
  ;

%%

(require 'semantic-java)
(require 'semantic-wisent)

(define-lex ecmascript-lexer
  ""
  semantic-lex-ignore-whitespace
  semantic-lex-ignore-newline
  semantic-lex-ignore-comments

  wisent-ecmascript--<symbol>-regexp-analyzer
  wisent-ecmascript--<punctuation>-string-analyzer
  wisent-ecmascript--<block>-block-analyzer

  semantic-lex-default-action)

(defun wisent-ecmascript-setup-parser ()
  (wisent-ecmascript--install-parser)
  (setq semantic-lex-analyzer 'ecmascript-lexer
        semantic-lex-number-expression semantic-java-number-regexp
        semantic-lex-depth nil
        semantic-command-separation-character ";"))

(add-hook 'js-mode-hook 'wisent-ecmascript-setup-parser)
(add-hook 'javascript-mode-hook 'wisent-ecmascript-setup-parser)
(add-hook 'ecmascript-mode-hook 'wisent-ecmascript-setup-parser)

2011/3/21 Eric M. Ludlam <eric@siege-engine.com>

On 03/21/2011 04:29 AM, Mihai Călin Bazon wrote:
Hi folks,

I've spent my weekend with CEDET and must say it's amazing; if only I
understood it better. :-) My goal was to add proper support for JavaScript
(sorry but the existing parser doesn't cut it for real world code).  I've
started it from scratch, to better understand how to write parsers, but I
didn't get far.

It's always nice to have some new folks trying things out.


The Semantic/Wisent manuals are quite good, yet I've had trouble getting
started and doing simple things.  I think a step-by-step HOWTO on adding
support for a simple language (with nested structures) would be very
welcome!

That's a good idea.  There are some skeleton files around, but I don't think they go too deep into anything like that.


So anyway, I'm attaching my (highly incomplete) work so far and hope for
some advice on how to continue.  Questions:

I will attempt to answer given the brief amount of time I have this AM.


- I don't seem to be able to parse more than one statement.  Presumably
  because the return value of the `statement' rule is wrong.  Generally I
  couldn't figure out how to return the proper values.

Each nonterminal you define with a %start pragma should return 1 production.  The entire grammar you create is called iteratively, and the automatic value passing of the wisent parser generator framework is set up for this.  The iterative nature makes error recovery very simple. The grammar just "fails", and the upper level iterative parser skips over the bad syntax and moves on.

Thus, it returns only one thing because that is all it can do.  If you change statlist to only return statement, then after it finds the first statement, it should get called a second time, and return the second statement, and the parser framework will keep track of it all for you.
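
As a rough sketch of that "one statement per call" shape (not taken from the attached grammar): the rule below assumes the NAME and SEMICOLON tokens from the test case above, and the "NAME SEMICOLON" statement form is invented purely for illustration.

```
;; Hypothetical rules section: there is no recursive `statlist' rule.
;; The start rule matches exactly one statement and returns one tag;
;; the iterative parser is expected to call it again for the next one.

program
  : statement
  ;

statement
  : NAME SEMICOLON
    (VARIABLE-TAG $1 nil nil)
  ;
```

With rules shaped like this, a buffer containing several such statements should yield one tag per statement, each from a separate call into the grammar.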


- I tried to use PAREN_BLOCK and the `iterative style' to parse variable
  declarations, did that exactly as in other existing parsers and as
  documented, but it wouldn't work...  I know "it doesn't work" is not good
  information, but that's all I can say, the blocks simply didn't parse.
  So I switched to the recursive style and collecting (EXPANDTAG (VARIABLE-TAG
  ...)).  (btw, perhaps something similar should go in `statlist'?).

EXPANDFULL will use the same iterative nature as I describe above inside a parent block.  The nonterminal symbol passed to EXPANDFULL should have rules for "(", ")", and some variable declaration.

If you use EXPANDTAG, you need to create your own rule that parses ( varlist ) and the varlist will need to cons all the found variables together itself.  It is much easier to use EXPANDFULL, as it handles bad syntax easily.


- That seems to parse: a function declaration with an argument list:

  function foo(a, b) {
  }

  (semantic-fetch-tags) returns the function tag and the variables are
  there.  However, if I complicate that a bit:

  function foo(a, b) {
    function bar() {
    }
  }

  only the outer function is returned.  Inner functions are ubiquitous in
  JS and they need to be parsed correctly to provide useful functionality
  (BTW, the existing JS parser distributed with Wisent fails here too).

The semantic lexer skips over { } and ( ) blocks and does not go into them unless a rule action explicitly calls EXPANDTAG or EXPANDFULL on the value returned from the PAREN_BLOCK.

In your nonterminal for a function, the BRACE_BLOCK part of the rule will need to be passed into EXPANDFULL, which will iteratively parse your function body looking for more functions.  Code will show up as bad syntax unless you write rules for all that too.
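
A minimal sketch of what that could look like, under several assumptions: the FUNCTION keyword declaration, the rule names function_declaration and function_body_item, and the :members attribute are all made up for illustration, and the lexer would also need a keyword analyzer for FUNCTION to be recognized. The namelist rule is the one from the test case above.

```
;; Hypothetical declarations (not in the attached grammar).
%type <keyword>
%keyword FUNCTION "function"

;; Every nonterminal handed to EXPANDFULL needs its own %start.
%start function_body_item

%%

function_declaration
  : FUNCTION NAME PAREN_BLOCK BRACE_BLOCK
    (FUNCTION-TAG $2 nil
                  (EXPANDFULL $3 namelist)
                  ;; :members is an assumed attribute name for the
                  ;; nested tags found inside the body.
                  :members (EXPANDFULL $4 function_body_item))
  ;

function_body_item
  : LBRACE
    ()
  | RBRACE
    ()
  | function_declaration
  ;
```

Anything in the body that is not a nested function would still be flagged as bad syntax with only these rules, and since the nested tags are stored outside the argument list, they would probably only be visible to tools once semantic-tag-components is overloaded.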


- what is EXPANDTAG? and is it related to the value of
  semantic-tag-expand-function? (I just copied the expander from the
  existing JS mode for now, but I'd like to understand why it is useful,
  what argument it receives and what it should return).  Didn't put too
  much time into this yet, but from the docs I'm not clear.

EXPANDTAG and EXPANDFULL let you look inside some _BLOCK with a new nonterminal start.  For each rule you pass to EXPAND* you need to add a %start pragma.  The output of EXPAND* will be (presumably) some tag or tag list.  EXPANDFULL will return a tag list and handle expanding and the data needed for "cooking" the tags so they are bound into the buffer with overlays.

In your support file, you will need to write an overload of semantic-tag-components if you do anything besides function arguments or type members.  For function args and type members, you just need to put the tag lists into the correct tag attributes.


- generally, how do you debug a grammar?

You can debug a rule in wisent, but not how the wisent grammar parses. The grammar debugger was never ported to wisent. :(


* * *

I have in mind a few things for now:

- be able to detect the local variables around the cursor.  For example if I
  place the cursor on a variable, it should highlight the occurrences of
  that name in the enclosing scope.  I already did something like this for
  js2-mode [1], but I'd like to get rid of that setup.

If you visit http://cedet.sourceforge.net/addlang.shtml step 4 is about context parsing.


- having done the above, it should be easy to provide some keybindings to
  quickly move through such occurrences, and a keybinding to rename the
  variable (again, my js2-mode setup supports that).

semantic-symref output (using idutils, gnu global, or other) has features like that that may "just work" for you.


- use the knowledge from the parser to indent var properly:

  var foo = 1,
      bar = 2;

Many folks have wanted to do this, but as far as I know, no one has built a framework for it.


Then of course I know that much functionality would come for free from
existing Semantic applications.

JavaScript is quickly taking over the world (it's the most popular language
on GitHub right now) and it's a pity not to support it properly.  I have
some previous experience parsing JavaScript [2] and I have used Emacs for 12
years now; though I'm not very skilled at Elisp, I do Common Lisp at my day
job and have good knowledge of it.  I'm willing to invest the time to write
this parser for Semantic, just need some help! :-)

Thanks in advance!
-Mihai

[1]
http://mihai.bazon.net/projects/editing-javascript-with-emacs-js2-mode/js2-highlight-vars-mode
[2] https://github.com/mishoo/UglifyJS

PS: By the way, don't you guys consider switching to GitHub?  SourceForge
is... uhm... better not say it.

I've been too lazy to look into reasons to move anything.  Right now we're just trying to get a good method for keeping synchronized with Emacs.  The tactic is good, but it takes a lot of time.

Eric





--
Mihai Bazon,
http://mihai.bazon.net/blog