Thread: [CEDET-devel] What I am doing ;-)
From: David P. <dav...@wa...> - 2003-06-20 14:42:41
Attachments:
wisent-c.tar.gz
Hi All,

It seems that our hacking of the cedet code is very slight these days. I suppose we are all particularly busy with other tasks ;-)

However, I checked in changes to fix some indentation problems in semantic-grammar-mode, and to autoload the semanticdb top-level search routines (that fixes errors in senator completion when semanticdb mode is enabled). Here is the change log:

2003-06-20  David Ponce  <da...@dp...>

	* semantic/semanticdb-find.el (semanticdb-find-tags-by-name)
	(semanticdb-find-tags-by-name-regexp)
	(semanticdb-find-tags-for-completion)
	(semanticdb-deep-find-tags-by-name)
	(semanticdb-deep-find-tags-by-name-regexp)
	(semanticdb-deep-find-tags-for-completion)
	(semanticdb-find-tags-external-children-of-type): Add autoload
	cookies.

	* semantic/semantic-grammar.el
	(semantic-grammar-goto-grammar-indent-anchor): In certain cases
	`forward-sexp' ignores important punctuation.  Use
	`skip-syntax-backward' instead to skip punctuation.

2003-06-19  David Ponce  <da...@dp...>

	* semantic/semantic-grammar.el (semantic-grammar-mode): Set the
	buffer-local value of `parse-sexp-ignore-comments' to non-nil, to
	fix indentation problems when there are unbalanced parentheses in
	comments.
	(semantic-grammar-goto-grammar-indent-anchor): Consider %prec like
	any other symbol in a rule.
	(semantic-grammar-grammar-compute-indentation): Don't consider
	%prec like the other percent keywords that are aligned at the
	beginning of a line.

Also, I started to work on an LALR grammar for C, and I must admit that the task is not easy. I read a lot of threads on that subject and discovered that the C grammar contains some nasty ambiguities that make it difficult to express as LALR :-(

The main problem I encountered is that C identifiers can be interpreted as typedef names or as ordinary identifiers depending on context. A simple example will show what I mean:

typedef struct {int x; int y;} point;
point point;

The first occurrence of `point' is a typedef name, and the second occurrence an ordinary identifier!
In nearly all the implementations I studied, such ambiguities are resolved by the lexer, which returns an IDENTIFIER as a TYPEDEF_NAME terminal when that IDENTIFIER has previously been declared as a typedef. That requires maintaining a table of declared C symbols that takes the scope of declarations into account.

Unfortunately, that can only work if all preprocessor statements have previously been expanded by a first preprocessing pass. This is not a problem for normal compilation, but it is an issue for Semantic, which just parses the source as it is to obtain declaration tags. And how would it be possible to do incremental re-parsing?

Attached you will find a tarball of what I managed to do:

- A C-tags.wy grammar hacked from the LALR grammar supplied to the
  community by James A. Roskind and available at
  <http://www.empathy.com/pccts/roskind.html>.

- A quickly hacked wisent-c.el that makes it possible to try the
  grammar.

For now I have had only very limited success, parsing the code in:

- test.c, a very basic example.

I also had to add a hook to `wisent-parse' to be able to do some context initialization before starting the parser. See:

- wisent.el.patch; perhaps it would make more sense to have that hook
  called from `semantic-parse-stream'?

The parser fails in many cases, particularly when it encounters a declaration like:

EMACS_INT undo_limit;

and there is no typedef for EMACS_INT, or the typedef is [probably] in an included header.

After doing all that, I suspect that using a true C grammar is probably not the right direction for Semantic, and that we will have to hack a Semantic-specific LALR grammar from scratch :(

Perhaps using wisent to parse C (things are even worse with C++!) is the wrong choice?

Any thoughts, help, or improvements to my work will be very welcome ;-)

Sincerely,
David
From: Eric M. L. <er...@si...> - 2003-06-20 15:46:49
Hi,

I too have been busy. As my children get older, their bedtime gets later, which means less time for Emacs :(

I checked in a new file `semantic-sort' that pulls more entries out of semantic-util and renames `token' to `tag'. There is not much left in semantic-util, which means we are very close to done with the token->tag conversion. That could provide an opportunity for a beta.

>>> David PONCE <dav...@wa...> seems to think that:
>Hi All,
>
>It seems that our hacking of cedet code is very slight these days.
>I suppose we are all particularly busy at other tasks ;-)
>
>However, I checked in changes to fix some indentation problems in
>semantic-grammar-mode, and to autoload the semanticdb top-level
>search routines (that fixes errors in senator completion when
>semanticdb mode is enabled).

Thanks!

[ ... ]

>Also I started to work on an LALR grammar for C, and I must admit that
>the task is not easy. I read a lot of threads about that subject and
>discovered that the C grammar contains some nasty ambiguities that
>make it difficult to express as LALR :-(

That's great! (That a wisent C parser has been started.) See way down below for more:

>The main problem I encountered is that C identifiers can be
>interpreted as typedef names or ordinary identifiers depending on
>context.
>
>A simple example will show what I mean:
>
>typedef struct {int x; int y;} point;
>point point;
>
>The first occurrence of `point' is a typedef name, and the second
>occurrence an ordinary identifier!
>
>In nearly all implementations I studied, such ambiguities are resolved
>by the lexer, which returns an IDENTIFIER as a TYPEDEF_NAME terminal
>when that IDENTIFIER has previously been declared as a typedef.
>That requires maintaining a table of declared C symbols that takes
>the scope of declarations into account.
>
>Unfortunately, that could work only if all preprocessor statements
>have previously been expanded by a first preprocessing pass. This is
>not a problem for normal compilation.
>But this is an issue for
>Semantic, which just parses the source as it is to obtain declaration
>tags.
>
>And how would it be possible to do incremental re-parsing?
>
>Attached you will find a tarball of what I managed to do:
>
>- A C-tags.wy grammar hacked from the LALR grammar supplied to the
>  community by James A. Roskind and available at
>  <http://www.empathy.com/pccts/roskind.html>.
>
>- A quickly hacked wisent-c.el that makes it possible to try the
>  grammar.
>
>For now I have had only very limited success, parsing the code in:
>
>- test.c, a very basic example.
>
>I also had to add a hook to `wisent-parse' to be able to do some
>context initialization before starting the parser. See:
>
>- wisent.el.patch; perhaps it would make more sense to have that hook
>  called from `semantic-parse-stream'?
>
>The parser fails in many cases, particularly when it encounters a
>declaration like:
>
>EMACS_INT undo_limit;
>
>and there is no typedef for EMACS_INT, or the typedef is [probably]
>in an included header.
>
>After doing all that, I suspect that using a true C grammar is
>probably not the right direction for Semantic, and that we will have
>to hack a specific LALR grammar from scratch :(
>
>Perhaps using wisent to parse C (things are even worse with C++!) is
>the wrong choice?

[ ... ]

You have many good observations. When it comes to creating a simple tagging parser (as we did with c.by), many aspects of the language that pose problems have little comments that say "There could be an error, but assume it is right," meaning that something is syntactically correct but could be an error based on some wider context.

The biggest problem is custom macros like:

#define MYFUNC(name) int name (fancy arg, list here)

MYFUNC(blah)
{
  code;
}

which we currently ignore completely. Fortunately, C++'s complexity results in a reduction of such goofy macros. Anyway, one thought for overcoming those macros is to have a preprocessing step in Semantic.
It might call an actual preprocessor, and the semantic parser would track line-number pragmas. That would be a pain to do, though.

This leads me to the long-term plans for semantic. A goal of mine is for the user to leave the cursor sitting at some location, and have Emacs offer all possible completions, or documentation, based on the context, meaning it would search only the tag tables specified by include statements. This capability would enable (in effect) pre-compiled headers, or header parsing state which could (as you suggest) feed into the active lexical keyword table, or symbol->keyword translation. Unfortunately this is probably a ways off, and is related somewhat to the semanticdb-find API I had started.

For the short term, I suspect the best tactic is to simply let your parser accept all code that "might" be correct. c.by has a rule (paraphrased) like this:

variable : typesimple name opt-defaultvalue ;

typesimple : built-in-type
           | struct-or-union
           | symbol
           ;

which simply assumes that, in the right place (where a variable might appear), the programmer knows what they're doing and the symbol is probably a valid type. When real type analysis is available, said symbol can be examined; if Semantic identifies a problem, it should still accept the code but underline the symbol to indicate the problem. Such behavior, even in the presence of known-invalid code, will make the parser more robust and provide better, more useful diagnostics while editing.

Thanks!
Eric

--
Eric Ludlam:  za...@gn...,  er...@si...
Home:  http://www.ludlam.net    Siege:  www.siege-engine.com
Emacs: http://cedet.sourceforge.net    GNU:  www.gnu.org