[Ctags-devel] New PHP parser
From: Colomban W. <lis...@he...> - 2013-04-20 21:50:23
Hi,

I've rewritten the PHP parser from scratch as a non-regex parser, so that it can be more correct and precise: it no longer gets fooled by comments, and it reports scope and implementation details.

TL;DR: check out the attached patch at https://sourceforge.net/tracker/?func=detail&aid=3611477&group_id=6556&atid=306556.

# What's wrong with the current parser?

NOTE: I haven't been able to test the PHP parser in trunk, because it doesn't seem to work at all: it's not even reported in the --list-parsers output. Any hint on what's going on would be appreciated. Anyway, I tested the parser in Geany, which has only slight modifications over the trunk ctags one.

The current parser uses regular expressions. This is simple but limited: it is barely possible to report scope for the tags, and it easily gets fooled by comments or string literals. For example, commented-out code in PHP files still generating tags led to a lot of bug reports, and was only partially fixed by checking whether the declaration was at the start of the line.

Also, although it could have been added, the current parser doesn't parse namespaces or traits. It also reports local and global variables with no distinction, and having hundreds of variable tags may be a little annoying -- and this isn't really fixable with the regex approach.

# What does the new parser do?

The parser I wrote and propose is a full character-level token-based parser [1]. It should understand every syntactic element in PHP, including entering/leaving PHP mode, comments, strings, heredocs, nowdocs, etc., or at least everything I know of or could gather from the PHP 5.4 docs. This means that the parser should never get fooled by whitespace, strings or comments.

## What does it support?
Tags are generated for most constructs:

* classes;
* constants (both "const FOO" and "define('FOO', ...)");
* functions/methods;
* interfaces;
* local variables (disabled by default);
* namespaces (both syntaxes);
* traits;
* global variables (like locals, but in the root scope) and member variables.

For the record, the old parser understood classes, constants, functions, interfaces and variables (with no distinction between local, global and member).

Also, the new parser reports the following details:

* scope (for everything);
* function/method arguments;
* visibility (private/protected/public/none) of methods and members;
* class, interface and member implementation (abstract or not).

## Non-standard behavior

Currently, this new parser accepts identifiers starting with a digit (which includes plain numbers). PHP wouldn't accept those, but it was simpler to implement and should never affect valid code. The only consequence is that some invalid PHP code could generate tags whose names start with a digit. I don't think this is an issue since the parsed code would have to be invalid anyway, but if it is a big deal I can fix it (although it would add more code and more tests for no other reason).

## Limitations

PHP supports several constructs to enter or leave PHP mode:

Entering:

* <?php (classic, most common)
* <? (short open tag)
* <?= (short echo tag)
* <script language="php"> (long version)
* <% (ASP tag)

Leaving:

* ?> (classic)
* </script> (long version)
* %> (ASP tag)

Short open tags and ASP tags are not always available with PHP and may require special settings to be enabled. The PHP docs discourage use of anything other than "<?php ?>" and "<?= ?>".

The new parser currently only supports entering with "<?" followed by anything other than "xml" (so this includes "<?php", "<?" and "<?="), and leaving with "?>". So it doesn't support the long "<script language="php">" or the ASP tags.
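
To make the entry/exit rule above concrete, here is a minimal Python sketch of it. It is only an illustration, not the parser's actual C code: the function name `php_regions` is hypothetical, matching "xml" case-sensitively is an assumption, and unlike a real tokenizer it does not protect a "?>" that appears inside a string literal.

```python
def php_regions(text):
    """Return the chunks of `text` that lie inside PHP mode.

    Simplified rule: enter on "<?" unless it is immediately followed by
    "xml", leave on "?>".  The <script language="php"> form and the ASP
    tags are deliberately not recognized.
    """
    regions = []
    pos = 0
    while True:
        start = text.find("<?", pos)
        if start == -1:
            break
        body = start + 2
        if text.startswith("xml", body):
            # "<?xml ...?>" is an XML declaration, not PHP: skip it.
            pos = body
            continue
        # Skip the optional "php" or "=" that completes the opening tag.
        if text.startswith("php", body):
            body += 3
        elif text.startswith("=", body):
            body += 1
        end = text.find("?>", body)
        if end == -1:
            # No closing tag: PHP mode lasts until the end of the file.
            regions.append(text[body:])
            break
        regions.append(text[body:end])
        pos = end + 2
    return regions


sample = (
    '<?xml version="1.0"?>\n'
    "<p><?= $title ?></p>\n"
    "<% not php %>\n"
    "<?php function f() {} ?>\n"
    "trailing <? echo 1;"
)
print(php_regions(sample))
# prints [' $title ', ' function f() {} ', ' echo 1;']
```

The XML declaration and the ASP-style chunk are left alone, while the short echo tag, the classic tag and the unterminated short open tag each yield a PHP region.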
It would be easy to add support for ASP tags; I just wasn't sure whether it would risk falsely entering PHP mode (and then producing incorrect output) in some cases, e.g. if files contain a literal "<%" or "%>" that isn't supposed to be interpreted as PHP. Also, ASP tags are rarely used with PHP, and I have never seen any code using them.

Supporting "<script language="php">" would be a little more tedious to implement, but also doable. However, this style is very uncommon (I have never seen it in any code myself), and many editors don't handle it (or at least not correctly), so I didn't spend time trying to implement it.

If however this is an issue, I could make those work too.

For the record, the old parser completely ignores the PHP mode, and assumes every line is PHP.

## What did the old parser do that this one doesn't?

The old parser used to (try to) report JavaScript functions. The new one doesn't. It would of course be possible to report those too, but it would be quite hard (since it's another language, and outside the PHP mode), and I don't think it makes much sense to have JavaScript functions shown as PHP.

IMHO, if somebody really wants the feature, something more global should be developed, like some ability to parse chunks of a single file with different parsers (e.g. the PHP parser would tell the HTML parser to kick in, and this HTML parser would ask the JavaScript parser to deal with some lines, etc.). This would probably be quite some work to implement, but it would be better than trying to redo a JavaScript/HTML parser inside the PHP one -- and some other languages could perhaps benefit from this too, I don't know.

My point is that I don't think it makes sense for the PHP parser to generate non-PHP tags, so I didn't do anything to try generating those "jsfunction" tags. However, if it is a problem, I could add this too -- but it would never be as good as the JavaScript parser, especially because parsing JS isn't easy.
## Bugs this parser fixes

This new parser fixes at least these PHP-related bugs, feature requests or patches:

* https://sourceforge.net/tracker/?func=detail&aid=3602177&group_id=6556&atid=106556
* https://sourceforge.net/tracker/?func=detail&aid=3220198&group_id=6556&atid=106556
* https://sourceforge.net/tracker/?func=detail&aid=1792989&group_id=6556&atid=356556
* https://sourceforge.net/tracker/?func=detail&aid=960547&group_id=6556&atid=306556
* https://sourceforge.net/tracker/?func=detail&aid=3074659&group_id=6556&atid=306556

# The patch

See the patch item https://sourceforge.net/tracker/?func=detail&aid=3611477&group_id=6556&atid=306556

This is the complete patch against trunk. If you want, I can provide either the individual patches (it took me 22 patches to get here) or the new version of each file; just tell me which you prefer.

This patch can be applied easily with the `patch -p1 < 0000-PHP-parser-rewrite.patch` command (under Unices).

Thanks for reading this long mail :)

Regards,
Colomban

[1] The "tokenizer" was mostly based on the JavaScript parser's one (which is the best part of the JS parser...), so the readToken() function looks really similar.