[Ctags-devel] New PHP parser

Hi,

I've rewritten the PHP parser from scratch as a non-regex parser, so
that it can be more correct and precise: it no longer gets fooled by
comments, and it reports scope and implementation details.

TL;DR is: check out the attached patch at
https://sourceforge.net/tracker/?func=detail&aid=3611477&group_id=6556&atid=306556.

# What's wrong with current parser?

NOTE: I haven't been able to test the PHP parser in trunk, because it
doesn't seem to work at all: it's not even reported in the
--list-parsers output.  Any hint on what's going on would be appreciated.
In any case, I tested the parser in Geany, which has only slight
modifications over the trunk ctags one.

The current parser uses regular expressions.  This is simple, but
limited: it is barely possible to report scope for the tags, and it gets
easily fooled by comments or string literals.

For example, commented-out code in PHP files still generating tags has
led to a lot of bug reports, and was only partially fixed by checking
whether the declaration was at the start of the line.
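
To illustrate the problem, here is a minimal sketch in Python of a
line-by-line regex matcher.  The two patterns are hypothetical, not the
ones the old parser actually uses; they only show why comments and
strings fool this approach, and why anchoring at the start of the line
is just a partial fix.

```python
import re

# Hypothetical patterns for illustration; NOT the old parser's real ones.
NAIVE = re.compile(r"function\s+(\w+)")
ANCHORED = re.compile(r"^\s*function\s+(\w+)")

SAMPLE = """<?php
// function in_line_comment() {}
/*
function in_block_comment() {}
*/
$s = "function in_string() {}";
function real() {}
"""


def regex_tags(pattern, source):
    """Return the names matched line by line, as a regex parser would."""
    tags = []
    for line in source.splitlines():
        match = pattern.search(line)
        if match:
            tags.append(match.group(1))
    return tags


naive_tags = regex_tags(NAIVE, SAMPLE)
anchored_tags = regex_tags(ANCHORED, SAMPLE)
```

The naive pattern tags all four names.  The anchored one drops the line
comment and the string, but still tags the function inside the block
comment, because a single line carries no information about being
inside a comment.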

Also, and although it could have been added, the current parser doesn't
parse namespaces or traits.  It also reports local and global variables
with no distinction, and having hundreds of variable tags may be a
little annoying -- and this isn't really fixable with the regex approach.

# What does the new parser do?

The parser I wrote and propose is a full character-level token-based
parser [1].  It should understand every syntactic element in PHP,
including PHP mode entering/leaving, comments, strings, heredocs,
nowdocs, etc., at least everything I know or could gather from PHP 5.4
docs.  This means that the parser should never get fooled by
whitespace, strings or comments.
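
As a rough idea of the approach, here is a much-simplified sketch in
Python (the actual patch is C and far more complete).  All names here
are made up for the example.  It assumes the whole input is already in
PHP mode, and it ignores heredocs, nowdocs and the fact that "?>" also
ends a line comment.

```python
def skip_noise(src, i):
    """Return the index past a comment or string starting at i, else i."""
    if src.startswith("//", i) or src.startswith("#", i):
        end = src.find("\n", i)
        return len(src) if end == -1 else end + 1
    if src.startswith("/*", i):
        end = src.find("*/", i + 2)
        return len(src) if end == -1 else end + 2
    if src[i] in "'\"":
        quote = src[i]
        i += 1
        while i < len(src) and src[i] != quote:
            # A backslash escapes the next character, so skip both.
            i += 2 if src[i] == "\\" else 1
        return i + 1
    return i


def function_names(src):
    """Collect identifiers that directly follow the 'function' keyword."""
    names = []
    prev_word = None
    i = 0
    while i < len(src):
        j = skip_noise(src, i)
        if j != i:
            i = j
            continue
        ch = src[i]
        if ch.isalpha() or ch == "_":
            start = i
            while i < len(src) and (src[i].isalnum() or src[i] == "_"):
                i += 1
            word = src[start:i]
            if prev_word == "function":
                names.append(word)
            prev_word = word.lower()  # PHP keywords are case-insensitive
        elif ch.isspace():
            i += 1
        else:
            prev_word = None  # any other character breaks the sequence
            i += 1
    return names


sample = """<?php
// function in_line_comment() {}
/* function in_block_comment() {} */
$s = "function in_string() {}";
function real() {}
"""
found = function_names(sample)
```

Because comments and strings are consumed before any keyword is
looked at, only `real` ends up in `found`.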

## What does it support?

Tags are generated for most constructs:

* classes;
* constants (both "const FOO" and "define('FOO', ...)");
* functions/methods;
* interfaces;
* local variables (disabled by default);
* namespaces (both syntaxes);
* traits;
* global (like locals, but in the root scope) and member variables.

For the record, the old parser understood classes, constants, functions,
interfaces and variables (no distinctions between local, global and member).

Also, the new parser reports the following details:

* scope (for everything);
* function/method arguments;
* visibility (private/protected/public/none) of methods and members;
* class, interface and member implementation (abstract or not).

## Non-standard behavior

Currently, this new parser accepts identifiers starting with a digit
(which includes plain numbers).  Those wouldn't be accepted by PHP, but
it was simpler to implement and should never affect valid code.  The
only consequence is that some invalid PHP code could generate tags whose
names start with a digit.

I don't think this is an issue since the parsed code would have to be
invalid anyway, but if it is a big deal I can also fix it (although it
would add more code and more tests for no other reason).
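
To make the trade-off concrete, here is a small Python sketch of the two
behaviors.  The function names are hypothetical and this is not the
code from the patch; it just shows that the strict variant needs one
extra check on the first character.

```python
import string

# PHP identifier characters: ASCII letters, digits, underscore, and
# characters at or above 0x80.
_ASCII_IDENT = set(string.ascii_letters + string.digits + "_")


def is_ident_char(ch):
    return ch in _ASCII_IDENT or ord(ch) >= 0x80


def read_identifier_lenient(src, i):
    """Read identifier characters from i, allowing a leading digit."""
    start = i
    while i < len(src) and is_ident_char(src[i]):
        i += 1
    return src[start:i], i


def read_identifier_strict(src, i):
    """Same as above, but reject a leading digit like PHP itself does."""
    if i < len(src) and src[i].isdigit():
        return "", i
    return read_identifier_lenient(src, i)


lenient = read_identifier_lenient("123abc(", 0)
strict = read_identifier_strict("123abc(", 0)
```

On valid input such as "foo_1 = 2" both variants return the same
result; they only differ on names that PHP would reject anyway.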

## Limitations

PHP supports several constructs to enter or leave PHP mode:

Entering:

* <?php (classic, most common)
* <? (short open tag)
* <?= (short echo tag)
* <script language="php"> (long version)
* <% (ASP tag)

Leaving:

* ?> (classic)
* </script> (long version)
* %> (ASP tag)

Short open tags and ASP tags are not always available with PHP and may
require special settings to be enabled for them to work.  PHP docs
discourage use of anything other than "<?php ?>" and "<?= ?>".

The new parser currently only supports entering with "<?" followed by
anything not "xml" (so this includes "<?php", "<?" and "<?="), and
leaving with "?>".  So, it doesn't support the long "<script
language="php">" or the ASP tags.
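
The rule above can be sketched as follows in Python.  This is only an
illustration with a made-up function name: it searches for "?>" naively,
whereas a real tokenizer has to ignore a "?>" that appears inside a
string literal, and the case-insensitive match on "php" is my
assumption.

```python
def php_chunks(src):
    """Return the text between each supported open tag and the next '?>'."""
    chunks = []
    i = 0
    while True:
        start = src.find("<?", i)
        if start == -1:
            break
        body = start + 2
        if src.startswith("xml", body):
            # "<?xml" is an XML declaration, not PHP: stay in HTML mode.
            i = body
            continue
        if src[body:body + 3].lower() == "php":
            body += 3
        elif src.startswith("=", body):
            body += 1
        end = src.find("?>", body)
        if end == -1:
            # An unterminated PHP block runs to the end of the file.
            chunks.append(src[body:])
            break
        chunks.append(src[body:end])
        i = end + 2
    return chunks


page = '<?xml version="1.0"?>\n<p><?php echo 1; ?></p>\n<?= $x ?>\n<? foo();'
chunks = php_chunks(page)
```

Here the XML declaration is skipped and the three PHP sections are
returned, the last one running to the end of the input.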

It would be easy to add support for ASP tags, I just wasn't sure whether
it would risk falsely entering PHP mode (and then lead to incorrect
output) in some cases, e.g. if files contain a literal "<%" or "%>"
which isn't supposed to be interpreted as PHP.  Also, ASP tags are
rarely used with PHP, and I never saw any code using these.

Supporting "<script language="php">" would be a little more tedious to
implement, but also doable.  However, this style is very uncommon (I
myself never saw it in any code), and many editors don't handle this one
(or at least not correctly), so I didn't spend time trying to implement
this.

If however this is an issue, I could make those work too.

For the record, the old parser completely ignores the PHP mode, and
assumes any line is PHP.

## What did the old parser do that this one doesn't?

The old parser used to (try to) report JavaScript functions.  The new
one doesn't.

It of course would be possible to report those too, but it would be
quite hard (since it's another language, and outside the PHP mode), and
I don't think it makes much sense to have JavaScript functions shown as PHP.

IMHO, if somebody really wants the feature, something more global should
be developed, like some ability to parse chunks of a single file with
different parsers (e.g. the PHP parser would tell the HTML parser to
kick in, and this HTML parser would ask the JavaScript parser to deal
with some lines, etc.).  This would probably be quite some work to
implement, but it would be better than trying to re-do a JavaScript/HTML
parser in the PHP one -- and some other languages could perhaps benefit
from this too, I don't know.

My point is that I don't think it makes sense for the PHP parser to
generate non-PHP tags, so I didn't do anything to try generating those
"jsfunction" tags.

However if it is a problem, I could add this too -- but it wouldn't ever
be as good as the JavaScript parser, especially because parsing JS isn't
easy.

## Bugs this parser fixes

This new parser fixes at least those PHP-related bugs, feature requests
or patches:

* https://sourceforge.net/tracker/?func=detail&aid=3602177&group_id=6556&atid=106556
* https://sourceforge.net/tracker/?func=detail&aid=3220198&group_id=6556&atid=106556
* https://sourceforge.net/tracker/?func=detail&aid=1792989&group_id=6556&atid=356556
* https://sourceforge.net/tracker/?func=detail&aid=960547&group_id=6556&atid=306556
* https://sourceforge.net/tracker/?func=detail&aid=3074659&group_id=6556&atid=306556

# The patch

See the patch item
https://sourceforge.net/tracker/?func=detail&aid=3611477&group_id=6556&atid=306556

This is the complete patch against trunk.  If you want, I can either
provide the individual patches (I did 22 patches to get here), or the
new version of each file, just tell me what you prefer.  This patch can
be applied easily with the `patch -p1 < 0000-PHP-parser-rewrite.patch`
command (under Unices).

Thanks for reading this long mail :)

Regards,
Colomban

[1] The "tokenizer" was mostly based on JavaScript's one (which is the
best part of the JS parser...), so the readToken() function looks really
similar.