Affects LXR since release 1.0.0
HTML "include" processing is defined in generic.conf (since CVS revision 1.35 == release 1.0.0)
The wrong delimiters are used at the start of the regexp that captures the initial "keyword" (usually href=): [\w] instead of (\w). As a result, the 'directive' regexp never matches the initial keyword. This triggers the "pop off first word and requeue the rest" processing, but 'identdef' misuses the delimiters in the same way ([\w] instead of (\w)), so it cannot match the keyword either. Consequently, the keyword is not removed; the full fragment is requeued and rescanned, which causes the loop.
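The difference between the two delimiters can be checked in isolation. The sketch below uses Python's re module rather than Perl, since character classes and capture groups behave the same way in both for these patterns. The sample fragment, the anchoring at the start of the string (re.match) and the BROKEN pattern are assumptions reconstructed from the description above, not text copied from generic.conf.

```python
import re

# Assumed sample fragment; not taken from LXR.
FRAGMENT = 'href="page.html"'

# [\w+=] is a character class: it matches exactly ONE character
# chosen among word characters, '+' and '='.
BROKEN = re.compile(r'[\w+=]()("|\')(.+)("|\')')

# (\w+=) is a capture group: one or more word characters followed by '='.
FIXED = re.compile(r'(\w+=)()("|\')(.+)("|\')')

# re.match only tries the pattern at the start of the string.
broken_match = BROKEN.match(FRAGMENT)
fixed_match = FIXED.match(FRAGMENT)

print(broken_match)           # None: 'h' is consumed, then 'r' is not a quote
print(fixed_match.groups())   # ('href=', '', '"', 'page.html', '"')
```

With the character class, only the first letter of the keyword is consumed, so the quote is never found where the pattern expects it. With the capture group, the whole keyword lands in the first group and the quoted string in the fourth.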
Another bug in the 'directive' regexp is the back-reference used for the closing string delimiter: \g{-3}, which points to the empty spacer, instead of \g{-2}. In any case, a back-reference is overkill here, since the correct delimiter has already been detected by the category splitter. ("|\') works just as well and is backward compatible with older Perl interpreters.
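The effect of pointing the back-reference at the wrong group can be illustrated with a reduced pattern. This is a sketch in Python, whose re module has no relative back-references such as Perl's \g{-N}, so absolute numbers stand in for them. It assumes three groups precede the reference (empty spacer, opening delimiter, string content), in which case \g{-3} corresponds to \1 and \g{-2} to \2. The sample text and the pattern names are illustrative only.

```python
import re

# Assumed sample: a double-quoted string followed by trailing text.
TEXT = '"page.html" rest'

# Groups: 1 = empty spacer, 2 = opening delimiter, 3 = string content.
TO_SPACER = re.compile(r'()("|\')(.+)\1')         # stands in for \g{-3}
TO_DELIM = re.compile(r'()("|\')(.+)\2')          # stands in for \g{-2}
ALTERNATION = re.compile(r'()("|\')(.+)("|\')')   # no back-reference at all

# A reference to the empty spacer matches the empty string anywhere,
# so the greedy (.+) is never forced to stop at the closing quote.
spacer_content = TO_SPACER.match(TEXT).group(3)

# A reference to the opening delimiter forces (.+) to back off
# until the same quote character is found again.
delim_content = TO_DELIM.match(TEXT).group(3)

# The plain alternation also stops at a closing quote.
alt_content = ALTERNATION.match(TEXT).group(3)

print(repr(spacer_content))   # 'page.html" rest'
print(repr(delim_content))    # 'page.html'
print(repr(alt_content))      # 'page.html'
```

The alternation accepts any quote as the closing delimiter, not only the one that opened the string, which is acceptable when an earlier stage has already paired the delimiters.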
Fixes in generic.conf:
line 585: 'identdef' => 'a-zA-Z*=?'
use parentheses, correctly escape \ and add optional =
line 638: 'directive' => '(\w+=)()("|\')(.+)("|\')'
use parentheses, simplify closing delimiter capture
Diff:
Another issue surfaced while fixing this bug: href= may contain a URL such as http://..., where the double slash causes trouble when links are computed. The empty path element between the two slashes never matches, so the path is never shortened and processing loops endlessly.
There is no way to define an exception rule with 'include' => { 'directive' ... }. Consequently, a new parser, HTML.pm, has been written.
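The empty path element is easy to reproduce, and it shows why a single regexp cannot express the exception. The sketch below is in Python and is not the logic of HTML.pm: the functions path_elements and is_external, the SCHEME pattern and the example.org address are all hypothetical, introduced only to illustrate the kind of test an exception rule would need.

```python
import re


def path_elements(href):
    """Split an href into its path elements on '/'."""
    return href.split('/')


# Hypothetical exception rule: a URI scheme followed by '//'
# marks the href as an external address, not a source-tree path.
SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*://')


def is_external(href):
    """Return True when the href starts with 'scheme://'."""
    return SCHEME.match(href) is not None


elements = path_elements('http://example.org/doc/index.html')
print(elements)   # ['http:', '', 'example.org', 'doc', 'index.html']

# An href recognised as external would be left out of link computation.
print(is_external('http://example.org/doc/index.html'))   # True
print(is_external('doc/index.html'))                       # False
```

The second element of the split result is the empty string produced by the double slash. A dedicated parser can test for this case before any path matching starts.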
_linkincludedirs also needs a small adjustment to prevent looping when an href= ending with a slash is processed.
Fixed in CVS