#234 HTML parser loops endlessly on href=

current_cvs
closed-fixed
None
5
2013-08-31
2013-04-17
No

Affects LXR since release 1.0.0
HTML "include" processing is defined in generic.conf (since CVS revision 1.35 == release 1.0.0)
Wrong delimiters are used at the start of the regexp to capture the initial "keyword" (usually href=): [\w] instead of (\w). As a result, the 'directive' regexp never matches the initial keyword, which triggers the "pop off first word and requeue the rest" processing. However, 'identdef' has the same misuse of delimiters, [\w] instead of (\w), so it cannot match the keyword either. Consequently, the keyword is not removed; the full fragment is requeued and rescanned, which causes the loop.
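
The looping mechanism described above can be sketched as a toy model. This is not LXR code: it uses Python's `re` module as a stand-in for Perl, and the function name `requeue_makes_progress` and the never-matching pattern are hypothetical, chosen only to stand for "a keyword pattern that fails to match".

```python
import re

def requeue_makes_progress(fragment, identdef):
    """Toy model of "pop off first word and requeue the rest"."""
    m = identdef.match(fragment)
    # If the keyword pattern does not match, nothing is popped off ...
    rest = fragment[m.end():] if m else fragment
    # ... so the requeued text is identical to the input and the scan repeats.
    return len(rest) < len(fragment)

# Hypothetical stand-in for a broken 'identdef': an empty negative
# lookahead never matches anything.
never_matches = re.compile(r"(?!)")
# The corrected 'identdef' pattern from this ticket.
fixed_identdef = re.compile(r"[a-zA-Z](\w)*=?")

print(requeue_makes_progress('href="index.html"', never_matches))   # False
print(requeue_makes_progress('href="index.html"', fixed_identdef))  # True
```

When the function returns False, every pass requeues the same text, which is the endless loop reported here.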

Another bug in the 'directive' regexp is the back-reference used for the closing string delimiter: \g{-3}, which points to the empty spacer, instead of \g{-2}. A back-reference is overkill anyway, since the correct delimiter has already been detected by the category splitter. ("|\') does the job as well and is backward compatible with older Perl interpreters.

Fixes in generic.conf:
line 585: 'identdef' => '[a-zA-Z](\w)*=?'
    use parentheses, correctly escape \ and add optional =
line 638: 'directive' => '(\w+=)()("|\')(.+)("|\')'
    use parentheses, simplify closing delimiter capture
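
As a quick sanity check of the two corrected patterns, here is a minimal sketch using Python's `re` module as a stand-in for Perl (the constructs involved, `\w`, capture groups and alternation, behave the same way in both). The names `IDENTDEF`, `DIRECTIVE`, `split_directive` and `leading_keyword`, as well as the sample fragments, are illustrative assumptions and do not come from LXR.

```python
import re

# Patterns as they read once Perl has processed the single-quoted strings
# from generic.conf (inside '...' the sequence \' becomes a plain ').
IDENTDEF = re.compile(r"[a-zA-Z](\w)*=?")
DIRECTIVE = re.compile(r"""(\w+=)()("|')(.+)("|')""")

def split_directive(fragment):
    """Return the five captured groups, or None if the fragment does not match."""
    m = DIRECTIVE.match(fragment)
    return m.groups() if m else None

def leading_keyword(fragment):
    """Return the text matched by 'identdef' at the start of the fragment."""
    m = IDENTDEF.match(fragment)
    return m.group(0) if m else None

print(split_directive('href="index.html"'))  # ('href=', '', '"', 'index.html', '"')
print(leading_keyword('href="index.html"'))  # href=
```

Note that the simplified closing group ("|') accepts either quote character regardless of the opening one; as stated above, this relies on the category splitter having already isolated a correctly delimited string.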

Discussion

  • Andre-Littoz

    Andre-Littoz - 2013-04-17
    • status: open --> closed
     
  • Andre-Littoz

    Andre-Littoz - 2013-04-17
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -2,10 +2,11 @@
     HTML "include" processing is defined in generic.conf (since CVS revision 1.35 == release 1.0.0)
     Wrong delimiters are used at start of regexp to capture initial "keyword" (usually href=): [\\w] instead of (\\w). This results in the 'directive' regexp never matching the initial keyword. It triggers the "pop off first word and requeue the rest" processing but 'identdef' has the same misuse of delimiters with [\w] instead of (\\w) preventing it from matching the keyword. Consequently, it is not removed; the full fragment is requeued, rescanned and causes the loop.
    
    +Another bug in the 'directive' regexp is back-reference \g{-3} for the string delimiter, instead of g{-2}, which points to the empty spacer. Anyway, this is overkill since the correct spacer has been detected by the category splitter. ("|\') does as well and is downward compatible with older Perl interpreters.
    +
     Fixes in generic.conf:
     line 585: 'identdef' => '[a-zA-Z](\\w)*=?'
         use parentheses, correctly escape \ and add optional =
    -line 638: 'directive' => '((\\w)+=)()("|\')(.+)(\g{-3})'
    -    use parentheses
    -
    -Fixed in CVS (for release 1.2.0)
    +line 638: 'directive' => '(\\w+=)()("|\')(.+)("|\')'
    +    use parentheses, simplify closing delimiter capture
    +    
    
    • status: closed --> open
     
  • Andre-Littoz

    Andre-Littoz - 2013-04-18

    Another issue popped up when solving this bug: href= may contain an absolute URL like http://..., where the double slash causes trouble when computing links. The empty path element between the two slashes does not match, so the path is never shortened and processing loops endlessly.

    There is no way with 'include' => { 'directive' ... } to define an exception rule. Consequently, a new parser HTML.pm is written.

    _linkincludedirs also needs a small adjustment to prevent looping when an href= ending with a slash is processed.

     
  • Andre-Littoz

    Andre-Littoz - 2013-04-19

    Fixed in CVS

     
  • Andre-Littoz

    Andre-Littoz - 2013-04-19
    • status: open --> closed
     
  • Andre-Littoz

    Andre-Littoz - 2013-08-31
    • status: closed --> closed-fixed
     
