#73 Jericho is very slow on some pages

Milestone: General
Status: closed-rejected
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-10-29
Created: 2013-11-06
Creator: Sebastiano Vigna
Private: No

During our crawls we have found pages on which the parser is incredibly slow (almost one minute for a 1,677,728-byte page). The parser appears to be stuck in CharSequenceParseText.indexOf(String,int,int), most likely because of some sort of quadratic behaviour.

One of the pages is attached. To reproduce the problem, run it through the StreamedSourceCopy example.

This is really a problem for us... please help :).

1 Attachment

Discussion

  • Martin Jericho
    2013-11-06

    Hi Sebastiano,

    If you check the log file you will see thousands of errors like this:
    SEVERE: StartTag % at (p112935) not recognised as type 'common server tag' because it has no closing delimiter

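    This is because the source file contains JavaScript string literals holding URI-encoded HTML, in which the start of each end tag "</" appears as "<%2F". By default the parser interprets each of these as the start of a server tag <%...%> and searches to the end of the file for the matching closing delimiter. With thousands of such occurrences, each one triggering a scan of the rest of the file, the total work grows roughly quadratically with the file size.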
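To make the cost concrete, here is a minimal sketch of the effect. It is a simplified model and not Jericho's actual code: the class `ServerTagScanDemo` and its methods are hypothetical names made up for illustration, and the synthetic page is an assumption about what the offending JavaScript looks like. It counts how many characters a naive search examines when every "<%" starts a hunt for "%>" that never succeeds.

```java
class ServerTagScanDemo {

    /**
     * Counts the characters examined if every "<%" triggers a search for
     * the closing "%>" starting right after the opener. When no closing
     * delimiter exists, each search runs to the end of the text.
     * (Hypothetical model, not the library's real implementation.)
     */
    public static long charsExamined(String text) {
        long examined = 0;
        int pos = text.indexOf("<%");
        while (pos != -1) {
            int searchStart = pos + 2;
            int close = text.indexOf("%>", searchStart);
            // No closing delimiter: the search walked to the end of the text.
            int searchEnd = (close == -1) ? text.length() : close + 2;
            examined += searchEnd - searchStart;
            pos = text.indexOf("<%", searchStart);
        }
        return examined;
    }

    /** Builds a synthetic page with 'count' JavaScript literals containing "<%2F". */
    public static String pageWith(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("var s = '<%2Fdiv>';\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Doubling the number of openers roughly quadruples the work.
        for (int n : new int[] {1000, 2000, 4000}) {
            System.out.println(n + " openers -> "
                    + charsExamined(pageWith(n)) + " characters examined");
        }
    }
}
```

In this model, doubling the number of unterminated openers roughly quadruples the characters examined, which matches the quadratic behaviour described in the report.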

    This is a very rare problem because there is normally no reason to URI-encode JavaScript literals containing HTML code. It's a very strange thing to do.

    To solve this problem, just deregister the server tag types, which shouldn't appear in public HTML pages anyway. This could also slightly improve performance in general.

    StartTagType.SERVER_COMMON.deregister();
    StartTagType.SERVER_COMMON_COMMENT.deregister();
    StartTagType.SERVER_COMMON_ESCAPED.deregister();

    Cheers
    Martin

     
  • Martin Jericho
    2014-10-29

    • status: unread --> closed-rejected
     
  • Martin Jericho
    2014-10-29

    Just cleaning up after finding this bug report still open!