#75 severe matching performance regression

closed-fixed
None
6
2010-08-27
2010-06-02
No

The gate keeper glob pattern matching added on 2007-07-11 can be very slow. In a test suite I use it seems to be O(n*n) or worse in the length of the input buffer. The attached test file spawns an echo command that prints a fixed string, truncated to different lengths. It then uses a regular expression to match the first line of the input.

In expect 5.43.0, this was super-fast (the test prints 0 or 1 ms for each test). Using 5.44.1.15, I get the following output on the same computer:

Trying 80... done (86 ms)
Trying 90... done (194 ms)
Trying 95... done (397 ms)
Trying 100... done (789 ms)
Trying 101... done (1088 ms)
Trying 102... done (1134 ms)
Trying 103... done (1551 ms)
Trying 104... done (1573 ms)
Trying 105... done (1609 ms)
Trying 106... done (1627 ms)
Trying 107... done (2183 ms)
Trying 108... done (2209 ms)
Trying 109... done (2867 ms)
Trying 110... done (2917 ms)
Trying 111... done (2958 ms)
Trying 112... done (2996 ms)
Trying 113... done (3857 ms)
Trying 114... done (3898 ms)
Trying 115... done (5088 ms)

Yes, that is right. More than five seconds to perform a match that expect 5.43 did in 1 millisecond.

The test case is a simplified demonstration of a bug I found in a real-life test suite for the LysKOM server. See http://www.lysator.liu.se/lyskom/lyskom-server/ if you want to see the real-life test suite.

Expect version: 5.44.1.15 (compiled from source on Ubuntu 8.04) and 5.44.1.14 (as included in Ubuntu 10.04).

Discussion

  • Per Cederqvist

    Per Cederqvist - 2010-06-02

    Reproduce the slow matching

     
  • Per Cederqvist

    Per Cederqvist - 2010-06-02

    Running expect with -d indicates that it uses this glob gate:

    MRK:client1: :18 14 6 * * * * * * * * * 5 0 8 0 2 { 0 1 6 5 } 0 *\n

    Things might work better if you ensure that the glob gate never contains more than a single "*". I susepct that this would be a much better gate glob:

    MRK:client1: :18 14 6 *\n

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2010-06-02
    • priority: 5 --> 6
    • assigned_to: nobody --> hobbs
     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2010-06-02

    I believe this is related to the slow performance of the Re2Glob glob matching for multiple * patterns.

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2010-06-02

    BTW, while I do mean to address this in the code, the RE presented is quite suboptimal. It is:

    "^MRK:client1: :18 14 6 \[0-9\]* \[0-9\]* \[0-9\]* \[0-9\]* \[0-9\]* \[0-9\]* \[0-9\]* \[0-9\]* \[0-9\]* 5 0 8 0 2 { 0 1 6 5 } 0 \\\\*\r?\n"

    But what is the point of all the _zero_ or more matches one after the other? It really should be just one * (or RE .*).

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2010-06-02

    Let me adjust that ... it is just allowing for separation between space elements. Still could be optimized, but that level of opt isn't being handled low-level in Re2Glob.

     
  • Alexandre Ferrieux

    Andreas was kind enough to explain the basic mechanism of the 'glob gate' to me on the chat. Now I am scared, because this mechanism implies that even a carefully deterministic RE passed to Expect in -re, will get approximated by a dumbo glob pattern which is much less deterministic, hence will take ages to parse.

    Example: {^[0-9]+ [0-9]+$}, which is fully deterministic, becomes something like {* *}, which is not at all.

    [00:10] * jenglish observing from the sidelines, notes that "Q: is this regular expression deterministic" is approximately as expensive as "Q: please convert this regular expression to an NFA, kthxbye."
    [00:11] * ferrieux wonders why there's an RE2GLOB in the first place, for this very reason.

    Question: since (1) the regexp engine works by compiling the RE into a DFA, (2) DFA runtime is extremely fast, (3) DFA compilation time is not an issue for the -re flag of Expect, and (4) DFA size will be very small for typically hand-determinized regexps... why do we go through this dangerous glob step ?

     
  • Per Cederqvist

    Per Cederqvist - 2010-07-26

    Add test case for this issue.

     
  • Per Cederqvist

    Per Cederqvist - 2010-07-26

    Preferred solution: remove the glob gate functionality.

     
  • Per Cederqvist

    Per Cederqvist - 2010-07-26

    Alternate solution: disable glob gate if the generated glob is too complex.

     
  • Per Cederqvist

    Per Cederqvist - 2010-07-26

    Alternate solution: disable the glob gate unless EXPECT_ENABLE_GLOB_GATE environment variable exists.

     
  • Per Cederqvist

    Per Cederqvist - 2010-07-26

    I have now attached four files:

    - a test case that can be added to the test suite (please check that the number hasn't already been taken by another test when applying the patch).

    - a patch that reverts the glob gate functionality. This is, in my opinion, the best way to resolve this issue.

    - two alternative patches, that keep the glob gate, but only enables it in certain circumstances. One requires that you request the glob gate functionality by setting the EXPECT_ENABLE_GLOB_GATE environment variable (the value is ignored, only the presence of the variable is checked for). The other disables the glob gate if the generated glob contains more than two "*" patterns, which could lead to excessive backtracking.

    Even if the long-term plan is to fix the glob pattern matcher so that it is deterministic when multiple "*" patterns are used, please make an interim bug-fix release with one of the proposed solutions (or something similar) so that I don't have to tell the users of lyskomd that they need to patch their expect to run the test suite.

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2010-08-27
    • status: open --> closed-fixed
     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2010-08-27

    Went with 2-asterisk check in expect 5.45

     

Log in to post a comment.