#32 a slow CQP query...

TODO-4.0
open
nobody
CQP (5)
1
1 day ago
2009-11-17
No

The following query seemed to run rather slowly (on the BNC; someone wanted instances of "done" used for the past tense):

[hw!="be" & hw!="have"] [pos="PNP"] [word="done"]

My guess is that the slowness is due to the gargantuan number of matches for the first element in the string (Stefan might correct me on this!). Which leads me to wonder: is there any way of optimising this kind of thing and speeding it up? Or does this happen already?

Discussion

  • Stefan Evert

    Stefan Evert - 2009-11-17

    Exactly. CQP uses a strict left-to-right matching strategy where index lookups are only performed for the first token. Unfortunately, CQP doesn't have a good built-in query optimiser, and it isn't at all fun to develop one in C. :-(

    In command-line CQP, you can use a multi-step query for such cases, which runs much faster (although [pos="PNP"] still has a lot of matches):

    A = [pos="PNP"] [word="done"];
    set A target nearest [hw = "be|have"] within left 1 word;
    delete A with target;

    We thought about having special-case optimisations for this construction in BNCweb's simple query language, but handling the multi-step queries was just too complex to be feasible.

     
  • Andrew Hardie

    Andrew Hardie - 2009-11-17

    A thought occurs: could we get quite a lot of improvement from some fairly simple rules of thumb that move away from strict left-to-right matching?

    For instance: it could be assumed that a regex with no wildcards of any sort (i.e. just a literal string) will have relatively few matches (not always true, but reasonable as a rule of thumb). Then, when processing a multi-token query, each token pattern could be analysed and rated along lines like this:

    one or more conditions of != with a string with no wildcards: HARD
    one or more conditions of = with a string with no wildcards: EASY
    everything else, or multiple conditions of different kinds: MIDDLING
    (probably other criteria could be added)

    If CQP then started matching with the leftmost EASY token (if any), or failing that the leftmost MIDDLING token (if any), and fell back on HARD tokens only as a last resort, this kind of slowdown could be avoided (a rough sketch of the rating idea follows below). Perhaps this could be implemented by internally (within CQP) doing something akin to the multi-step procedure, with the first step still being a left-to-right matching query.
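
    To make the rating idea concrete, here is a rough sketch in Python (not CQP code; the function names and the crude wildcard test are invented purely for illustration, not how CQP actually represents token patterns):

    import re

    WILDCARDS = re.compile(r"[.*+?\[\]()|{}]")   # crude test for regex metacharacters

    def rate_condition(op, pattern):
        """Rate one attribute test, e.g. hw!="be" or word="done"."""
        literal = WILDCARDS.search(pattern) is None
        if op == "=" and literal:
            return "EASY"        # equality with a literal string: few matches expected
        if op == "!=" and literal:
            return "HARD"        # negation of a literal string: almost everything matches
        return "MIDDLING"

    def rate_token(conditions):
        """Rate a whole token pattern, given its list of (op, pattern) conditions."""
        ratings = {rate_condition(op, pat) for op, pat in conditions}
        if ratings == {"EASY"}:
            return "EASY"
        if ratings == {"HARD"}:
            return "HARD"
        return "MIDDLING"        # mixed or otherwise unclear cases

    # the query from this ticket, token by token
    query = [
        [("!=", "be"), ("!=", "have")],   # [hw!="be" & hw!="have"]
        [("=", "PNP")],                   # [pos="PNP"]
        [("=", "done")],                  # [word="done"]
    ]
    print([rate_token(t) for t in query])   # ['HARD', 'EASY', 'EASY']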

    Sorry if this is all a silly idea; I'm still only starting to get to grips with CQP's internals, and I'm pondering things as they occur to me.

     
  • Andrew Hardie

    Andrew Hardie - 2011-07-31
    • milestone: --> TODO-4.0
     
  • Stefan Evert

    Stefan Evert - 4 days ago

    Optimizing cases such as this one will need a completely different query strategy based on the MU approach. I'm thinking along these lines:

    A = MU(meet [pos="PNP"] [word="done"] 1 1);
    

    which should return the full matches in future rather than just the first position. A second step would then filter according to the negation constraint:

    B = MU(filter A where ![hw = "be|have"] -1 -1);
    

    The syntax for the second step is ad hoc, but I hope it gets across the basic idea. Once we have this kind of query language, it should be feasible to write a pre-processor that recognises such queries and automatically translates them into multiple steps.
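
    To spell out what the two steps boil down to, here is a rough sketch in plain Python (the data structures and names are invented for illustration, not CWB's actual index format): an index-driven lookup for the frequent bigram, followed by a cheap per-match check of the negated constraint at offset -1.

    def meet(pos_index, word_index, pos_value, word_value):
        """Step 1: positions with the given pos tag immediately followed by the given word."""
        left = pos_index[pos_value]              # sorted corpus positions from the index
        right = set(word_index[word_value])
        return [(p, p + 1) for p in left if p + 1 in right]

    def filter_left(matches, hw, forbidden=("be", "have")):
        """Step 2: keep matches whose token at offset -1 is not a form of be/have."""
        return [(s, e) for (s, e) in matches
                if s > 0 and hw[s - 1] not in forbidden]

    # toy corpus: "What have you done ? I done it"
    hw         = ["what", "have", "you", "do", "?", "i", "do", "it"]
    pos_index  = {"PNP": [2, 5]}                 # "you", "I"
    word_index = {"done": [3, 6]}
    matches = meet(pos_index, word_index, "PNP", "done")   # [(2, 3), (5, 6)]
    print(filter_left(matches, hw))                        # [(5, 6)] -> "I done"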

     
  • Andrew Hardie

    Andrew Hardie - 1 day ago

    A couple of years ago we discussed the possibility of ad hoc optimisation for certain sorts of token-level regex query. That is, the query

    [hw!="be" & hw!="have"] [pos="PNP"] [word="done"]

    could be treated as follows:

    (1) compile the DFA from the regex as normal;
    (2) analyse the DFA to identify one or more frequent "easy cases" that do not require the full DFA
    (here, the "easy case" being a phrase with no optionality);
    (3) if an easy case is spotted, apply an optimised algorithm
    (here, the optimisation would be to start with the token with the lowest number of matches - the third in this example - and work from there, rather than running the DFA).

    This might still be worth a try somewhere down the line. But it's certainly not any kind of priority.
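
    To illustrate the "easy case" in (3), here is a rough Python sketch (invented names and roughly BNC-style tags, nothing like CQP's real data structures): for a fixed-length query with no optionality, start from the token whose index has the fewest hits and test the other positions at fixed offsets, instead of simulating the DFA from the left edge.

    def match_fixed_length(preds, index_hits, start_at, corpus_len):
        """preds[i](cpos) tests query position i at corpus position cpos;
        index_hits lists the corpus positions that satisfy position start_at."""
        n = len(preds)
        out = []
        for cpos in index_hits:
            first = cpos - start_at                  # candidate start of the match
            if 0 <= first and first + n <= corpus_len \
                    and all(preds[i](first + i) for i in range(n)):
                out.append((first, first + n - 1))
        return out

    # toy corpus: "What have you done ? I done it"
    word = ["what", "have", "you", "done", "?",   "i",   "done", "it"]
    hw   = ["what", "have", "you", "do",   "?",   "i",   "do",   "it"]
    pos  = ["DTQ",  "VHB",  "PNP", "VDN",  "PUN", "PNP", "VDN",  "PNP"]

    preds = [
        lambda c: hw[c] not in ("be", "have"),   # [hw!="be" & hw!="have"]
        lambda c: pos[c] == "PNP",               # [pos="PNP"]
        lambda c: word[c] == "done",             # [word="done"]
    ]
    done_hits = [i for i, w in enumerate(word) if w == "done"]   # fewest index hits
    print(match_fixed_length(preds, done_hits, start_at=2,
                             corpus_len=len(word)))              # [(4, 6)] -> "? I done"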

     
