
#1684 Slow-running filter

Milestone: 3.0.x
Status: pending
Owner: nobody
Labels: None
Priority: 5
Updated: 2016-03-21
Created: 2015-05-17
Creator: Quaraxkad
Private: No

I have a bunch of filters that convert date strings into the proper format of (for example) 2015-05-16. Some of them had been falsely matching numbers in URLs, so I made a change that largely fixes the issue but runs very slowly. Is there any way to tweak this filter to speed it up, or maybe a whole different method...?

Here's the scenario for this particular filter: A date printed like 2015-5-16. I want to change it to 2015-05-16, so this was my original filter (line breaks added for readability):

search: (\d\d\d\d)-(\d-\d\d)(\D)
replace: $1-0$2$3
mod: gis
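
For readers who want to try this outside Privoxy, here is a rough Python sketch of the same substitution. It is only an approximation: Python's `re` module is not PCRE, the replacement is written to match the three capture groups in the search pattern, and the function name `pad_month` and the sample strings are made up for illustration.

```python
import re

# Python rendering of the original filter. The flags mirror the "is" modifiers;
# re.sub already replaces every match, which covers "g".
ORIGINAL = re.compile(r'(\d\d\d\d)-(\d-\d\d)(\D)', re.IGNORECASE | re.DOTALL)

def pad_month(text):
    # Re-emit the year, a zero-padded month-day part, and the trailing non-digit.
    return ORIGINAL.sub(r'\g<1>-0\g<2>\g<3>', text)

print(pad_month('Posted 2015-5-16.'))           # a date in page text
print(pad_month('href="/path/1111-2-33.htm"'))  # a date-like number in a URL
```

Both strings are rewritten, since nothing in the pattern looks at what comes before the four-digit year.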

This works, but it will also match a URL like href="/path/1111-2-33.htm". So to get around that, I figured I'd use a negative lookbehind and came up with this:

search: (?<!(?:src|href)=")([^"]+)(\d\d\d\d)-(\d-\d\d)(\D)
replace: $1$2-0$3$4
mod: gis

This prevents it from matching the above example but still matches other cases of dates outside of href or src parameters.

The problem is that it runs very slowly. Watching the log output I can see it takes approximately 5 seconds just to run that filter on a page that contains one positive match.

It doesn't seem to be specific to Privoxy because it also runs slowly in RegexBuddy (although not as slow). Can anybody else think of a way to speed this up, or another filter that will achieve the same result without false-positive matches?

Discussion

  • Cattleya

    Cattleya - 2015-05-17

    How about using \s instead of a negative lookbehind?
    Search: (\s\d\d\d\d)-(\d-\d\d\D\s)
    Replace: $1-0$2
    Modifier: g
    \s matches only whitespace, not characters like a, b, c, !, @ or #. I think that is enough for your needs.
    The g modifier alone is enough; s is not needed because the pattern does not use a dot (as in .*? or .+?).

    Some tips for tweaking a regex:
    - Make .* non-greedy whenever possible, i.e. write .*?
    - [^X]*? is really useful, where X is a character you don't want to match, such as > or <. It works well for matching HTML.
    - Avoid negative lookbehind if possible.

     

    Last edit: Cattleya 2015-05-17
  • Quaraxkad

    Quaraxkad - 2015-05-18

    I could use \s, and I believe I did in the past. But frequently there are things like Date:2015-05-16. Colon before and a period after, not separated by whitespace.

    Also, I didn't really specifically choose "gis" as modifiers for this regex, I typically use gis on all strings as my default and only remove them or use others if/when needed. I personally think they are better defaults, particularly when matching HTML which is what we do in Privoxy!

     

    Last edit: Quaraxkad 2015-05-18
  • Fabian Keil

    Fabian Keil - 2015-05-21

    The second example filter shouldn't even compile because pcre
    does not support lookbehind assertions with branches of varying
    length below the top-level. For details see the pcrepattern manual page.

    If it did compile, I'm not sure why it should not match
    the date in your example. '([^"]+)' could match "path/"
    in which case the lookbehind assertion would be satisfied.
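
The point about '([^"]+)' can be reproduced with a small Python sketch. One assumption to flag: Python's `re` only accepts fixed-width lookbehinds, so the single assertion is split into two stacked ones here, which should be logically equivalent. The variable names and the sample string are illustrative only.

```python
import re

# Assumption: (?<!(?:src|href)=") is rewritten as two fixed-width lookbehinds
# because Python's re rejects the original form.
SECOND_FILTER = re.compile(
    r'(?<!src=")(?<!href=")([^"]+)(\d\d\d\d)-(\d-\d\d)(\D)',
    re.IGNORECASE | re.DOTALL,
)

url = 'href="/path/1111-2-33.htm"'
match = SECOND_FILTER.search(url)

# The lookbehind only blocks a match that starts right after href=".
# Starting one character later satisfies it, and [^"]+ absorbs the path prefix.
print(match.start(), repr(match.group(1)))
print(SECOND_FILTER.sub(r'\g<1>\g<2>-0\g<3>\g<4>', url))
```

In this sketch the match begins at the "p" of "path/", so the number inside the URL is still rewritten.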

    Usually you can speed up patterns with lookbehind assertions
    by replacing quantifiers like "*" and "+" with hard limits,
    however if the unoptimized pattern doesn't have the intended
    effect, optimizing it for speed is hardly useful.

    For the date normalization I'd probably start with a trivial
    pattern like:
    s@([ >'"\s]\d\d\d\d)-(\d-\d\d\D)@$1-0$2@gs
    and update the allowed characters at the beginning until no
    dates are missed anymore.
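
As a rough illustration of that workflow, the sketch below ports the pattern to Python and then adds a colon to the allowed leading characters. The names `BASE`, `EXTENDED` and `REPLACEMENT` and the sample strings are assumptions for the example, and Python's `re` is not PCRE, so treat it as a way to experiment rather than as the filter itself.

```python
import re

# The suggested pattern: one allowed character must sit directly before the year.
BASE = re.compile(r'''([ >'"\s]\d\d\d\d)-(\d-\d\d\D)''', re.DOTALL)

# Same pattern with ":" added to the leading character class.
EXTENDED = re.compile(r'''([ >'":\s]\d\d\d\d)-(\d-\d\d\D)''', re.DOTALL)

REPLACEMENT = r'\g<1>-0\g<2>'  # equivalent of $1-0$2

link = '<a href="/path/2015-2-10.htm">2015-2-10</a>'
label = 'Date:2015-5-16.'

print(BASE.sub(REPLACEMENT, link))       # only the date after ">" changes
print(BASE.sub(REPLACEMENT, label))      # missed: ":" is not allowed yet
print(EXTENDED.sub(REPLACEMENT, label))  # matched once ":" is allowed
```

One caveat with this approach: because the double quote is in the allowed set, a relative URL that begins with the date, such as href="2015-2-10.htm", would still be rewritten.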

     
  • Quaraxkad

    Quaraxkad - 2015-05-30

    The second example compiled because [^"]+ is outside of the lookbehind, and I did that for the exact reason that you pointed out. PCRE does support "varying" lengths in a lookaround but not unknown/unlimited lengths. So (src|href) is ok inside a lookaround, even though src and href are different lengths, they are both exact known lengths.
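
Lookbehind rules differ between engines, so it is worth checking each candidate pattern in the engine at hand. The sketch below does that for Python's `re`, which is stricter than what is described above for PCRE: it rejects any lookbehind whose width is not a single fixed number, including (src|href). The helper name `compiles` is made up for the example.

```python
import re

def compiles(pattern):
    # Report whether this engine accepts the pattern at all.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of different (but known) lengths inside one lookbehind.
print(compiles(r'(?<!(?:src|href)=")\d'))
# The same condition expressed as two fixed-width lookbehinds.
print(compiles(r'(?<!src=")(?<!href=")\d'))
# Unbounded length inside the lookbehind, as in the .NET-style pattern.
print(compiles(r'(?<!(?:href|src)="[^"]+)\d'))
```

Stacking one fixed-width lookbehind per alternative is a portable way to write the same condition when an engine refuses the combined form.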

    There are actually 35 total filters I have to fix all the various forms of incorrect dates. 24 of them are Jan-Dec and January-December. I was hoping to find a valid lookbehind method that I could apply to them all.

    Now I see what you mean about the second one not working on the example anyway... What happened was that I had it working as a regex in .NET, because that engine does support lookarounds of unknown length.

    For example in .NET, this:
    (?<!(href|src)="[^"]+)\d\d\d\d-\d-\d\d
    Correctly matches only the final date in this string:
    <a href="/path/2015-2-10.htm"><img src="/path/2015-2-10.jpg"/>2015-2-10</a>

    Then I tried to find a way around PCRE limitations by moving the [^"]+ out of the lookbehind, and that's how it ended up there in my sample above. But as you said, it doesn't always work anyway... It seemed to work in the tests I did in RegexBuddy but not the one I posted here.

    I'll try your method and see how that works. I can't think of any specific scenarios where it won't, but my ideal scenario is a regex that specifically excludes anything surrounded (by any distance) in src/href quotes.
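
One way to get the "skip anything inside src/href quotes" behavior without a variable-length lookbehind is to let the pattern match those attributes first and hand them back unchanged. The Python sketch below shows the idea. It relies on a replacement callback, which as far as I know a plain s@...@...@ style substitution does not offer, so it illustrates the technique rather than a drop-in Privoxy filter. The names `PATTERN`, `pad_month` and `normalize_dates` are hypothetical, and the trailing (\D) is swapped for a (?!\d) lookahead so that nothing after the date is consumed.

```python
import re

# The first branch consumes a whole src/href attribute, so the date branch
# never gets a chance to start inside one.
PATTERN = re.compile(
    r'((?:src|href)="[^"]*")'        # group 1: attribute to leave untouched
    r'|(\d\d\d\d)-(\d-\d\d)(?!\d)',  # groups 2 and 3: date with a one-digit month
    re.IGNORECASE,
)

def pad_month(match):
    if match.group(1) is not None:
        return match.group(1)  # inside src/href: return it unchanged
    return f"{match.group(2)}-0{match.group(3)}"

def normalize_dates(html):
    return PATTERN.sub(pad_month, html)

sample = '<a href="/path/2015-2-10.htm"><img src="/path/2015-2-10.jpg"/>2015-2-10</a>'
print(normalize_dates(sample))
```

On the sample string from the .NET example, only the final date (the link text) is changed. Each character is visited once per branch, which avoids the repeated rescanning that a leading [^"]+ causes.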

     

    Last edit: Quaraxkad 2015-05-30
  • Fabian Keil

    Fabian Keil - 2016-03-21
    • status: open --> pending
     
