#223 Number highlighting regex is highly recursive

Brad Jorsch

The regular expression used for number highlighting (around line 2154 in geshi-1.0.X/src/geshi.php) is highly recursive. If the text to be highlighted has long stretches without other tokens, it is easy for this to exceed PCRE's recursion limit or to crash if PCRE's recursion limit is not low enough.

This is, fortunately, quite easy to fix for the common case. Instead of "(?!(?:<DOT>|(?>[^\<]))+>)", use "(?!(?:<DOT>|(?>[^\<]+))+>)" and you're good as long as there are not a huge number of "<DOT>" tokens.


  • BenBE

    BenBE - 2013-04-13

    Could you please prepare me a patch for this? Basically finding the regex in the source wouldn't be too hard, yet due to the escaping and the behavioural changes I'd prefer to have a patch that I could apply easily.

    Could you also check if there are other places with regular expressions that could cause this issue?

  • BenBE

    BenBE - 2013-04-13
    • status: open --> open-accepted
    • assigned_to: BenBE
    • Priority: 5 --> 6
  • Brad Jorsch

    Brad Jorsch - 2013-04-15

    Patch attached. You can test it by using a file containing "1 " repeated a very large number of times for a language like C.

    As for other regexes with the issue: Anything that has a + or * quantifier on a group containing an alternation is suspect, if the quantifier can match a large number of times. The "posessive" quantifiers ++ and *+ seem to be fine, and if nothing else you could always use {1,500} and {0,500} instead to place an upper limit on the recursion.

    I see a regex inside get_language_fullname() that looks suspicious, but the language name should be short enough to not break things.

    And I see two instances in a regex near the comment "FIX for symbol highlighting..." in parse_non_string_part(). The first I don't know if it could be overly long or not. The second, (?:" . $this->language_data['SYMBOL_SEARCH'] . ")+, can easily be a problem if someone passes in code to be highlighted that has a very large number of symbols in a row (e.g. a file containing "^" repeated a very large number of times for C). I don't know whether making that "++" would be an appropriate fix or if backtracking is necessary so something like {1,500} would be better.


Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:

No, thanks