Menu

#454 TRegexp pattern match error

6.44
closed
Internal (140)
1
2020-04-29
2020-04-03
No

This issue was reported by Chris Kohlert, and his patch was committed to the trunk by Jogy in [r4781] and merged into 6.44 in [r4782].

:::C++
#include <owl/private/regexp.h>
#include <regex>
#include <cassert>
#include <limits>

auto main() -> int
{
  // With TRegexp, a negated bracket expression matches the implicit null-character at the end of
  // the searched string.
  //
  auto n = size_t{}; // The length of the match.
  const auto i = owl::TRegexp{"a[^b]"}.find("a", &n);
  assert(i == 0 && n == 2); // Incorrect result before the patch [r4781].
  // assert(i == std::numeric_limits<size_t>::max()); // Correct result after the patch [r4781].

  // With standard regular expressions, a negated bracket expression will not match the implicit
  // null-character at the end of the searched string.
  //
  const auto found = std::regex_search("a", std::regex{"a[^b]"});
  assert(found == false);

  return 0;
}

See [discussion:39ed1d9276].

Note: Further issues have since been identified, of which some have been patched. See the discussion below. However, the TRegexp class is deprecated, and the general advise is to port your code to use standard C++ regular expressions instead. See Replacing the Borland C++ Class Libraries.

Related

Commit: [r4781]
Commit: [r4782]
Discussion: 39ed1d9276
Discussion: Some Bugfixes
Discussion: Classes | TRegExValidator example
Wiki: Frequently_Asked_Questions
Wiki: OWLNext_Stable_Releases
Wiki: Replacing_the_Borland_C++_Class_Libraries

Discussion

  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-03

    Just a note about regular expressions:

    TRegexp is a private implementation of the corresponding class in the deprecated Borland C++ Class Libraries. The C++ standard libraries, since C++11, now includes support for regular expressions, and legacy code should ideally be rewritten to use these facilities instead.

    See Replacing the Borland C++ Class Libraries.

     
  • Chris Kohlert

    Chris Kohlert - 2020-04-09

    i am sorry ;( - i cant find the original bug anymore.
    I tried to force it with the attached test-function but it doesnt seem to work anymore.
    The fix dont hurt, but because the source isnt used anymore anyway, you could remove it if you want.
    the next patches i post will come with working testcases :D :D

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-09

    I will have a look. If I cannot reproduce any issue, or see any purpose of the patch, I will revert the source code. Don't worry about it, we have a source code version control system that makes it easy to revert changes. :-)

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-09

    I have identified and reproduced the issue. Before the patch, a negated bracket expression matches the implicit null-character at the end of the searched string, which is incorrect. I have updated the ticket description with test code that confirms the issue.

    The patch fixes the bug. Good!

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-10

    Here is a little more information about what caused this issue:

    When encountering a bracket expression in the search pattern, e.g. [ab] (which means "match the single character a or b"), TRegexp builds a 256-entry bit-array and sets entry 97 a and 98 b to 1. For a negated bracket expression, e.g. [^ab] (which means "match any character except a and b"), TRegex will do the same — it first builds a bit-array with the bits for a and b set — then flips all the bits in the entire array. This will flip entry 0 \0 to 1, making the negated bracket expression match the null-character. See "regexp.cpp (sourceforge.net)", function "doccl".

    The patch explicitly overrides the bit-array test when matching the null-character (end of string), hence circumvents the problem.

     

    Last edit: Vidar Hasfjord 2020-04-10
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-10

    After further review, it appears the same issue presents itself with the wildcard character . — it will match the implicit null-character at the end of the searched string, e.g. a. incorrectly matches a.

    A fix has been applied in [r4789], and the fix has been merged into 6.44 [r4790].

    :::C++
    // With TRegexp, the wildcard character '.' matches the implicit null-character at the end of the
    // searched string.
    //
    auto n = size_t{}; // The length of the match.
    const auto i = owl::TRegexp{"a."}.find("a", &n);
    assert(i == 0 && n == 2); // Incorrect result before the patch [r4789].
    // assert(i == std::numeric_limits<size_t>::max()); // Correct result after the patch [r4789].
    
    // With standard regular expressions, the wildcard character '.' will not match the implicit
    // null-character at the end of the searched string.
    //
    const auto found = std::regex_search("a", std::regex{"a."});
    assert(found == false);
    

    PS. I suspect this fix will need changes in client code, since search patterns with a trailing . wildcard are probably common. Let us know if you run into problems.

     

    Related

    Commit: [r4789]
    Commit: [r4790]


    Last edit: Vidar Hasfjord 2020-04-10
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-10

    I've done further testing with wildcard . and negated bracket expressions [^b] in conjunction with Keene closure * and positive closure +.

    With Keene closure *, the results are as expected before and after the patches. However, with positive closure +, the unpatched version of TRegexp will match the implicit null-character at the end of the searched string, and continue to match an arbitrary count beyond the string.

    The patches fix these bugs also.

    :::C++
    // Test the wildcard character '.' in conjunction with Keene closure '*'.
    {
      // With TRegexp, the expression ".*" correctly does not match the implicit null-character at
      // the end of the searched string.
      //
      auto n = size_t{}; // The length of the match.
      const auto i = owl::TRegexp{"a.*"}.find("a", &n);
      assert(i == 0 && n == 1); // Correct result before and after the patch [r4789].
    
      // With standard regular expressions, the result is the same.
      //
      const auto found = std::regex_search("a", std::regex{"a.*"});
      assert(found == true);
    }
    
    // Test the wildcard character '.' in conjunction with positive closure '+'.
    {
      // With TRegexp, the expression ".+" matches the implicit null-character at the end of the
      // searched string, and continues to match an arbitrary count beyond the string.
      //
      auto n = size_t{}; // The length of the match.
      const auto i = owl::TRegexp{"a.+"}.find("a", &n);
    
      if constexpr (!IsPatched)
        assert(i == 0 && n >= 2); // Incorrect result before the patch [r4789].
      else
        assert(i == std::numeric_limits<size_t>::max()); // Correct result after the patch [r4789].
    
      // With standard regular expressions, the expression ".+" will not match the implicit
      // null-character at the end of the searched string.
      //
      const auto found = std::regex_search("a", std::regex{"a.+"});
      assert(found == false);
    }
    
    // Test negated bracket expressions in conjunction with Keene closure '*'.
    {
      // With TRegexp, the expression "[^b]*" correctly does not match the implicit null-character at
      // the end of the searched string.
      //
      auto n = size_t{}; // The length of the match.
      const auto i = owl::TRegexp{"a[^b]*"}.find("a", &n);
      assert(i == 0 && n == 1); // Correct result before and after the patch [r4781].
    
      // With standard regular expressions, the result is the same.
      //
      const auto found = std::regex_search("a", std::regex{"a[^b]*"});
      assert(found == true);
    }
    
    // Test negated bracket expressions in conjunction with positive closure '+'.
    {
      // With TRegexp, the expression "[^b]+" matches the implicit null-character at the end of
      // the searched string, and continues to match an arbitrary count beyond the string.
      //
      auto n = size_t{}; // The length of the match.
      const auto i = owl::TRegexp{"a[^b]+"}.find("a", &n);
    
      if constexpr (!IsPatched)
        assert(i == 0 && n >= 2); // Incorrect result before the patch [r4781].
      else
        assert(i == std::numeric_limits<size_t>::max()); // Correct result after the patch [r4781].
    
      // With standard regular expressions, the expression "[^b]+" will not match the implicit
      // null-character at the end of the searched string.
      //
      const auto found = std::regex_search("a", std::regex{"a[^b]+"});
      assert(found == false);
    }
    
     

    Related

    Commit: [r4781]
    Commit: [r4789]


    Last edit: Vidar Hasfjord 2020-04-10
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-10

    Note that the TRegexp implementation assumes that the character set is 7-bit ASCII. It may not work for strings with characters outside this range. This includes extended 8-bit (narrow/ANSI) character sets as well as 16-bit UTF-16 Unicode.

    For example, the implementation reserves the eighth bit of a character to encode the operators in regular expressions (e.g. Keene closure *). And for bracket expressions ("character classes"), a bit-array of only 256 entries is used.

    :::C++
    // Test extended character sets.
    {
      auto n = size_t{}; // The length of the match.
      const auto i = owl::TRegexp{"fårep[^o]lse"}.find("fårepølse", &n);
      assert(i == std::numeric_limits<size_t>::max()); // Incorrect.
    
      const auto found = std::regex_search("fårepølse", std::regex{"fårep[^o]lse"});
      assert(found == true); // Correct.
    }
    
     

    Last edit: Vidar Hasfjord 2020-04-11
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-10

    I have now updated all my posted test code to be 64-bit compliant, and I have tested that all the tests, and the patches, do work in 64-bit mode and produce the same results as in 32-bit mode.

    PS. The enclosed test project for Visual Studio 2019 includes all the test code posted, further extended to be compliant with the UNICODE build mode. For readability, I have not updated the code in my posts here in this discussion thread.

     

    Last edit: Vidar Hasfjord 2020-04-11
  • Chris Kohlert

    Chris Kohlert - 2020-04-11

    Awesome find! Thank you very much for the investigation!!!! very nice

     
    👍
    1
  • Vidar Hasfjord

    Vidar Hasfjord - 2020-04-29
    • Status: pending --> closed
     

Log in to post a comment.

MongoDB Logo MongoDB