This issue was reported by Chris Kohlert, and his patch was committed to the trunk by Jogy in [r4781] and merged into 6.44 in [r4782].
:::C++
#include <owl/private/regexp.h>
#include <regex>
#include <cassert>
#include <limits>
auto main() -> int
{
// With TRegexp, a negated bracket expression matches the implicit null-character at the end of
// the searched string.
//
auto n = size_t{}; // The length of the match.
const auto i = owl::TRegexp{"a[^b]"}.find("a", &n);
assert(i == 0 && n == 2); // Incorrect result before the patch [r4781].
// assert(i == std::numeric_limits<size_t>::max()); // Correct result after the patch [r4781].
// With standard regular expressions, a negated bracket expression will not match the implicit
// null-character at the end of the searched string.
//
const auto found = std::regex_search("a", std::regex{"a[^b]"});
assert(found == false);
return 0;
}
Note: Further issues have since been identified, of which some have been patched. See the discussion below. However, the TRegexp class is deprecated, and the general advise is to port your code to use standard C++ regular expressions instead. See Replacing the Borland C++ Class Libraries.
Commit: [r4781]
Commit: [r4782]
Discussion: 39ed1d9276
Discussion: Some Bugfixes
Discussion: Classes | TRegExValidator example
Wiki: Frequently_Asked_Questions
Wiki: OWLNext_Stable_Releases
Wiki: Replacing_the_Borland_C++_Class_Libraries
Just a note about regular expressions:
TRegexp is a private implementation of the corresponding class in the deprecated Borland C++ Class Libraries. The C++ standard libraries, since C++11, now includes support for regular expressions, and legacy code should ideally be rewritten to use these facilities instead.
See Replacing the Borland C++ Class Libraries.
i am sorry ;( - i cant find the original bug anymore.
I tried to force it with the attached test-function but it doesnt seem to work anymore.
The fix dont hurt, but because the source isnt used anymore anyway, you could remove it if you want.
the next patches i post will come with working testcases :D :D
I will have a look. If I cannot reproduce any issue, or see any purpose of the patch, I will revert the source code. Don't worry about it, we have a source code version control system that makes it easy to revert changes. :-)
I have identified and reproduced the issue. Before the patch, a negated bracket expression matches the implicit null-character at the end of the searched string, which is incorrect. I have updated the ticket description with test code that confirms the issue.
The patch fixes the bug. Good!
Here is a little more information about what caused this issue:
When encountering a bracket expression in the search pattern, e.g.
[ab](which means "match the single characteraorb"), TRegexp builds a 256-entry bit-array and sets entry 97aand 98bto 1. For a negated bracket expression, e.g.[^ab](which means "match any character exceptaandb"), TRegex will do the same — it first builds a bit-array with the bits foraandbset — then flips all the bits in the entire array. This will flip entry 0\0to 1, making the negated bracket expression match the null-character. See "regexp.cpp (sourceforge.net)", function "doccl".The patch explicitly overrides the bit-array test when matching the null-character (end of string), hence circumvents the problem.
Last edit: Vidar Hasfjord 2020-04-10
After further review, it appears the same issue presents itself with the wildcard character
.— it will match the implicit null-character at the end of the searched string, e.g.a.incorrectly matchesa.A fix has been applied in [r4789], and the fix has been merged into 6.44 [r4790].
PS. I suspect this fix will need changes in client code, since search patterns with a trailing
.wildcard are probably common. Let us know if you run into problems.Related
Commit: [r4789]
Commit: [r4790]
Last edit: Vidar Hasfjord 2020-04-10
I've done further testing with wildcard
.and negated bracket expressions[^b]in conjunction with Keene closure*and positive closure+.With Keene closure
*, the results are as expected before and after the patches. However, with positive closure+, the unpatched version of TRegexp will match the implicit null-character at the end of the searched string, and continue to match an arbitrary count beyond the string.The patches fix these bugs also.
Related
Commit: [r4781]
Commit: [r4789]
Last edit: Vidar Hasfjord 2020-04-10
Note that the TRegexp implementation assumes that the character set is 7-bit ASCII. It may not work for strings with characters outside this range. This includes extended 8-bit (narrow/ANSI) character sets as well as 16-bit UTF-16 Unicode.
For example, the implementation reserves the eighth bit of a character to encode the operators in regular expressions (e.g. Keene closure
*). And for bracket expressions ("character classes"), a bit-array of only 256 entries is used.Last edit: Vidar Hasfjord 2020-04-11
I have now updated all my posted test code to be 64-bit compliant, and I have tested that all the tests, and the patches, do work in 64-bit mode and produce the same results as in 32-bit mode.
PS. The enclosed test project for Visual Studio 2019 includes all the test code posted, further extended to be compliant with the UNICODE build mode. For readability, I have not updated the code in my posts here in this discussion thread.
Last edit: Vidar Hasfjord 2020-04-11
Awesome find! Thank you very much for the investigation!!!! very nice