pcre_exec doesn't return PCRE_ERROR_BADUTF8 for invalid UTF8
PERL 5 regular expression pattern matching
I found some invalid UTF-8 byte sequences that do not trigger a PCRE_ERROR_BADUTF8 error when they appear in the subject string:
\xF4\x92\x94\x95
\xED\xB2\x94
Tested with versions 6.6 and 7.0.
See table 3-6 of http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf for valid UTF8 chars.
Because of this bug in libpcre and other implementations I tried, I went and implemented my own is_utf8() function (plus unit test). I'd be happy to send you the function or the unit test if you want it.
Logged In: NO
I don't know who set up this buglist for PCRE, but it was not the official maintainer.
This bug is invalid. I'm afraid you have confused "invalid UTF-8" with "unassigned Unicode point". PCRE checks for the former, not for the latter. The strings you list above are valid UTF-8 sequences for unassigned code points. Incidentally, PCRE 7.0 uses Unicode 5.
Logged In: NO
Oh, forgot to add that the above comment came from Philip Hazel.
Logged In: YES
user_id=1144855
Originator: YES
Please check again, I don't think I got this wrong. I am not talking about unassigned code points.
That would only be the case if Unicode 5 changed the definition of a "well-formed UTF-8 byte sequence", but as far as I know new Unicode versions only assign new code points or refine character properties.
I couldn't find appropriate version 5 documentation, so for the rest of this comment I'll assume that the v4 PDF I linked is still valid for v5 (if you can point me to complete version 5 documentation that proves me wrong, go ahead).
According to the table mentioned above:
* The byte sequence \xED\xB2\x94 is invalid because the second byte is not between \x80 and \x9F, which is the range required after a lead byte of \xED.
* The byte sequence \xF4\x92\x94\x95 is invalid because the second byte is not between \x80 and \x8F, which is the range required after a lead byte of \xF4.
It looks like you're only checking whether the second byte is between \x80 and \xBF, but for some lead bytes the allowed range is narrower!
BTW, I came upon this bug because postgres rejects this data, while you accept it. The postgres UTF8 functions are too complicated for my needs, though.
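To make the second-byte rule concrete, here is a minimal sketch of a well-formedness check that follows the lead-byte ranges of Table 3-6. The name is_utf8() and its signature are hypothetical; this is not libpcre code and not necessarily the function mentioned in the original report, just an illustration of the table under the assumption that the v4 definition applies.

```c
#include <stddef.h>

/* Hypothetical helper (not part of libpcre): returns 1 if s[0..len) is
   well-formed UTF-8 according to Table 3-6 of Unicode 4.0, else 0. */
int is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* default range for the 2nd byte */
        size_t n;                           /* number of bytes after the lead */

        if (b <= 0x7F) { i++; continue; }   /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) n = 1;
        else if (b == 0xE0) { n = 2; lo = 0xA0; }  /* rejects overlong forms */
        else if (b >= 0xE1 && b <= 0xEC) n = 2;
        else if (b == 0xED) { n = 2; hi = 0x9F; }  /* rejects U+D800..U+DFFF */
        else if (b >= 0xEE && b <= 0xEF) n = 2;
        else if (b == 0xF0) { n = 3; lo = 0x90; }  /* rejects overlong forms */
        else if (b >= 0xF1 && b <= 0xF3) n = 3;
        else if (b == 0xF4) { n = 3; hi = 0x8F; }  /* rejects > U+10FFFF */
        else return 0;                      /* 0x80..0xC1 and 0xF5..0xFF */

        if (len - i <= n) return 0;         /* sequence is truncated */
        if (s[i + 1] < lo || s[i + 1] > hi) return 0;
        for (size_t k = 2; k <= n; k++)     /* remaining bytes: 0x80..0xBF */
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) return 0;

        i += n + 1;
    }
    return 1;
}
```

With this sketch, both sequences from the report are rejected: decoded naively, \xED\xB2\x94 would give U+DC94 (a surrogate) and \xF4\x92\x94\x95 would give 0x112515 (above U+10FFFF), which is why the table narrows the second-byte range for those lead bytes.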
Logged In: YES
user_id=669310
Originator: NO
Further follow-ups to this bug should be, and have been, directed to the "official" PCRE bugtracker at http://bugs.exim.org/530